Closed vsoch closed 2 years ago
Hi!
Hello 👋
I'm browsing around to learn about real time machine learning and I stumbled on this repo.
That's good news :)
but does river handle concurrency?
This question is difficult to answer without first talking about scope. The purpose of River is to implement online machine learning algorithms. That's basically it. River doesn't take care whatsoever of deploying a model into production. We believe that the latter should be the responsibility of a different library/tool/software.
So to answer your question per se: no, River doesn't handle concurrency. Concurrency, apart from being an interesting programming model, is a way to minimize runtime when dealing with I/O. But that's just not a concern River has. River isn't aware of where it's being run (e.g. a web server). But there's no reason why you couldn't use River (and any online model for that matter) in a concurrent environment. Creativity is the only limit!
One thing to be aware of is that online predictions can scale and be done in a concurrent fashion. However, you can't distribute the learning side of an online model. That part has to be done by a single worker.
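To make that split concrete, here's a minimal sketch of the pattern in plain Python: any number of threads can serve predictions, while one learner worker drains a queue of labelled samples. The `RunningMeanModel` below is a made-up stand-in that mimics River's `learn_one`/`predict_one` interface; it is not River's actual code.

```python
import threading
import queue

class RunningMeanModel:
    """Hypothetical stand-in for a River estimator: predicts the running
    mean of the targets it has seen. Reads are cheap; writes are guarded."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.lock = threading.Lock()  # guards mutation during learning

    def predict_one(self, x):
        # Predictions only read state, so many threads can call this at once.
        return self.mean

    def learn_one(self, x, y):
        with self.lock:  # only the single learner worker should get here
            self.n += 1
            self.mean += (y - self.mean) / self.n

model = RunningMeanModel()
samples = queue.Queue()

def learner():
    # The one and only learning worker: drains the queue sequentially.
    while True:
        item = samples.get()
        if item is None:  # sentinel to stop the worker
            break
        x, y = item
        model.learn_one(x, y)

worker = threading.Thread(target=learner)
worker.start()

# Any number of request handlers can enqueue labelled data like this.
for y in [1.0, 2.0, 3.0]:
    samples.put(({}, y))

samples.put(None)
worker.join()
print(model.predict_one({}))  # running mean of 1, 2, 3 -> 2.0
```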
I'm trying to figure out what best practices are for making a production server with the main purpose of receiving new data and updating a model.
A little while ago, back when River was called "creme", I wrote Chantilly. This is a bare-bones Flask server for deploying online models. It's definitely not enterprise-grade, but some basic ideas are there.
There are a few startups and mature companies that are entering the realm of real-time machine learning and working on such platforms. I actually participated in one of these, but in the end it didn't work out. One of my big next goals is to revamp Chantilly entirely and publish a new tool that answers 80% of use cases.
Feel free to ask more questions! This kind of interest in online learning model deployment surfaces from time to time.
One thing to be aware of is that online predictions can scale and be done in a concurrent fashion. However, you can't distribute the learning side of an online model. That part has to be done by a single worker.
This is what I suspected! My quick and dirty "how should I do this" was to keep a queue of new data during the day, and have a batch job run at night to further train the model. Maybe that's not such a terrible idea after all! I think I just assumed that some company / effort was out there to figure out how to, perhaps, have an entire model in a database and have training more in real time.
A little while ago, back when River was called "creme", I wrote Chantilly. This is a bare-bones Flask server for deploying online models. It's definitely not enterprise-grade, but some basic ideas are there.
I will check this out! I am wanting to do something similar with Django, and probably not enterprise grade either, it's mostly for fun at the moment :)
There are a few startups and mature companies that are entering the realm of real-time machine learning and working on such platforms.
Off the top of your head, could you share a few?
One of my big next goals is to revamp Chantilly entirely and publish a new tool that answers 80% of use cases
Is this something you want to / can publicly talk about what you have in mind? Perhaps I could help! I think for my first attempt I'm going to try and combine a vector database (I found this one last night) https://milvus.io/ and see if I can represent a Doc2Vec model as a database table and work with vectors from the database instead. It's just a pickle so (famous last words?) maybe it won't be such a weird idea? :laughing:
This is what I suspected! My quick and dirty "how should I do this" was to keep a queue of new data during the day, and have a batch job run at night to further train the model. Maybe that's not such a terrible idea after all!
KISS for the win! An alternative would be to train every time a new labelled sample is available, but my feeling is that's overkill for most people. Incrementally training with a mini-batch of data periodically is the way to go.
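As a rough sketch of the "buffer by day, train by night" idea, assuming a model with a River-style `learn_one` method (the `CountModel` below is a made-up stand-in, not a real River estimator):

```python
from collections import deque

class CountModel:
    """Hypothetical stand-in for an online model; just counts samples."""

    def __init__(self):
        self.seen = 0

    def learn_one(self, x, y):
        self.seen += 1

buffer = deque()  # filled by the serving path during the day

def on_new_sample(x, y):
    # Cheap: no learning on the hot path, just append to the buffer.
    buffer.append((x, y))

def nightly_job(model):
    # Drain the buffer and update the model incrementally, one sample
    # at a time; order is preserved, which matters for online learners.
    while buffer:
        x, y = buffer.popleft()
        model.learn_one(x, y)

model = CountModel()
for i in range(5):
    on_new_sample({"i": i}, i % 2)
nightly_job(model)
print(model.seen)  # -> 5
```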
I think I just assumed that some company / effort was out there to figure out how to, perhaps, have an entire model in a database and have training more in real time.
There are things like HOGWILD! and federated learning, but those are too researchy and not production-grade. For now, learning has to be considered a bottleneck that is hard to distribute.
Off the top of your head, could you share a few?
Actually, I don't know of any publicly available companies building platforms for real-time learning. I did a stint at the stealth startup on the topic that Chip Huyen started, so I got to see the lay of the land. One thing I do know is that the company that starts with Ama and finishes with Zon is working on the topic. But apparently they're getting bogged down by politics and whatnot.
Is this something you want to / can publicly talk about what you have in mind?
Yep! As it turns out, I'm currently preparing a talk which I'm going to be giving at several venues over the next few months. The talk will be called "Online machine learning in practice". After presenting River and online learning, I'll go deeper and discuss the requirements for doing online machine learning in production.
Perhaps I could help!
So hopefully this talk will 1) help me confirm my ideas 2) allow me to discuss these ideas with people. After that I would like to start writing some code. I believe there are already some good principles in Chantilly. But the devil is in the detail. There are many questions to answer, such as how to A/B test models, monitor performance, etc. But overall I see a way forward!
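For what it's worth, one common answer to the A/B testing question is deterministic hash-based routing, so a given user always hits the same model variant. A hedged sketch (the function and names are illustrative, not anything from Chantilly):

```python
import hashlib

def pick_model(request_id, models, split=0.5):
    """Route a request to model "A" or "B" by hashing its ID.

    Hashing makes the assignment deterministic: the same ID always
    lands in the same bucket, so each user sees a consistent model.
    """
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    bucket = (h % 1000) / 1000  # roughly uniform in [0, 1)
    return models["A"] if bucket < split else models["B"]

models = {"A": "model_a", "B": "model_b"}
choice = pick_model("user-42", models)
# Deterministic: repeated calls with the same ID give the same model.
assert choice == pick_model("user-42", models)
```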
I'm going to try and combine a vector database (I found this one last night) https://milvus.io/ and see if I can represent a Doc2Vec model as a database table and work with vectors from the database instead. It's just a pickle so (famous last words?) maybe it won't be such a weird idea? 😆
That sounds like a great idea. I think your example hints at the variety of use cases people may have. Here you want to tap into Milvus. A deployment tool should give you the flexibility to do just that. You should feel comfortable enough to deploy any kind of model. At the end of the day, once we nail down the technical requirements, it's mostly a question of good design.
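A toy version of the "model as a database table" idea, using the standard library's `sqlite3` as a stand-in for a real vector database like Milvus (the schema and helper functions here are made up for illustration):

```python
import sqlite3
import struct

# One row per token/document vector, instead of pickling the whole model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vectors (token TEXT PRIMARY KEY, vec BLOB)")

def save_vector(token, vec):
    # Pack the vector as float32s into a BLOB column.
    blob = struct.pack(f"{len(vec)}f", *vec)
    conn.execute(
        "INSERT OR REPLACE INTO vectors VALUES (?, ?)", (token, blob)
    )

def load_vector(token):
    row = conn.execute(
        "SELECT vec FROM vectors WHERE token = ?", (token,)
    ).fetchone()
    if row is None:
        return None
    blob = row[0]
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

save_vector("river", [0.1, 0.2, 0.3])
print(load_vector("river"))  # approximately [0.1, 0.2, 0.3] (float32 rounding)
```

Updating a single vector then becomes an `INSERT OR REPLACE` on one row rather than rewriting a pickle, which is the appeal of the table-per-model layout.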
With hindsight, I would say that the reason this tool hasn't been built yet is because it's at the frontier of what we're used to. Online learning is already a new topic for most people, so imagine how niche deploying an online model is! I've been doing online learning for roughly 3 years, and I still don't feel 100% at ease with these topics. The more pairs of eyes there are on the topic, the better.
Sorry for the rambling, I hope this discussion is helpful to you!
It's extremely helpful! And thank you so much for all the details. I'm good to close the issue. One more question: will your talk be at a venue that I could attend, or be shared afterwards (e.g., YouTube or similar)?
The first talk is on February the 9th at PyData PDX. Let me know if you can't attend the event. I'm not sure which of the events will record the talk yet.
Awesome! I will try to attend, and post back here if I can't make it (it's 6:30pm-8:30pm for me, so I'll watch with or right after dinner into the evening).
Thank you so much, looking forward to that next week! :partying_face:
Thank you for the talk - it was great! I can't believe you were functioning so late / early in the morning the next day :laughing:
I've shared your slides with my lab, and I've also been a bit loud about the idea of online machine learning. The lab has a project labeled "online ML", but I believe it is actually doing batch learning, so I'm trying to educate people about that. https://github.com/LLNL/apollo
I'm going to be exploring river probably in the coming weeks, months - will post an issue if I have questions or things to share! Thanks again!
I can't believe you were functioning so late / early in the morning the next day 😆
Me neither haha, I won't be doing that again soon!
I've shared your slides with my lab, and I've also been a bit loud about the idea of online machine learning. The lab has a project labeled "online ML", but I believe it is actually doing batch learning, so I'm trying to educate people about that. https://github.com/LLNL/apollo
Interesting! I'll try to dig a bit when I have some time.
I'm going to be exploring river probably in the coming weeks, months - will post an issue if I have questions or things to share! Thanks again!
Please do :). Godspeed!
Hi! I'm browsing around to learn about real time machine learning and I stumbled on this repo. You mention using dicts for data, and also have an example of a Flask app to update a model (https://riverml.xyz/latest/user-guide/reading-data/), but does river handle concurrency? E.g., I'm trying to figure out what best practices are for making a production server with the main purpose of receiving new data and updating a model. The logical thing is to keep some queue of new entries and run a batch update overnight, but I'm hoping there are more sophisticated "real time" methods that somehow map the model to the database. Thanks!