ML - online and offline modes

j7zAhU commented 1 year ago

Hello,

I have been looking into MN to see whether it is appropriate for my use case.

I have microsecond-resolution log data which will be used as an input to an ML classifier. I would like to use the same code when batch processing historical data as I do when the classifier is running live. The event stream system in use is proprietary.

Is MN suitable? Many thanks :)

MainRo commented 1 year ago

Yes, one of the main goals of Maki-Nage is to share as much code as possible between stream and batch processing (and this is how we use it). You may have seen that, for now, the Maki-Nage package focuses mainly on the streaming use case, and more precisely on Kafka. However, the connector API can be used to plug in virtually any source of data.
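
To illustrate the idea of sharing one processing function between the two modes, here is a minimal sketch in plain Python. It deliberately does not use the Maki-Nage or rxsci APIs: `LogEvent`, `extract_features`, `historical_source` and `live_source` are hypothetical names, and a queue stands in for the proprietary event stream.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Iterable, Iterator


@dataclass(frozen=True)
class LogEvent:
    timestamp_us: int  # microsecond timestamp (hypothetical field)
    value: float


def extract_features(events: Iterable[LogEvent]) -> Iterator[dict]:
    """Shared logic: pair each value with the time delta to the previous event."""
    previous = None
    for event in events:
        delta = 0 if previous is None else event.timestamp_us - previous.timestamp_us
        yield {"delta_us": delta, "value": event.value}
        previous = event


def historical_source(rows: list) -> Iterator[LogEvent]:
    """Batch mode: iterate over already stored (timestamp, value) rows."""
    for timestamp_us, value in rows:
        yield LogEvent(timestamp_us, value)


def live_source(queue: Queue) -> Iterator[LogEvent]:
    """Live mode: block on a queue until a None sentinel arrives."""
    while (item := queue.get()) is not None:
        yield item


rows = [(1_000, 0.5), (1_250, 0.7), (1_900, 0.1)]

# Batch run over historical data.
batch_features = list(extract_features(historical_source(rows)))

# Simulated live run: the same events arrive through a queue.
live_queue = Queue()
for timestamp_us, value in rows:
    live_queue.put(LogEvent(timestamp_us, value))
live_queue.put(None)
live_features = list(extract_features(live_source(live_queue)))
```

Because `extract_features` only depends on an iterable of events, the only code that differs between the two modes is the source adapter, which is the role a connector would play.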

That being said, Maki-Nage is still at an early stage, and you should be aware of the following before using it in production:

- The entire codebase is written in Python, so you should consider using PyPy to get the best performance. Whether performance is good enough really depends on your context and expectations, so there is no clear answer on this point.
- We still regularly fix bugs in error management, and we have not yet implemented everything we want in that area. Debugging issues can be cumbersome.

However, I obviously encourage you to give it a try and see whether it fits your needs. We are interested in any feedback. We typically use it for Kafka microservices and Kubeflow pipeline components.

Also, if you are willing to build on the foundation of Maki-Nage, you can write your own application or library directly with rxsci. The advantage is that, for batch processing, you can parallelize your processing via Ray (see rxray). The aim is to integrate Ray seamlessly into Maki-Nage, but we are still far from it.
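
As a rough picture of what parallel batch processing means here, the sketch below splits historical rows into independent partitions and processes them concurrently. It uses the standard library's `ThreadPoolExecutor` as a stand-in, not Ray or rxray, whose APIs are not shown; the partition key and the functions `partition_by_key` and `process_partition` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# (hypothetical partition key, timestamp in microseconds, value)
Row = Tuple[str, int, float]


def partition_by_key(rows: List[Row]) -> Dict[str, List[Row]]:
    """Group rows by key so each partition can be processed independently."""
    partitions: Dict[str, List[Row]] = {}
    for row in rows:
        partitions.setdefault(row[0], []).append(row)
    return partitions


def process_partition(rows: List[Row]) -> float:
    """Per-partition logic; here simply the mean of the values."""
    return sum(value for _, _, value in rows) / len(rows)


rows = [("a", 1, 1.0), ("b", 2, 4.0), ("a", 3, 3.0), ("b", 4, 6.0)]
partitions = partition_by_key(rows)
keys = list(partitions)

# pool.map preserves input order, so results line up with keys.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = pool.map(process_partition, (partitions[k] for k in keys))
    means = dict(zip(keys, results))
```

A thread pool is used only to keep the sketch self-contained; for CPU-bound work in CPython, a process pool or a distributed runtime such as Ray would be the realistic choice.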

If you need an already mature solution, then Apache Beam is undoubtedly one to consider.

j7zAhU commented 1 year ago

Thank you kindly. I will investigate these suggestions.

maki-nage / makinage

ML - online and offline modes #11