datametrician opened 4 years ago
Q: What's to stop you using https://rapids.ai/ with faust at the moment?
I couldn't find any docs on how Faust supports GPUs. Is it GPU aware?
Not directly, but I'm suggesting that you can just use a library in your agents to offload processing as required. The magic of Faust is that it's more just a Python library than a "way-of-life" framework such as Spark or Flink, so it's relatively easy to combine with other libraries.
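To make the "just a Python library" point concrete, here is a minimal sketch of the pattern with the Faust specifics stubbed out: an agent is essentially an async function over a stream of events, so a call into any third-party library (here a placeholder `accelerate()`, standing in for e.g. a RAPIDS call) can simply be dropped into the loop. The names and event shape are illustrative, not Faust's actual API.

```python
import asyncio

def accelerate(value: int) -> int:
    # Placeholder for a call into an external library (e.g. cuDF/RAPIDS);
    # in a real Faust app this would run inside the @app.agent coroutine.
    return value * 2

async def agent(stream):
    # Shaped like a Faust agent: consume events one at a time,
    # offload the heavy lifting to a library, and yield results.
    async for event in stream:
        yield accelerate(event)

async def main():
    async def stream():
        for v in [1, 2, 3]:
            yield v

    return [out async for out in agent(stream())]

print(asyncio.run(main()))  # [2, 4, 6]
```

The point is that nothing in the agent body is Faust-specific, which is why combining Faust with other libraries needs no framework-level integration for the simple offloading case.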
Cool. Dask is very similar. The only reason integration is sometimes necessary is that with RAPIDS.ai it's not about offloading; it's about making GPUs the primary form of computing. That said, we'll kick the tires and see what's possible.
I took a deep dive into what would be needed to practically make this happen and wanted to share my findings here. First, a few things worth mentioning about dataframes on GPUs:
1) RAPIDS is not event based like Faust; it is dataframe (batch) focused rather than event (row) focused. A dataframe could be considered an "event" to fit the Faust paradigm.
2) Getting data to the GPU and keeping it there without lots of movement is important for the big speedups: pass GPU pointers around instead of copying data.
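On point 1, Faust already has a batching primitive, `Stream.take(max_, within=...)`, that lets an agent consume row events in chunks which could then be handed to a dataframe library in one go. A rough, CPU-only stand-in for that buffering behavior (timeout handling omitted for brevity; the `within` parameter is kept only for shape parity and the names are illustrative):

```python
import asyncio

async def take(stream, max_, within):
    # Rough stand-in for Faust's Stream.take(max_, within=...):
    # buffer individual row events into batches so they can be handed
    # to a dataframe library (e.g. cudf.DataFrame(rows)) in one call.
    buf = []
    async for row in stream:
        buf.append(row)
        if len(buf) >= max_:
            yield buf
            buf = []
    if buf:  # flush the remainder (the `within` timeout is not modeled here)
        yield buf

async def main():
    async def rows():
        for i in range(7):
            yield {"id": i, "value": i * 10}

    batches = [b async for b in take(rows(), max_=3, within=1.0)]
    # Each batch would become one GPU dataframe "event".
    return [len(b) for b in batches]

print(asyncio.run(main()))  # [3, 3, 1]
```

This covers the batching half of the problem, but not point 2: the rows still land in host memory first, which is exactly the copy the proposal below tries to avoid.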
With that in mind here are the steps I would consider needed to make this happen.
1) Create a GPU Faust Driver to ingest Kafka messages directly to the GPU. We already have one of these in RAPIDS, so this portion would effectively be a Driver that serves as a manager for cudf_kafka, as we call it.
2) Create a GPU Faust Transport. Same as above, just some small changes to enable using cudf_kafka.
3) Support a Faust 'Table Iterator' that instead iterates over a cuDF dataframe residing on the GPU.
I believe with those 3 steps Faust could be expanded to accept GPU memory pointers as events, which would then be passed to the user-defined @app.agent, where the dataframes could be accessed and used as desired without the overhead of additional copies, while still gaining the huge benefits offered by the Faust framework.
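The zero-copy intent behind step 3 can be illustrated on the CPU with the stdlib `memoryview`: slicing a view hands out windows into the same underlying buffer rather than copies, which is the same property a cuDF-backed 'Table Iterator' would need to preserve for device memory. Everything here is a stdlib analogy, not the proposed API.

```python
# Stdlib analogy for the proposed zero-copy 'Table Iterator':
# memoryview slices share the underlying buffer (no copies), just as a
# cuDF-backed iterator would hand out device pointers rather than copies.
data = bytearray(range(12))   # pretend this is a column buffer on the GPU
view = memoryview(data)

def iter_chunks(buf, size):
    # Yield zero-copy windows over the buffer, one "dataframe" at a time.
    for start in range(0, len(buf), size):
        yield buf[start:start + size]

chunks = list(iter_chunks(view, 4))
print([bytes(c) for c in chunks])

# Mutating the source is visible through every chunk: proof nothing was copied.
data[0] = 255
print(chunks[0][0])  # 255
```

The design question for Faust would be making sure nothing along the Driver → Transport → agent path forces such a view back through host memory.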
Would be really interested to hear others' thoughts on this.
This is more of a feature request. RAPIDS (https://rapids.ai/) provides GPU-accelerated libraries that follow the PyData APIs. We have a streaming library, custreamz (https://medium.com/rapids-ai/gpu-accelerated-stream-processing-with-rapids-f2b725696a61), which does something similar to Faust. Given that Flink recently added GPU support (https://flink.apache.org/news/2020/08/06/external-resource.html), I was wondering if Faust would be willing to do the same so people could use GPUs for stream processing.
For high-throughput problems, GPUs have proven cheaper in production, and they open the door to better deep learning and ML inference on Python streams.