huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Support Ray as executor #62

Open c21 opened 7 months ago

c21 commented 7 months ago

Ray (https://github.com/ray-project/ray) has become a popular choice for running distributed Python ML applications. Its Python interface makes it easy to scale a workload from a local laptop to a distributed cluster. It would be good to add Ray as an executor backend (and we are happy to contribute).

Some more info related to this topic:

guipenedo commented 7 months ago

Thank you for your suggestion, and especially for being willing to contribute! I thought Ray would be more of an alternative (especially with Ray Data) rather than an environment where datatrove could run, but if you think you could add support as an executor (without making any changes to anything in the pipeline module), I'd be happy to have Ray support in datatrove :)
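
To make the "executor only, no pipeline changes" idea concrete, here is a rough sketch of how per-rank work might be fanned out to Ray. Everything in it is an assumption for illustration: `run_rank`, `run_with_ray`, and the toy steps `read_numbers` and `keep_even` are hypothetical stand-ins, not datatrove's actual classes. The only Ray calls used are `ray.init`, `ray.remote`, and `ray.get`, and Ray is imported lazily so the rest of the sketch runs without it.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical step signature: (data, rank, world_size) -> iterable of items.
Step = Callable[[Optional[Iterable[int]], int, int], Iterable[int]]


def read_numbers(data, rank, world_size):
    """Toy reader: each rank takes its own shard of the numbers 0..9."""
    return (n for n in range(10) if n % world_size == rank)


def keep_even(data, rank, world_size):
    """Toy filter: keep only even numbers from the upstream step."""
    return (n for n in data if n % 2 == 0)


def run_rank(steps: List[Step], rank: int, world_size: int) -> List[int]:
    """Chain the steps for a single rank and collect the output.

    The steps are used unchanged; only the scheduling lives outside them.
    """
    data = None
    for step in steps:
        data = step(data, rank, world_size)
    return list(data) if data is not None else []


def run_with_ray(steps: List[Step], tasks: int) -> List[List[int]]:
    """Submit one Ray task per rank and wait for all of them (assumes Ray is installed)."""
    import ray  # lazy import: only needed when this backend is selected

    ray.init(ignore_reinit_error=True)
    remote_run = ray.remote(run_rank)  # wrap the plain function as a Ray task
    futures = [remote_run.remote(steps, rank, tasks) for rank in range(tasks)]
    return ray.get(futures)  # results come back in rank order
```

With Ray installed, `run_with_ray([read_numbers, keep_even], tasks=2)` would schedule two tasks, one per rank. A real executor would also need to handle logging, completion tracking, and resource requests, which this sketch leaves out.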

simplew2011 commented 6 months ago

A similar tool, Data-Juicer, has developed a Ray executor: https://github.com/alibaba/data-juicer/blob/main/data_juicer/core/ray_executor.py