How to enable `.map()` pre-processing pipelines to support multi-node parallelism?

huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

https://huggingface.co/docs/datasets

Apache License 2.0

How to enable `.map()` pre-processing pipelines to support multi-node parallelism? #842

Open shangw-nvidia opened 3 years ago

shangw-nvidia commented 3 years ago

Hi,

Currently, multiprocessing can be enabled for the `.map()` stages on a single node. For multi-node training, where more than one node is available, I'm wondering whether the parallel processing could be extended across nodes, instead of having only one node run `.map()` while the other nodes wait for it to finish?

Thanks!

lhoestq commented 3 years ago

Right now multiprocessing only runs on a single node.

However, it's probably possible to extend it to support multiple nodes. Indeed, we're using the `multiprocess` library from the pathos project to do multiprocessing in `datasets`, and pathos is made to support parallelism across several nodes. More info about pathos is available on the pathos repo.

If you're familiar with pathos or if you want to give it a try, it could be a nice addition to the library :)
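In the meantime, a manual workaround can be sketched: each node takes its own slice of the data and runs single-node multiprocessing on that slice only. The snippet below is a minimal, standard-library illustration of that idea and not how `datasets` works internally. `preprocess`, `shard`, `process_on_node`, `node_rank` and `num_nodes` are hypothetical names, and both "nodes" are simulated in one script. In `datasets` itself, `Dataset.shard(num_shards=..., index=...)` provides a comparable per-node split.

```python
from multiprocessing import Pool


def preprocess(example):
    # Hypothetical stand-in for the function that would be passed to .map()
    return {"text": example["text"].lower(), "n_chars": len(example["text"])}


def shard(records, num_shards, index):
    # Strided split: record i is assigned to shard i % num_shards
    return records[index::num_shards]


def process_on_node(records, node_rank, num_nodes, num_proc=2):
    # Each node keeps only its own slice, then parallelizes locally
    local = shard(records, num_nodes, node_rank)
    with Pool(num_proc) as pool:
        return pool.map(preprocess, local)


if __name__ == "__main__":
    records = [{"text": f"Example {i}"} for i in range(10)]
    num_nodes = 2  # assumption: in practice this would come from the launcher
    # Both "nodes" are simulated sequentially here; on a real cluster each
    # node would call process_on_node once with its own rank.
    outputs = [process_on_node(records, rank, num_nodes) for rank in range(num_nodes)]
    # Every record is processed exactly once across all nodes
    assert sum(len(out) for out in outputs) == len(records)
```

With this layout each node would still need to write its results somewhere shared so they can be combined afterwards. That step is left out here because it depends on the cluster setup.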

VictorSanh commented 1 year ago

Curious to hear if anything has changed on that side, or if your suggestions for doing it have changed @lhoestq :)

For our use-case, we are entering the regime where trading a few more instances to save a few days would be nice :)

lhoestq commented 1 year ago

Currently, for multi-node setups we're mostly going towards a nice integration with Dask. But I wouldn't exclude exploring pathos more at some point.