huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Use a work-stealing algorithm for parallel computing #5886

Open · 1014661165 opened 1 year ago

1014661165 commented 1 year ago

Feature request

When I used the Dataset.map API to process data concurrently, I found that it got slower and slower as it approached completion. I then read the source code of arrow_dataset.py and found that it shards the dataset and uses a multiprocessing pool to execute the shards. Because each process is assigned exactly one shard up front, the slowest shard can drag out the entire program's execution time, especially when processing a huge dataset.
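
For illustration, here is a minimal, self-contained sketch of the behavior described above (this is not the library's actual code; `time.sleep` stands in for uneven per-row processing cost): with exactly one contiguous shard per process, the whole job waits on the slowest shard.

```python
# A minimal sketch of static sharding (not datasets internals):
# one shard per process, so total run time equals the slowest shard.
import time
from multiprocessing import Pool

def process_shard(shard):
    for row in shard:
        # Rows >= 7500 are ~10x slower, simulating uneven real-world cost.
        time.sleep(0.001 if row < 7500 else 0.01)
    return len(shard)

if __name__ == "__main__":
    rows = list(range(10_000))
    num_proc = 4
    shard_size = len(rows) // num_proc
    # Contiguous sharding, one shard per process.
    shards = [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]
    start = time.time()
    with Pool(num_proc) as pool:
        pool.map(process_shard, shards)
    # Three shards finish in ~2.5s each, then everything waits ~25s
    # for the shard that contains all the slow rows.
    print(f"static sharding took {time.time() - start:.1f}s")
```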

Motivation

Use a work-stealing algorithm instead of static sharding for parallel computing, so that idle processes can pick up remaining work instead of waiting on the slowest shard.
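
A hedged sketch of what this could look like (again, not datasets internals; strictly speaking this is a shared work queue that idle workers pull from, which is work sharing rather than per-worker deque stealing, but it achieves the same load-balancing goal):

```python
# A sketch of dynamic scheduling via a shared work queue: each worker
# pulls the next chunk as soon as it finishes one, so a slow chunk no
# longer stalls the whole job.
import time
from multiprocessing import Process, Queue

def worker(task_queue, result_queue):
    # Pull chunks until a sentinel (None) signals that no work is left.
    while True:
        chunk = task_queue.get()
        if chunk is None:
            break
        n = 0
        for row in chunk:
            time.sleep(0.001 if row < 7500 else 0.01)
            n += 1
        result_queue.put(n)

if __name__ == "__main__":
    rows = list(range(10_000))
    num_proc, chunk_size = 4, 250
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    task_queue, result_queue = Queue(), Queue()
    for chunk in chunks:
        task_queue.put(chunk)
    for _ in range(num_proc):
        task_queue.put(None)  # one sentinel per worker
    procs = [Process(target=worker, args=(task_queue, result_queue))
             for _ in range(num_proc)]
    start = time.time()
    for p in procs:
        p.start()
    # Drain results before joining so no worker blocks on a full queue.
    total = sum(result_queue.get() for _ in range(len(chunks)))
    for p in procs:
        p.join()
    print(f"dynamic scheduling processed {total} rows "
          f"in {time.time() - start:.1f}s")
```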

Your contribution

Just an idea.

lhoestq commented 1 year ago

Alternatively, we could set the number of shards to be a multiple of the number of processes (currently they're equal) - this way it will be less likely that one shard ends up significantly slower than all the other ones.
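
To illustrate the suggestion (the `shards_per_proc` factor below is hypothetical, not an actual datasets parameter): with more shards than processes, the pool hands shards to workers as they become idle, so a slow shard delays only itself rather than the whole job.

```python
# A sketch of "more shards than processes" (hypothetical factor, not a
# real datasets parameter): the pool dispatches shards on demand.
import time
from multiprocessing import Pool

def process_shard(shard):
    for row in shard:
        time.sleep(0.001 if row < 7500 else 0.01)
    return len(shard)

if __name__ == "__main__":
    rows = list(range(10_000))
    num_proc = 4
    shards_per_proc = 8                        # the "multiple" in question
    num_shards = num_proc * shards_per_proc    # 32 shards for 4 processes
    shard_size = -(-len(rows) // num_shards)   # ceiling division
    shards = [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]
    start = time.time()
    with Pool(num_proc) as pool:
        # Workers grab a new shard as soon as they finish one, so the slow
        # region only delays the few shards that actually contain it.
        total = sum(pool.imap_unordered(process_shard, shards))
    print(f"{total} rows across {num_shards} shards "
          f"in {time.time() - start:.1f}s")
```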