huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Use a work-stealing algorithm for parallel computing #5886

Open · 1014661165 opened 1 year ago

1014661165 commented 1 year ago

Feature request

When I used the Dataset.map API to process data concurrently, I found that it got slower and slower as it approached completion. I then read the source code of arrow_dataset.py and found that it shards the dataset and uses a multiprocessing pool to execute the shards. Because each process is assigned exactly one shard up front, the slowest shard can drag out the entire program's execution time, especially when processing a huge dataset.
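
For illustration, here is a minimal, self-contained sketch of the behavior described above (this is not the library's actual code; `time.sleep` stands in for uneven per-row processing cost): with exactly one contiguous shard per process, the whole job waits on the slowest shard.

```python
# A minimal sketch of static sharding (not datasets internals):
# one shard per process, so total run time equals the slowest shard.
import time
from multiprocessing import Pool

def process_shard(shard):
    for row in shard:
        # Rows >= 7500 are ~10x slower, simulating uneven real-world cost.
        time.sleep(0.001 if row < 7500 else 0.01)
    return len(shard)

if __name__ == "__main__":
    rows = list(range(10_000))
    num_proc = 4
    shard_size = len(rows) // num_proc
    # Contiguous sharding, one shard per process.
    shards = [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]
    start = time.time()
    with Pool(num_proc) as pool:
        pool.map(process_shard, shards)
    # Three shards finish in ~2.5s each, then everything waits ~25s
    # for the shard that contains all the slow rows.
    print(f"static sharding took {time.time() - start:.1f}s")
```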

Motivation

Use a work-stealing algorithm instead of static sharding for parallel computing, so that idle processes can pick up remaining work instead of waiting on the slowest shard.
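
A hedged sketch of what this could look like (again, not datasets internals; strictly speaking this is a shared work queue that idle workers pull from, which is work sharing rather than per-worker deque stealing, but it achieves the same load-balancing goal):

```python
# A sketch of dynamic scheduling via a shared work queue: each worker
# pulls the next chunk as soon as it finishes one, so a slow chunk no
# longer stalls the whole job.
import time
from multiprocessing import Process, Queue

def worker(task_queue, result_queue):
    # Pull chunks until a sentinel (None) signals that no work is left.
    while True:
        chunk = task_queue.get()
        if chunk is None:
            break
        n = 0
        for row in chunk:
            time.sleep(0.001 if row < 7500 else 0.01)
            n += 1
        result_queue.put(n)

if __name__ == "__main__":
    rows = list(range(10_000))
    num_proc, chunk_size = 4, 250
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    task_queue, result_queue = Queue(), Queue()
    for chunk in chunks:
        task_queue.put(chunk)
    for _ in range(num_proc):
        task_queue.put(None)  # one sentinel per worker
    procs = [Process(target=worker, args=(task_queue, result_queue))
             for _ in range(num_proc)]
    start = time.time()
    for p in procs:
        p.start()
    # Drain results before joining so no worker blocks on a full queue.
    total = sum(result_queue.get() for _ in range(len(chunks)))
    for p in procs:
        p.join()
    print(f"dynamic scheduling processed {total} rows "
          f"in {time.time() - start:.1f}s")
```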

Your contribution

Just an idea.

lhoestq commented 1 year ago

Alternatively, we could set the number of shards to be a multiple of the number of processes (currently they're equal) - this way it will be less likely that one shard ends up significantly slower than all the other ones.
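
To illustrate the suggestion (the `shards_per_proc` factor below is hypothetical, not an actual datasets parameter): with more shards than processes, the pool hands shards to workers as they become idle, so a slow shard delays only itself rather than the whole job.

```python
# A sketch of "more shards than processes" (hypothetical factor, not a
# real datasets parameter): the pool dispatches shards on demand.
import time
from multiprocessing import Pool

def process_shard(shard):
    for row in shard:
        time.sleep(0.001 if row < 7500 else 0.01)
    return len(shard)

if __name__ == "__main__":
    rows = list(range(10_000))
    num_proc = 4
    shards_per_proc = 8                        # the "multiple" in question
    num_shards = num_proc * shards_per_proc    # 32 shards for 4 processes
    shard_size = -(-len(rows) // num_shards)   # ceiling division
    shards = [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]
    start = time.time()
    with Pool(num_proc) as pool:
        # Workers grab a new shard as soon as they finish one, so the slow
        # region only delays the few shards that actually contain it.
        total = sum(pool.imap_unordered(process_shard, shards))
    print(f"{total} rows across {num_shards} shards "
          f"in {time.time() - start:.1f}s")
```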