Open 1014661165 opened 1 year ago
Alternatively we could set the number of shards to be a factor than the number of processes (current they're equal) - this way it will be less likely to end up with a shard that is significantly slower than all the other ones.
Feature request
when i used Dataset.map api to process data concurrently, i found that it gets slower and slower as it gets closer to completion. Then i read the source code of arrow_dataset.py and found that it shard the dataset and use multiprocessing pool to execute each shard.It may cause the slowest task to drag out the entire program's execution time,especially when processing huge dataset.
Motivation
using work-stealing algorithm instead of sharding and parallel computing to optimize performance.
Your contribution
just an idea.