huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Support of num_workers (multiprocessing) in map for IterableDataset #7193

Open getao opened 1 month ago

getao commented 1 month ago

Feature request

Currently, IterableDataset doesn't support setting num_workers in .map(), which makes processing slow. Could we add support for it? Since .map() can run in batches (e.g., batch_size defaults to 1000 in datasets), this seems as doable for IterableDataset as it is for the regular Dataset.
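
To make the batch-wise idea concrete, here is a minimal standard-library sketch of a lazy, order-preserving parallel map over a stream. It is not how datasets implements .map(), and none of these names exist in the library: `iter_batches`, `parallel_batched_map`, `max_in_flight`, and `square_batch` are hypothetical, chosen only for illustration. It assumes the per-batch function is picklable when a process pool is used.

```python
from collections import deque
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def iter_batches(iterable, batch_size):
    """Yield lists of up to `batch_size` consecutive items from a stream."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch


def parallel_batched_map(iterable, batch_fn, executor, batch_size=1000, max_in_flight=4):
    """Lazily apply `batch_fn` to batches of a stream, preserving order.

    At most `max_in_flight` batches are submitted ahead of the consumer,
    so the stream is never fully materialized in memory.
    """
    pending = deque()
    for batch in iter_batches(iterable, batch_size):
        pending.append(executor.submit(batch_fn, batch))
        if len(pending) >= max_in_flight:
            # Block on the oldest batch so results come out in input order.
            yield from pending.popleft().result()
    # Drain whatever is still running once the stream is exhausted.
    while pending:
        yield from pending.popleft().result()


def square_batch(batch):
    # Hypothetical stand-in for a real preprocessing function.
    return [x * x for x in batch]


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        stream = (i for i in range(10))
        print(list(parallel_batched_map(stream, square_batch, pool, batch_size=3)))
```

The bounded queue of futures is the design choice that matters for a streaming dataset: `Executor.map` would consume the whole input up front, whereas this version only reads as many batches as it is allowed to have in flight.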

Motivation

Improving data processing efficiency

Your contribution

Testing

alex-hh commented 1 month ago

I was curious about the same thing. Since map is applied on the fly, I was assuming that setting num_workers>1 in the DataLoader would effectively run the map in parallel. Have you tried that?