huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add with_rank to Dataset.from_generator #7213

Open muthissar opened 1 month ago

muthissar commented 1 month ago

Feature request

Add with_rank to Dataset.from_generator similar to Dataset.map and Dataset.filter.

Motivation

As with Dataset.map and Dataset.filter, this is useful when creating cache files on multiple GPUs, where the rank can be used to select the GPU ID. For now, the rank can be passed through the gen_kwargs argument; however, the rank is then included when computing the fingerprint.
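
To illustrate the fingerprint concern, here is a minimal sketch using only the standard library. It is a toy model, not the library's actual fingerprinting: toy_fingerprint, gen, and shards are hypothetical names, and the assumption is simply that the cache key is derived from the generator's keyword arguments.

```python
import hashlib
import json


def toy_fingerprint(gen_name, gen_kwargs):
    """Toy stand-in for a cache fingerprint: hash the generator name and kwargs.

    Assumption: this only mimics the idea that kwargs feed the cache key;
    it is not how the datasets library hashes its inputs.
    """
    payload = json.dumps({"gen": gen_name, "kwargs": gen_kwargs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def gen(shards, rank=None):
    """Hypothetical generator; rank could select a device such as f"cuda:{rank}"."""
    for shard in shards:
        yield {"shard": shard, "rank": rank}


shards = ["a", "b"]

# Current workaround: rank travels inside the kwargs, so it is hashed as well
# and every rank ends up with a different fingerprint.
fp_workaround = [
    toy_fingerprint("gen", {"shards": shards, "rank": r}) for r in range(2)
]

# Requested behavior: rank is injected separately and stays out of the hash,
# so the fingerprint is the same regardless of which rank runs the generator.
fp_requested = [toy_fingerprint("gen", {"shards": shards}) for _ in range(2)]

# The generator still receives the rank in both cases.
rows = [row for r in range(2) for row in gen(shards, rank=r)]
```

In this toy model the two entries of fp_workaround differ while the two entries of fp_requested match, which is the behavior a with_rank option would be expected to give.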

Your contribution

Added #7199, which passes the rank based on the job_id set by num_proc.