huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.28k stars 2.7k forks

Add IterableDataset.shard() #7252

Closed lhoestq closed 3 weeks ago

lhoestq commented 3 weeks ago

This will be useful for distributing a dataset across workers in frameworks other than PyTorch, such as Spark.

I also renamed .n_shards to .num_shards for consistency and kept the old name for backward compatibility. I made a few changes to internal functions for consistency as well (rank, world_size -> num_shards, index).
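One common way to keep an old attribute name working after a rename is a read-only alias property. The sketch below only illustrates that pattern; `ShardedSource` is a hypothetical class, not the library's, and the PR may implement the alias differently.

```python
class ShardedSource:
    """Hypothetical stand-in for a dataset object (not a datasets class)."""

    def __init__(self, num_shards: int):
        self._num_shards = num_shards

    @property
    def num_shards(self) -> int:
        # New, preferred name.
        return self._num_shards

    @property
    def n_shards(self) -> int:
        # Old name kept as an alias so existing callers keep working.
        return self.num_shards


src = ShardedSource(num_shards=4)
print(src.num_shards, src.n_shards)
```

Both names return the same value, so code written against the old name does not need to change right away.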

Breaking change: the new default for contiguous in Dataset.shard() is True. IMO this is not a big deal, since I couldn't find any usage of contiguous=False internally (we always use contiguous=True for map-style datasets since it's more optimized) or in the wild.
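To make the effect of the contiguous default concrete, here is a plain-Python sketch of which row indices each shard would receive in the two modes. `shard_indices` is a hypothetical helper written for illustration; it does not call the library, and the exact split used by Dataset.shard() is assumed to follow the usual "even chunks, extras to the first shards" scheme.

```python
def shard_indices(length, num_shards, index, contiguous=True):
    """Return the row indices assigned to shard `index` (illustrative only)."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    if contiguous:
        # Split into consecutive chunks; the first `mod` shards get one extra row.
        div, mod = divmod(length, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Non-contiguous: take every `num_shards`-th row, starting at `index`.
    return list(range(index, length, num_shards))


for i in range(3):
    print(i, shard_indices(10, 3, i, contiguous=True),
          shard_indices(10, 3, i, contiguous=False))
```

Under these assumptions, code that relied on the old default and needs the strided layout would have to pass contiguous=False explicitly.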

HuggingFaceDocBuilderDev commented 3 weeks ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.