huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.29k stars 2.7k forks

Add repeat method to datasets #7198

Open alex-hh opened 1 month ago

alex-hh commented 1 month ago

Following up on the discussion in #6623 and #7198, I thought this would be useful for my use case, so I had a go at implementing it.

My main motivation is to be able to call `iterable_dataset.repeat(None).take(samples_per_epoch)`, so that each epoch yields exactly `samples_per_epoch` examples and timeout issues in a distributed training setting are avoided. This would provide a straightforward workaround for several open issues related to this situation: https://github.com/huggingface/datasets/issues/6437, https://github.com/huggingface/datasets/issues/6594, https://github.com/huggingface/datasets/issues/6623, https://github.com/huggingface/datasets/issues/6719.
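To illustrate the intended semantics, here is a minimal pure-Python sketch. It is not the implementation in this PR: the `repeat` and `take` helpers, the `shards` contents, and the two-rank setup are hypothetical stand-ins, and I am assuming the timeouts come from ranks running out of data at different times.

```python
from itertools import islice


def repeat(make_iterator, num_times=None):
    """Hypothetical stand-in for the proposed `repeat`.

    Re-iterates the source `num_times` times, or indefinitely when
    `num_times` is None. `make_iterator` is a zero-argument callable
    returning a fresh iterator over the source.
    """
    count = 0
    while num_times is None or count < num_times:
        yielded = False
        for item in make_iterator():
            yielded = True
            yield item
        if not yielded:
            return  # empty source: stop instead of looping forever
        count += 1


def take(iterable, n):
    """Hypothetical stand-in for `take`: yield at most the first n items."""
    return islice(iterable, n)


# Assumed scenario: two ranks whose shards hold different numbers of examples.
shards = {0: [0, 1, 2, 3, 4], 1: [10, 11, 12]}
samples_per_epoch = 7

epoch = {
    rank: list(take(repeat(lambda data=data: iter(data)), samples_per_epoch))
    for rank, data in shards.items()
}

for rank, samples in epoch.items():
    print(rank, samples)
```

In this sketch both ranks produce the same number of examples per epoch even though their shards differ in size, which is the property I am after.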

@lhoestq let me know if this looks like it's on the right track!