huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add repeat() for iterable datasets #7192

Open alex-hh opened 1 month ago

alex-hh commented 1 month ago

Feature request

It would be useful to be able to repeat iterable datasets indefinitely, giving the user complete control over when iteration starts and ends.

An IterableDataset.repeat(n) method could do this automatically.

Motivation

This feature was discussed in https://github.com/huggingface/datasets/issues/7147. It would remove the need for the current hack of using interleave_datasets with a probability of 0 to get the same behaviour.

An additional benefit might be simpler use of iterable datasets in a distributed setting. If the user can assume that datasets repeat indefinitely, then issues caused by different devices seeing different numbers of samples (e.g. https://github.com/huggingface/datasets/issues/6437, https://github.com/huggingface/datasets/issues/6594, https://github.com/huggingface/datasets/issues/6623, https://github.com/huggingface/datasets/issues/6719) could potentially be resolved by simply doing:

ids.repeat(None).take(n_samples_per_epoch)
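
To make the intended semantics concrete, here is a minimal sketch in plain Python. It is not the datasets API: repeat and take below are hypothetical stand-in functions, and the uneven per-rank shards are made-up data. It only illustrates how repeating indefinitely and then taking a fixed number of samples would give every device the same sample count.

```python
from itertools import islice


def repeat(make_iter, n=None):
    """Yield items from make_iter() n times, or forever if n is None.

    make_iter is a zero-argument callable returning a fresh iterator,
    so that every pass restarts from the beginning of the data.
    Note: an empty source with n=None would loop forever without yielding.
    """
    count = 0
    while n is None or count < n:
        yield from make_iter()
        count += 1


def take(iterable, k):
    """Yield at most the first k items of iterable."""
    return islice(iterable, k)


# Hypothetical uneven shards: rank 0 holds 3 samples, rank 1 holds 5.
shards = {0: ["a0", "a1", "a2"], 1: ["b0", "b1", "b2", "b3", "b4"]}
n_samples_per_epoch = 4

# Every rank repeats its own shard indefinitely, then takes the same count.
per_rank = {
    rank: list(take(repeat(lambda data=data: iter(data), None), n_samples_per_epoch))
    for rank, data in shards.items()
}
```

In this sketch rank 0 wraps around to its first sample again, while rank 1 stops before exhausting its shard, so both ranks end the epoch after the same number of steps.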

Your contribution

I'm not familiar enough with the codebase to assess how straightforward this would be to implement.

If it might be very straightforward, I could possibly have a go.

alex-hh commented 1 month ago

Perhaps concatenate_datasets can already be used to achieve almost the same effect?

lhoestq commented 1 month ago

concatenate_datasets does the job for a finite number of repetitions, but repeating forever with .repeat() needs new logic in iterable_dataset.py.
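
The distinction can be sketched in plain Python. This is an illustration only and does not reflect how iterable_dataset.py is structured: RepeatedIterable and its num_times parameter are hypothetical names, and itertools.chain stands in for concatenation. A finite repeat is equivalent to concatenating copies, whereas an unbounded repeat cannot be written as a finite list of copies and needs its own loop.

```python
from itertools import chain, islice


class RepeatedIterable:
    """Hypothetical wrapper that re-iterates a re-iterable source.

    num_times=None means repeat forever. An empty source with
    num_times=None would loop forever without yielding anything.
    """

    def __init__(self, source, num_times=None):
        self.source = source
        self.num_times = num_times

    def __iter__(self):
        if self.num_times is None:
            # Unbounded case: no finite concatenation can express this.
            while True:
                yield from self.source
        else:
            # Finite case: same result as concatenating num_times copies.
            for _ in range(self.num_times):
                yield from self.source


source = [1, 2, 3]

# Finite repetition expressed as concatenation of two copies.
finite_by_concat = list(chain.from_iterable([source] * 2))
# The same finite repetition through the wrapper.
finite_by_repeat = list(RepeatedIterable(source, num_times=2))
# The unbounded case must be cut off by the consumer, here with islice.
first_seven_forever = list(islice(RepeatedIterable(source), 7))
```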