Open alex-hh opened 1 month ago
Perhaps concatenate_datasets can already be used to achieve almost the same effect?
concatenate_datasets does the job when there is a finite number of repetitions, but to repeat forever with .repeat() we need new logic in iterable_dataset.py.
Feature request
It would be useful to be able to straightforwardly repeat iterable datasets indefinitely, giving the user complete control over when iteration starts and ends.
An IterableDataset.repeat(n) function could do this automatically.
Motivation
This feature was discussed in https://github.com/huggingface/datasets/issues/7147. It would remove the need for the current hack of calling interleave_datasets with a probability of 0 as a simple way to achieve this functionality.
An additional benefit might be simpler use of iterable datasets in a distributed setting: if the user can assume that datasets repeat indefinitely, then issues around different numbers of samples appearing on different devices (e.g. https://github.com/huggingface/datasets/issues/6437, https://github.com/huggingface/datasets/issues/6594, https://github.com/huggingface/datasets/issues/6623, https://github.com/huggingface/datasets/issues/6719) could potentially be resolved simply by doing:
ids.repeat(None).take(n_samples_per_epoch)
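To make the intended semantics concrete, here is a minimal pure-Python sketch of what repeat-then-take would do. It does not use the datasets library at all: the `repeat` and `take` helpers, the `make_iter` factory argument, and the two toy shards are hypothetical stand-ins for illustration, and `n=None` is assumed to mean "repeat forever", as in the one-liner above.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def repeat(make_iter: Callable[[], Iterable[T]], n: Optional[int] = None) -> Iterator[T]:
    """Yield the samples of a re-creatable stream n times, or forever if n is None.

    Hypothetical helper: make_iter is called once per pass to get a fresh iterable.
    """
    passes = 0
    while n is None or passes < n:
        yielded = False
        for sample in make_iter():
            yielded = True
            yield sample
        if not yielded:
            return  # empty source: stop instead of looping forever
        passes += 1


def take(stream: Iterable[T], k: int) -> Iterator[T]:
    """Yield at most the first k samples of a (possibly infinite) stream."""
    return islice(stream, k)


# Two toy shards of unequal length, standing in for what two devices would see.
shard_rank0 = lambda: range(3)        # 3 samples
shard_rank1 = lambda: range(10, 15)   # 5 samples

n_samples_per_epoch = 4
epoch_rank0 = list(take(repeat(shard_rank0), n_samples_per_epoch))
epoch_rank1 = list(take(repeat(shard_rank1), n_samples_per_epoch))

print(epoch_rank0)  # [0, 1, 2, 0]
print(epoch_rank1)  # [10, 11, 12, 13]
```

Both ranks end up with exactly n_samples_per_epoch samples even though the underlying shards differ in length, which is the property that would sidestep the uneven-sample issues linked above. The shorter shard simply wraps around and revisits its first sample.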
Your contribution
I'm not familiar enough with the codebase to assess how straightforward this would be to implement.
If it might be very straightforward, I could possibly have a go.