huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

[interleave_datasets] sample batches from a single source at a time #7122

Open memray opened 3 months ago

memray commented 3 months ago

Feature request

interleave_datasets and RandomlyCyclingMultiSourcesExamplesIterable let us sample individual examples from different sources. Could we also sample batches in a similar manner, so that each batch contains data from a single source only?
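To make the request concrete, here is a minimal pure-Python sketch of the behaviour I have in mind. It does not use any datasets internals: the function name iter_single_source_batches, its parameters, and the stopping rule (run until every source is exhausted, without oversampling, and keep the final partial batch) are all my own assumptions for illustration, not an existing API.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple


def iter_single_source_batches(
    sources: Sequence[Iterable[dict]],
    batch_size: int,
    probabilities: Optional[Sequence[float]] = None,
    seed: int = 0,
) -> Iterator[Tuple[int, List[dict]]]:
    """Yield (source_index, batch) pairs; every batch comes from one source.

    Hypothetical helper, not part of the datasets library.
    Assumes batch_size >= 1 and strictly positive probabilities.
    """
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    weights = list(probabilities) if probabilities is not None else [1.0] * len(sources)
    active = list(range(len(sources)))  # indices of sources that still have data

    while active:
        # Pick the source for the next batch, weighted over the remaining sources.
        idx = rng.choices(active, weights=[weights[i] for i in active], k=1)[0]
        # Pull up to batch_size examples from that source only.
        batch = list(islice(iterators[idx], batch_size))
        if len(batch) < batch_size:
            # A short (or empty) batch means this source is exhausted.
            active.remove(idx)
        if batch:
            yield idx, batch


# Usage: two toy sources whose examples carry a "src" tag.
source_a = [{"src": "a", "i": i} for i in range(5)]
source_b = [{"src": "b", "i": i} for i in range(3)]

for source_index, batch in iter_single_source_batches([source_a, source_b], batch_size=2):
    print(source_index, [example["i"] for example in batch])
```

The source is chosen once per batch instead of once per example, which is the only real difference from example-level interleaving. Other stopping rules (stop at the first exhausted source, or oversample) would need a different loop condition.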

Motivation

Some recent research [1, 2] suggests that source-homogeneous batching, where every batch is drawn from a single source, can be helpful for contrastive learning. Could we add an iterable called RandomlyCyclingMultiSourcesBatchesIterable to support this?

Your contribution

I can contribute a PR, but I am not sure what the best way is to test its correctness and robustness.
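One possible starting point for the tests is to tag every example with its source and assert invariants on the produced batches. The sketch below is only an assumption about what such a check could look like: find_batch_violations and the "src" field are hypothetical names, and the check is independent of how the batches were produced.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def find_batch_violations(
    batches: Iterable[Sequence[dict]],
    batch_size: int,
    expected_counts: Dict[str, int],
    source_key: str = "src",
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical test helper. Checks three invariants:
      1. no batch is empty or larger than batch_size,
      2. every batch contains examples from exactly one source,
      3. each source contributes exactly the expected number of examples.
    """
    problems: List[str] = []
    seen: Counter = Counter()

    for position, batch in enumerate(batches):
        if not 1 <= len(batch) <= batch_size:
            problems.append(f"batch {position}: bad size {len(batch)}")
        sources = {example[source_key] for example in batch}
        if len(sources) > 1:
            problems.append(f"batch {position}: mixed sources {sorted(sources)}")
        seen.update(example[source_key] for example in batch)

    if seen != Counter(expected_counts):
        problems.append(f"example counts {dict(seen)} != expected {expected_counts}")
    return problems


# Usage: one valid batching and one that mixes sources in a batch.
good = [[{"src": "a"}, {"src": "a"}], [{"src": "b"}, {"src": "b"}], [{"src": "a"}]]
bad = [[{"src": "a"}, {"src": "b"}], [{"src": "a"}, {"src": "b"}], [{"src": "a"}]]
expected = {"a": 3, "b": 2}

print(find_batch_violations(good, batch_size=2, expected_counts=expected))
print(find_batch_violations(bad, batch_size=2, expected_counts=expected))
```

For robustness, the same check could be run over edge cases such as an empty source, a source shorter than one batch, and a source whose length is an exact multiple of the batch size, plus a fixed-seed run to confirm the batch order is reproducible.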