huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Feature proposal: Stacking, potentially heterogeneous, datasets #7279

Open TimCares opened 2 weeks ago

TimCares commented 2 weeks ago

Introduction

Hello there, I noticed that there are two ways to combine multiple datasets: either through datasets.concatenate_datasets or datasets.interleave_datasets. However, to my knowledge (please correct me if I am wrong), both approaches require the combined datasets to have the same features.

I think it would be a great idea to add support for combining multiple datasets that do not follow the same schema (i.e. have different features), for example an image dataset and a text dataset. That is why I propose a third function in the datasets.combine module, stack_datasets, which can be used to combine a list of datasets with (potentially) different features. This would look as follows:

>>> from datasets import stack_datasets
>>> image_dataset = ...
>>> next(iter(image_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=555x416 at 0x313E79CD0> }
>>> text_dataset = ...
>>> next(iter(text_dataset))
{'text': "This is a test."}
>>> stacked = stack_datasets(datasets={'i_ds': image_dataset, 't_ds': text_dataset}, stopping_strategy='all_exhausted')
>>> next(iter(stacked))
{
    'i_ds': {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=555x416 at 0x313E79CD0>},
    't_ds': {'text': "This is a test."}
}


Motivation

My motivation is twofold:

A: PyTorch offers similar functionality under torch.utils.data.StackDataset (link).

B: In settings where one wants to, e.g., train a vision-language model using an image-text dataset, an image dataset, and a text dataset, this functionality offers a clean and intuitive way to create multimodal datasets. The same result is achievable without the proposed function, but this approach aligns with existing functionality and is provided directly within the datasets package.

API

stack_datasets has two arguments: datasets and stopping_strategy.

datasets is a dictionary of type Dict[str, Dataset] or Dict[str, IterableDataset]; mixing the two is not allowed. It maps the names of the datasets (the keys) to the datasets themselves (the values) that should be stacked. Each item returned is a dictionary with one key-value pair per dataset: the keys are the names as provided in the datasets argument, and the values are the respective examples from those datasets.
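To make the map-style (Dict[str, Dataset]) semantics concrete, here is a minimal pure-Python sketch. Plain lists of dicts stand in for Dataset objects, and StackedDataset is a hypothetical name, not part of the library; a real implementation would of course operate on datasets.Dataset.

```python
class StackedDataset:
    """Sketch of map-style stacking: index i returns {name: ds[i]}."""

    def __init__(self, datasets):
        # datasets: dict mapping a name to a map-style dataset (here: a list)
        self.datasets = datasets

    def __len__(self):
        # With first_exhausted-style truncation, the smallest dataset
        # bounds the length of the stacked dataset.
        return min(len(ds) for ds in self.datasets.values())

    def __getitem__(self, i):
        # One key-value pair per stacked dataset, keyed by its given name.
        return {name: ds[i] for name, ds in self.datasets.items()}


i_ds = [{"image": "img_0"}, {"image": "img_1"}, {"image": "img_2"}]
t_ds = [{"text": "a"}, {"text": "b"}]
stacked = StackedDataset({"i_ds": i_ds, "t_ds": t_ds})
print(stacked[0])  # {'i_ds': {'image': 'img_0'}, 't_ds': {'text': 'a'}}
```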

stopping_strategy is the same as for interleave_datasets. If it is first_exhausted, iteration stops once the smallest dataset runs out of examples; if it is all_exhausted, iteration stops once every dataset has been exhausted at least once. With all_exhausted, examples from smaller datasets may therefore be visited multiple times.
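The all_exhausted behavior can be sketched in plain Python as a generator that restarts smaller iterables until every dataset has been fully consumed at least once. The function name is hypothetical and lists stand in for IterableDataset; this only illustrates the proposed semantics, not the library's implementation.

```python
def stack_iterables_all_exhausted(datasets):
    """Yield {name: example} dicts; restart exhausted iterables until every
    dataset has been consumed at least once (all_exhausted semantics)."""
    iterators = {name: iter(ds) for name, ds in datasets.items()}
    exhausted = {name: False for name in datasets}
    while not all(exhausted.values()):
        item = {}
        for name, ds in datasets.items():
            try:
                item[name] = next(iterators[name])
            except StopIteration:
                exhausted[name] = True
                if all(exhausted.values()):
                    return  # every dataset seen at least once: stop
                iterators[name] = iter(ds)  # smaller dataset wraps around
                item[name] = next(iterators[name])
        yield item


out = list(stack_iterables_all_exhausted({"a": [1, 2, 3], "b": ["x", "y"]}))
# Three items: "a" drives the length, "b" wraps around once.
print(out)
```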

Docs

I saw that there are multiple documentation pages and guides on the HuggingFace website that introduce concatenate_datasets and interleave_datasets, for example here. If this request is merged, I would be willing to add the new functionality at the appropriate points in the documentation (if desired).

Tests

I also added some tests to ensure correctness. Some tests I wrote in tests/test_iterable_dataset.py run for both Dataset and IterableDataset, even though tests for Dataset technically do not belong in that file, but I found this was a nice way to cover more cases with mostly the same code.

Additional information

I tried to write the code in a way so that it is similar to that of concatenate_datasets and interleave_datasets. I’m open to feedback and willing to make adjustments based on your suggestions, so feel free to give me your take. :)