Hello there,
I noticed that there are two ways to combine multiple datasets: either through datasets.concatenate_datasets or datasets.interleave_datasets. However, to my knowledge (please correct me if I am wrong), both approaches require the datasets being combined to have the same features.
I think it would be a great idea to add support for combining multiple datasets that might not follow the same schema (i.e. have different features), for example an image and text dataset. That is why I propose a third function of the datasets.combine module called stack_datasets, which can be used to combine a list of datasets with (potentially) different features. This would look as follows:
>>> from datasets import stack_datasets
>>> image_dataset = ...
>>> next(iter(image_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=555x416 at 0x313E79CD0> }
>>> text_dataset = ...
>>> next(iter(text_dataset))
{'text': "This is a test."}
>>> stacked = stack_datasets(datasets={'i_ds': image_dataset, 't_ds': text_dataset}, stopping_strategy='all_exhausted')
>>> next(iter(stacked))
{
    'i_ds': {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=555x416 at 0x313E79CD0>},
    't_ds': {'text': "This is a test."}
}
Motivation
I motivate this by:
A: The fact that PyTorch offers similar functionality under torch.utils.data.StackDataset (link).
B: In settings where one would like to, e.g., train a Vision-Language model using an image-text dataset, an image dataset, and a text dataset, this functionality would offer a clean and intuitive way to create multimodal datasets. I am aware that the above is also feasible without my proposed function, but I believe this offers a nice approach that aligns with existing functionality and is provided directly within the datasets package.
API
stack_datasets has two arguments: datasets and stopping_strategy.
datasets is a dictionary of type Dict[str, Dataset] or Dict[str, IterableDataset]; mixing the two is not allowed. It maps the names of the datasets (the keys) to the datasets themselves (the values) that should be stacked. Each item returned is a dictionary with one key-value pair per dataset: the keys are the names as provided in the datasets argument, and the values are the respective examples from those datasets.
stopping_strategy works as in interleave_datasets: with first_exhausted, iteration stops as soon as the smallest dataset runs out of examples; with all_exhausted, iteration stops once every dataset has been fully consumed at least once, which means examples from shorter datasets may be visited multiple times.
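To make the two stopping strategies concrete, here is a stdlib-only sketch of the intended semantics over plain iterables (stack_iterables is a hypothetical stand-in, not the proposed implementation, which would operate on Dataset/IterableDataset objects):

```python
from typing import Any, Dict, Iterable, Iterator


def stack_iterables(
    named: Dict[str, Iterable[Any]],
    stopping_strategy: str = "first_exhausted",
) -> Iterator[Dict[str, Any]]:
    """Yield dicts pairing one example from each named iterable."""
    if stopping_strategy == "first_exhausted":
        # Stop as soon as the shortest input runs out -- plain zip semantics.
        names = list(named)
        for values in zip(*(named[n] for n in names)):
            yield dict(zip(names, values))
    elif stopping_strategy == "all_exhausted":
        # Materialize the inputs (fine for a sketch) and wrap shorter ones
        # around until every input has been seen in full at least once,
        # so examples from shorter inputs may repeat.
        items = {name: list(it) for name, it in named.items()}
        longest = max(len(v) for v in items.values())
        for i in range(longest):
            yield {name: v[i % len(v)] for name, v in items.items()}
    else:
        raise ValueError(f"unknown stopping_strategy: {stopping_strategy!r}")
```

With inputs of lengths 3 and 1, first_exhausted yields a single stacked example, while all_exhausted yields three, repeating the single example of the shorter input.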
Docs
I saw that there are several documentation pages and guides on the HuggingFace website that introduce concatenate_datasets and interleave_datasets, for example here. If this request is merged, I would be willing to add the new functionality at the appropriate points in the documentation (if desired).
Tests
I also added some tests to ensure correctness. Some tests I wrote in tests/test_iterable_dataset.py run for both Dataset and IterableDataset, even though tests for Dataset technically do not belong in this script; I found this was a nice way to cover more cases with mostly the same code.
Additional information
I tried to keep the code similar to that of concatenate_datasets and interleave_datasets.
I'm open to feedback and willing to make adjustments based on your suggestions, so feel free to give me your take. :)