Open alex-hh opened 1 week ago
Hi! You can already configure the README.md to have multiple sets of splits, e.g.

```yaml
configs:
- config_name: my_first_set_of_split
  data_files:
  - split: train
    path: "*.csv"
- config_name: my_second_set_of_split
  data_files:
  - split: train
    path: "train-*.csv"
  - split: test
    path: "test-*.csv"
```
Hi - I had something slightly different in mind:
Currently, YAML splits specified like this only let you choose which filenames go into each split. But what if I know which individual training examples I want to put in each split?
I could build split-specific files, but for large datasets with overlapping splits (e.g. multiple sets of splits over the same data) this could duplicate a significant amount of data.
I can see that this could actually be very much intended (i.e. to discourage overlapping splits), but I wondered whether some support for defining splits by individual example identifiers is something that could be considered.
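To make the example-level idea concrete, here is a minimal pure-Python sketch (the rows, identifiers, and split names are all hypothetical) of overlapping splits defined over one shared table by example IDs rather than by file names:

```python
# Hypothetical sketch: one shared set of rows, with overlapping splits
# defined purely by example identifiers (no data duplication on disk).
rows = [
    {"id": "a", "text": "first"},
    {"id": "b", "text": "second"},
    {"id": "c", "text": "third"},
]
split_ids = {
    "train": ["a", "b"],
    "test": ["b", "c"],  # deliberately overlaps "train" on id "b"
}

# Index rows once, then materialize each split by identifier lookup.
by_id = {row["id"]: row for row in rows}
splits = {name: [by_id[i] for i in ids] for name, ids in split_ids.items()}
```

The point of the sketch is that several (even overlapping) sets of splits can be derived from a single copy of the data, which is what the feature request below asks the Hub configs to support natively.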
This is not supported right now :/ Though you can load the data in two steps like this:

```python
from datasets import load_dataset

full_dataset = load_dataset("username/dataset", split="train")
my_first_set_indices = load_dataset("username/dataset", "my_first_set_of_split", split="train")
my_first_set = full_dataset.select(my_first_set_indices["indices"])
```
You can create such a dataset by adapting this code, for example:

```python
from datasets import Dataset, DatasetDict

# upload the full dataset
full_dataset.push_to_hub("username/dataset")

# then upload the indices for each set
DatasetDict({
    "train": Dataset.from_dict({"indices": [0, 1, 2, 3]}),
    "test": Dataset.from_dict({"indices": [4, 5]}),
}).push_to_hub("username/dataset", "my_first_set_of_split")
```
Feature request
As far as I understand, automated construction of splits for Hub datasets is currently based on either file names or directory structure (as described here).
It would be pretty useful to also allow splits to be defined by identifiers of individual examples.
This could be configured like `{"split_name": {"column_name": [column values in split]}}`.
(This in turn requires a unique 'index' column, which could be explicitly supported or simply assumed to be defined appropriately by the user.)
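To illustrate, an identifier-based config in the README YAML might look something like this (purely hypothetical syntax - the `splits`, `id_column`, and `ids` keys do not exist in the current spec):

```yaml
configs:
- config_name: my_identifier_based_splits
  data_files: "*.csv"
  splits:
  - split: train
    id_column: example_id
    ids: [ex0, ex1, ex2]
  - split: test
    id_column: example_id
    ids: [ex2, ex3]
```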
I guess a potential downside would be that shards would end up spanning different splits. Is this something that can be handled somehow? Would this only affect streaming from the Hub?
Motivation
The main motivation would be that all data files could be stored in a single directory, and multiple sets of splits could be generated from the same data. This is often useful for large datasets with multiple distinct sets of splits.
This could all be configured via the README.md YAML configs.
Your contribution
I may be able to contribute if this seems like a good idea.