huggingface / datasets


Support for identifier-based automated split construction #7287

Open alex-hh opened 1 week ago

alex-hh commented 1 week ago

Feature request

As far as I understand, automated construction of splits for Hub datasets is currently based on either file names or directory structure (as described here).

It would seem pretty useful to also allow splits to be defined by identifiers of individual examples.

This could be configured like {"split_name": {"column_name": [column values in split]}} (a hypothetical YAML sketch follows below).

(This in turn requires a unique 'index' column, which could be explicitly supported or simply assumed to be defined appropriately by the user.)
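
For concreteness, a hypothetical README YAML shape for this. The splits key and its column_name/values fields are invented for illustration and are not part of the current spec:

configs:
- config_name: my_identifier_based_splits
  data_files: "*.parquet"
  splits:                     # hypothetical key, not in the current spec
  - split: train
    column_name: id           # unique identifier column, assumed to exist
    values: [id-001, id-002, id-003]
  - split: test
    column_name: id
    values: [id-004]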

I guess a potential downside would be that shards would end up spanning different splits - is this something that can be handled somehow? Would this only affect streaming from the Hub?

Motivation

The main motivation would be that all data files could be stored in a single directory, and multiple sets of splits could be generated from the same data. This is often useful for large datasets with multiple distinct sets of splits.

This could all be configured via the README.md YAML configs.

Your contribution

I may be able to contribute if this seems like a good idea.

lhoestq commented 3 days ago

Hi ! You can already configure the README.md to have multiple sets of splits, e.g.

configs:
- config_name: my_first_set_of_split
  data_files:
  - split: train
    path: "*.csv"
- config_name: my_second_set_of_split
  data_files:
  - split: train
    path: train-*.csv
  - split: test
    path: test-*.csv
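
Each config can then be loaded by name. For example (the repo id is illustrative):

from datasets import load_dataset

# the first config exposes all CSVs as a single "train" split
first = load_dataset("username/dataset", "my_first_set_of_split")
# the second config defines separate "train" and "test" splits
second = load_dataset("username/dataset", "my_second_set_of_split")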
alex-hh commented 3 days ago

Hi - I had something slightly different in mind:

Currently, YAML splits specified like this only allow choosing which file names to pass to each split. But what if I know which individual training examples I want to put in each split?

I could build split-specific files; however, for large datasets with overlapping splits (e.g. multiple sets of splits), this could result in significant duplication of data.

I can see that this could actually be very much intended (i.e. to discourage overlapping splits), but I wondered whether some support for handling splits based on individual identifiers could be considered.

lhoestq commented 2 days ago

This is not supported right now :/ Though you can load the data in two steps like this:

from datasets import load_dataset

# load the full data plus the indices that define the subset
full_dataset = load_dataset("username/dataset", split="train")
my_first_set_indices = load_dataset("username/dataset", "my_first_set_of_split", split="train")

# keep only the rows of the full dataset at those indices
my_first_set = full_dataset.select(my_first_set_indices["indices"])

You can create such a dataset by adapting this code, for example:


from datasets import Dataset, DatasetDict

# upload the full dataset as the default config
full_dataset.push_to_hub("username/dataset")

# then upload the indices for each set as a separate config
DatasetDict({
    "train": Dataset.from_dict({"indices": [0, 1, 2, 3]}),
    "test": Dataset.from_dict({"indices": [4, 5]}),
}).push_to_hub("username/dataset", "my_first_set_of_split")
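
To tie this back to the identifier-based request, the indices for each set could be derived from a unique identifier column. A minimal sketch, assuming the full dataset has a unique "id" column (the column name and id values are illustrative):

from datasets import Dataset, DatasetDict, load_dataset

full_dataset = load_dataset("username/dataset", split="train")

# map each identifier to its row index in the full dataset
id_to_index = {example_id: i for i, example_id in enumerate(full_dataset["id"])}

# identifier lists that define one set of splits (illustrative values)
split_ids = {"train": ["id-001", "id-002"], "test": ["id-003"]}

DatasetDict({
    split: Dataset.from_dict({"indices": [id_to_index[x] for x in ids]})
    for split, ids in split_ids.items()
}).push_to_hub("username/dataset", "my_first_set_of_split")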