huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.06k stars, 2.64k forks

load_dataset() should load all subsets, if no specific subset is specified #6951

Closed: windmaple closed this issue 3 months ago

windmaple commented 3 months ago

Feature request

Currently, load_dataset() forces users to specify a subset. Example:

from datasets import load_dataset
dataset = load_dataset("m-a-p/COIG-CQIA")

ValueError                                Traceback (most recent call last)
<ipython-input-10-c0cb49385da6> in <cell line: 2>()
      1 from datasets import load_dataset
----> 2 dataset = load_dataset("m-a-p/COIG-CQIA")

3 frames
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _create_builder_config(self, config_name, custom_features, **config_kwargs)
    582                     if not config_kwargs:
    583                         example_of_usage = f"load_dataset('{self.dataset_name}', '{self.BUILDER_CONFIGS[0].name}')"
--> 584                         raise ValueError(
    585                             "Config name is missing."
    586                             f"\nPlease pick one among the available configs: {list(self.builder_configs.keys())}"

ValueError: Config name is missing.
Please pick one among the available configs: ['chinese_traditional', 'coig_pc', 'exam', 'finance', 'douban', 'human_value', 'logi_qa', 'ruozhiba', 'segmentfault', 'wiki', 'wikihow', 'xhs', 'zhihu']
Example of usage:
    `load_dataset('coig-cqia', 'chinese_traditional')`

This means a single load_dataset() call cannot return all the subsets at the same time. I guess one workaround is to manually specify the subset files, which is clumsy.

Motivation

Ideally, if no subset is specified, the API should just try to load all subsets. This would make it much easier to handle datasets w/ subsets.

Your contribution

Not sure since I'm not familiar w/ the lib src.

windmaple commented 3 months ago

@xianbaoqian

lhoestq commented 3 months ago

Feel free to open a PR in m-a-p/COIG-CQIA to define a default subset. Currently there is no default.

You can find some documentation at https://huggingface.co/docs/hub/datasets-manual-configuration#multiple-configurations

brthor commented 3 months ago

@lhoestq

While a default subset (e.g. all) provided by the dataset author is the ideal solution, it is not always the reality.

Without the ability to fork the dataset, this can be problematic.

As far as I know, there is no generalized, programmatic way to specify multiple subsets without hard-coding the subset names for a specific dataset.

Even the ability to fetch subset names and loop over them would be sufficient.

albertvillanova commented 3 months ago

Please note that each subset can have different feature columns, which makes it impossible to load them all into a single Dataset instance.

That is why subsets were created: to allow different but related datasets to coexist in a single dataset repository.

If you would like to programmatically get the list of subset names, you can use datasets.get_dataset_config_names: https://huggingface.co/docs/datasets/v2.20.0/en/load_hub#configurations
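
For anyone landing here, a minimal sketch of the loop discussed above: fetch the subset names with datasets.get_dataset_config_names, then call load_dataset once per subset. The helper name load_all_subsets and its list_configs / load_one parameters are hypothetical (they are not part of the library) and exist only so the loop can be tried with stand-in functions. Subsets are kept in a dict rather than merged, since their feature columns may differ.

```python
from typing import Any, Callable, Dict, List, Optional


def load_all_subsets(
    path: str,
    list_configs: Optional[Callable[[str], List[str]]] = None,
    load_one: Optional[Callable[[str, str], Any]] = None,
) -> Dict[str, Any]:
    """Load every subset (config) of a dataset into a dict keyed by subset name.

    `load_all_subsets`, `list_configs` and `load_one` are hypothetical names
    used for this sketch; only the two `datasets` functions below are real.
    """
    if list_configs is None or load_one is None:
        # Imported lazily so the helper can be exercised without the library.
        import datasets

        list_configs = list_configs or datasets.get_dataset_config_names
        load_one = load_one or datasets.load_dataset

    # One load per subset; nothing is concatenated because each subset
    # may define different feature columns.
    return {name: load_one(path, name) for name in list_configs(path)}
```

With the defaults, load_all_subsets("m-a-p/COIG-CQIA") should return one entry per config name listed in the error message above, e.g. subsets["zhihu"]. This downloads every subset, so it can be slow for large repositories.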