Closed by windmaple 3 months ago
@xianbaoqian
Feel free to open a PR in m-a-p/COIG-CQIA to define a default subset. Currently there is no default.
You can find some documentation at https://huggingface.co/docs/hub/datasets-manual-configuration#multiple-configurations
@lhoestq
While a default subset (e.g. all) provided by the dataset author is the ideal solution, it is not always the reality.
Without the ability to fork the dataset, this can be problematic.
As far as I know, it is not possible to specify multiple subsets in a general, programmatic way without hard-coding the subset names of a specific dataset.
Even the ability to fetch subset names and loop over them would be sufficient.
Please note that each subset can have different feature columns, thus making it impossible to load them all into a unique Dataset instance.
That is why subsets were created: to support different but related datasets to coexist in a single dataset repository.
If you would like to programmatically get the list of subset names, you can use datasets.get_dataset_config_names: https://huggingface.co/docs/datasets/v2.20.0/en/load_hub#configurations
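For illustration, here is a minimal sketch of looping over the subset names and loading each subset separately. The helper names load_all_subsets and group_by_columns are hypothetical (they are not part of the datasets library), and the optional list_configs / load parameters exist only so the helper can be tried with stand-ins. Because each subset can have different columns, the result is a dict of separate datasets rather than one merged Dataset.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Optional, Tuple


def load_all_subsets(
    repo_id: str,
    split: Optional[str] = None,
    list_configs: Optional[Callable[..., List[str]]] = None,
    load: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: load every subset of a Hub dataset, keyed by subset name."""
    if list_configs is None or load is None:
        # Imported lazily so the helper can also be exercised with stand-in callables.
        from datasets import get_dataset_config_names, load_dataset

        list_configs = list_configs or get_dataset_config_names
        load = load or load_dataset
    # One load call per subset; nothing is merged here.
    return {name: load(repo_id, name, split=split) for name in list_configs(repo_id)}


def group_by_columns(subsets: Dict[str, Any]) -> Dict[Tuple[str, ...], List[str]]:
    """Hypothetical helper: group subset names by their column names.

    Assumes each value is a single split (a Dataset), so that
    `column_names` is a flat list of strings.
    """
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for name, ds in subsets.items():
        groups[tuple(ds.column_names)].append(name)
    return dict(groups)


# Example usage (needs network access; split="train" is an assumption
# about this particular repository):
#   subsets = load_all_subsets("m-a-p/COIG-CQIA", split="train")
#   print(group_by_columns(subsets))
```

Subsets that end up in the same group share column names and are candidates for datasets.concatenate_datasets, provided their feature types also match; the others have to stay separate.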
Feature request
Currently, load_dataset() forces users to specify a subset. Example:

    from datasets import load_dataset
    dataset = load_dataset("m-a-p/COIG-CQIA")
This means a single load_dataset() call cannot return all the subsets at once. I guess one workaround is to manually specify the subset files, like in here, which is clumsy.
Motivation
Ideally, if no subset is specified, the API should just try to load all subsets. This makes it much easier to handle datasets w/ subsets.
Your contribution
Not sure since I'm not familiar w/ the lib src.