huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Support `pipeline` argument in inspect.py functions #4803

Open. Opened by severo 2 years ago

severo commented 2 years ago

Is your feature request related to a problem? Please describe.

The `wikipedia` dataset requires a `pipeline` argument to build the list of splits:

https://huggingface.co/datasets/wikipedia/blob/main/wikipedia.py#L937

However, `get_dataset_config_info` does not currently pass this argument:

https://github.com/huggingface/datasets/blob/main/src/datasets/inspect.py#L373-L375

which is called by other functions, e.g. `get_dataset_split_names`.
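
To make the request concrete, here is a minimal sketch of the kind of forwarding being asked for. It does not use the real `datasets` code: `ToyBuilder`, `ToyBeamBuilder` and `toy_split_names` are hypothetical stand-ins, and the only assumption taken from the issue is that some builders declare `_split_generators(self, dl_manager, pipeline)` while others take only `dl_manager`.

```python
import inspect


class ToyBuilder:
    """Hypothetical builder whose split listing needs no extra argument."""

    def _split_generators(self, dl_manager):
        return ["train", "test"]


class ToyBeamBuilder:
    """Hypothetical builder that also requires a `pipeline` argument."""

    def _split_generators(self, dl_manager, pipeline):
        return ["train"]


def toy_split_names(builder, dl_manager=None, pipeline=None):
    """Hypothetical helper: forward `pipeline` only if the builder declares it."""
    # Signature of a bound method excludes `self`.
    params = inspect.signature(builder._split_generators).parameters
    kwargs = {"pipeline": pipeline} if "pipeline" in params else {}
    return list(builder._split_generators(dl_manager, **kwargs))
```

With this shape, `toy_split_names(ToyBuilder())` and `toy_split_names(ToyBeamBuilder(), pipeline=some_pipeline)` both work through the same entry point. Whether the real inspect functions should detect the parameter this way or simply accept an explicit keyword is a design choice left open here.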

Additional context

The dataset viewer is not working out-of-the-box on wikipedia for this reason:

https://huggingface.co/datasets/wikipedia/viewer

[Screenshot 2022-08-08 at 12.01.16]

severo commented 1 year ago

Update: the preview (first-rows) now works, but the conversion to Parquet still fails. See https://huggingface.co/datasets/wikipedia/viewer/20220301.de/train

_split_generators() missing 1 required positional argument: 'pipeline'

Error code: UnexpectedError
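
For readers unfamiliar with this kind of error, the following toy reproduction shows how the message above arises. It is only an illustration under assumptions: `ToyBeamBuilder` and `call_without_pipeline` are hypothetical names, not part of `datasets`, and the snippet simply calls a method that requires `pipeline` without supplying it.

```python
class ToyBeamBuilder:
    """Hypothetical builder whose `_split_generators` requires `pipeline`."""

    def _split_generators(self, dl_manager, pipeline):
        return ["train"]


def call_without_pipeline(builder):
    """Call `_split_generators` the way a caller unaware of `pipeline` would."""
    try:
        builder._split_generators(None)  # `pipeline` is not passed
    except TypeError as err:
        # Python raises TypeError for the missing positional argument.
        return str(err)
    return None


message = call_without_pipeline(ToyBeamBuilder())
print(message)
```

The printed text ends with `missing 1 required positional argument: 'pipeline'`; depending on the Python version it may be prefixed with the class name.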