huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Help dataset owners choose between configs and splits? #2721

Open severo opened 6 months ago

severo commented 6 months ago

See https://huggingface.slack.com/archives/C039P47V1L5/p1713172703779839

Am I correct in assuming that if you specify a "config" in a dataset, only the given config is downloaded, but if you specify a split, all splits for that config are downloaded? I came across it when using facebook's belebele (https://huggingface.co/datasets/facebook/belebele). Instead of a config for each language, they use a split for each language, but that seems to mean that the full dataset is downloaded, even if you select just one language split.

For languages, we recommend using different configs, not splits.

Maybe we should also show a warning, or open a PR or discussion on the dataset, when a dataset contains more than 5 splits, hinting that configs might be a better fit?
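As a rough illustration of the check suggested above, here is a minimal sketch. It is not existing dataset-viewer code: the function names, the message wording, and the sample split names are hypothetical, and the threshold of 5 is just the number floated in this comment.

```python
from typing import Dict, List

# Threshold floated in this issue; an assumption, not an implemented rule.
MAX_SPLITS_BEFORE_HINT = 5


def configs_with_many_splits(
    splits_by_config: Dict[str, List[str]],
    max_splits: int = MAX_SPLITS_BEFORE_HINT,
) -> Dict[str, int]:
    """Return {config: number of splits} for configs with more than max_splits splits."""
    return {
        config: len(splits)
        for config, splits in splits_by_config.items()
        if len(splits) > max_splits
    }


def hint_message(config: str, n_splits: int) -> str:
    """Build a (hypothetical) hint that could be shown to the dataset owner."""
    return (
        f"Config '{config}' has {n_splits} splits. If the splits are not "
        "train/validation/test-style partitions (for example, one split per "
        "language), consider using one config per subset instead."
    )


# Illustrative input: one config whose splits are per-language subsets.
example = {
    "default": ["eng_Latn", "fra_Latn", "deu_Latn", "spa_Latn", "ita_Latn", "por_Latn"],
    "other": ["train", "test"],
}

for config, n_splits in configs_with_many_splits(example).items():
    print(hint_message(config, n_splits))
```

The input mapping could presumably be built from whatever config and split names the viewer already computes for a dataset. Whether the hint would be surfaced as a warning in the viewer or as an automatically opened discussion is left open here.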

severo commented 6 months ago

See a discussion on the Hub: https://huggingface.co/datasets/facebook/belebele/discussions/5
