huggingface / hub-docs

Docs of the Hugging Face Hub
http://hf.co/docs/hub
Apache License 2.0

Link to datasets YAML configuration page #1222

Open severo opened 6 months ago

severo commented 6 months ago

Link to https://huggingface.co/docs/datasets/v2.7.1/en/dataset_card#more-yaml-tags from https://huggingface.co/docs/hub/datasets-manual-configuration, to complement it with all the possible values in the README's YAML.

severo commented 6 months ago

And give an example of each supported feature type in the YAML config. See https://discuss.huggingface.co/t/appropriate-yaml-for-dataset-info-list-float/74418 for example: I think we currently have no reference to share with the user.
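For illustration, a feature holding a list of floats might be declared as in the sketch below. It assumes the layout the `datasets` library usually writes into `dataset_info`; the column names are hypothetical, and the exact key for list types (`sequence` here) may differ between library versions, so treat it as an assumption rather than a reference.

```yaml
dataset_info:
  features:
  # Hypothetical scalar column
  - name: text
    dtype: string
  # Hypothetical column holding a list of floats (assumed key: `sequence`)
  - name: scores
    sequence: float32
```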

lappemic commented 3 months ago

Hey @severo, I just had a look into this. As far as I can see, there is no longer a "More YAML tags" section in the Dataset docs. Is this correct? If so, is this issue outdated, or am I missing something?

severo commented 3 months ago

Indeed, it has been removed in https://github.com/huggingface/datasets/pull/5470#discussion_r1088471903

The spec is here: https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1

severo commented 3 months ago

Somewhat related: discussion about the spec: https://github.com/huggingface/dataset-viewer/issues/2639

Also: should we just redirect to the spec (https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), or should we create a dedicated doc page for this? Adding the link would already be a good step forward.

severo commented 3 months ago

Also: https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1 is outdated:

configs:  # Optional for datasets with multiple configurations like glue.
- {config_0}  # Example for glue: sst2
- {config_1}  # Example for glue: cola

It does not respect the current format: https://huggingface.co/docs/hub/datasets-manual-configuration.

Ideally, it should be the reference, with more details than https://huggingface.co/docs/hub/datasets-manual-configuration, not the other way.

cc @polinaeterna for example if you want to look at it

lappemic commented 3 months ago

> Adding the link would already be a good step forward.

Shall I start with this and see where it leads us, @severo? Or would you suggest a different approach?

severo commented 3 months ago

Hmmm, I think we have to improve the spec first. Then, link to it from the docs page, otherwise the link would not bring much value.

lappemic commented 3 months ago

Let me know if I can help out somehow! Would be down for it. 😄

severo commented 3 months ago

Do you want to work on a PR to improve the spec https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1 ? The idea is to add the structure of the configs: field, to match https://huggingface.co/docs/hub/datasets-manual-configuration at least (config_name, data_files, etc). Some more fields can be passed, if I'm not wrong (it's defined in https://github.com/huggingface/datasets, but @polinaeterna knows these details better than I)
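As a rough sketch of the structure mentioned above, the `configs:` field for a multi-configuration dataset like glue could look as follows. It assumes the format shown on the datasets-manual-configuration page; the file paths are hypothetical and only illustrate where `data_files` goes.

```yaml
configs:
# One entry per configuration; paths below are hypothetical
- config_name: sst2
  data_files:
  - split: train
    path: "sst2/train.csv"
  - split: test
    path: "sst2/test.csv"
- config_name: cola
  data_files: "cola/*.csv"
```

Each entry names a configuration and maps it to its files, either as one path or pattern or as a list of per-split paths.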

lappemic commented 3 months ago

I would love to! Will open a PR for discussion.

lappemic commented 3 months ago

Since the spec is improved, shall I open a PR to link the YAML configuration page?