huggingface / dataset-viewer

Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
https://huggingface.co/docs/datasets-server
Apache License 2.0

provide one "partial" field per entry in aggregated responses #1532

Open severo opened 1 year ago

severo commented 1 year ago

For example, https://datasets-server.huggingface.co/size?dataset=c4 only provides a global partial: true field; the response does not make explicit that the "train" split is partial while the "test" split is complete.

Every entry in configs and splits should also include its own partial field, so that the viewer can show this information (in the config and split selects).
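To make the proposal concrete, here is a minimal sketch of what per-entry flags could look like. The response shape is simplified and assumed (only the configs, splits and partial keys are modeled), and add_partial_fields and its partial_splits argument are hypothetical names for illustration, not part of the current API:

```python
from typing import Any


def add_partial_fields(
    response: dict[str, Any], partial_splits: set[tuple[str, str]]
) -> dict[str, Any]:
    """Return a copy of an aggregated response with a "partial" flag on every entry.

    partial_splits is a hypothetical input: the (config, split) pairs known to be partial.
    """
    # A split is partial if it is listed in partial_splits.
    splits = [
        {**s, "partial": (s["config"], s["split"]) in partial_splits}
        for s in response["splits"]
    ]
    # A config is partial if at least one of its splits is partial.
    partial_configs = {s["config"] for s in splits if s["partial"]}
    configs = [
        {**c, "partial": c["config"] in partial_configs} for c in response["configs"]
    ]
    # The global flag stays consistent with the per-entry flags.
    return {
        **response,
        "configs": configs,
        "splits": splits,
        "partial": bool(partial_configs),
    }


# Simplified, assumed response shape: one config with a partial "train" split.
example = {
    "configs": [{"config": "en"}],
    "splits": [
        {"config": "en", "split": "train"},
        {"config": "en", "split": "test"},
    ],
    "partial": True,
}
annotated = add_partial_fields(example, {("en", "train")})
```

With flags like these, the viewer could mark "train" as partial and "test" as complete instead of relying on the single global field.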

Endpoints where we want these extra fields:

severo commented 1 year ago

Note that this means changing the format (and implementation) of the config-parquet-and-info step, and recomputing all its artifacts 😬

Also: the field partial should be added to every entry of splits in the /info response (or provided in another format, if we want to preserve the "info" as exported by the datasets library)

severo commented 1 year ago

Maybe https://github.com/huggingface/moon-landing/pull/7079 (internal) is sufficient for now, i.e., show a general warning for the dataset if any of its splits is partial.
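As a rough illustration of that fallback, the general warning only needs the global partial field that the response already provides. The function name and the warning text below are hypothetical and are not taken from the internal PR:

```python
from typing import Any, Optional


def general_partial_warning(response: dict[str, Any]) -> Optional[str]:
    """Return a dataset-level warning when the global "partial" flag is set.

    This does not say which split is partial; it only relies on the global field.
    """
    if response.get("partial", False):
        return "Some splits of this dataset are only partially available in the viewer."
    return None


warning = general_partial_warning({"partial": True})
```

The trade-off is that the user learns something is truncated, but not which config or split is affected.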

severo commented 2 months ago

> Note that this means changing the format (and implementation) of the config-parquet-and-info step, and recomputing all its artifacts 😬

I think we should store which splits are partial and which are complete. Opening an issue for that -> https://github.com/huggingface/dataset-viewer/issues/2809, and this one will depend on it.

lhoestq commented 2 months ago

Note that we can get this info per split already for free for most datasets:

So actually, we should be able to retrieve most of the partial values, no?

severo commented 2 months ago

Yes, it would be a good way to migrate the cache entries to the new schema instead of recomputing them in #2809.