Open severo opened 1 year ago
Note that this means changing the format (and implementation) of the config-parquet-and-info step, and recomputing all its artifacts 😬
Also: the field `partial` should be added to every entry of `splits` in the /info response (or provided in another format, if we want to preserve the "info" as exported by the datasets library)
Maybe https://github.com/huggingface/moon-landing/pull/7079 (internal) is sufficient for now, i.e. show a general warning for the dataset if any of its splits is partial.
I think we should store which splits are partial and which are complete. Opening an issue for that -> https://github.com/huggingface/dataset-viewer/issues/2809, and this one will depend on it.
Note that we can get this info per split already for free for most datasets:
- the `partial` value is equal to the `partial` value at config level
- the `partial` value is True if it matches the size generated by config-parquet-and-info

So actually we should be able to retrieve most of the `partial` values, no?
yes, it would be a good way to migrate the cache entries to the new schema instead of recomputing in #2809
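A rough sketch of what such a migration helper could look like, in case it helps the discussion. It only covers the cases that can be deduced from the config-level flag alone, and it assumes that a config is partial exactly when at least one of its splits is partial. The function name `infer_split_partial` and its signature are made up for illustration and are not part of the dataset-viewer code base.

```python
from typing import Optional


def infer_split_partial(
    config_partial: bool, split_names: list[str]
) -> dict[str, Optional[bool]]:
    """Derive per-split `partial` flags from the config-level flag where possible.

    Hypothetical helper: a value of None means the status cannot be deduced
    from the config-level flag alone.
    """
    # Assumption: a config is partial iff at least one of its splits is partial.
    if not config_partial:
        # A complete config cannot contain a partial split.
        return {name: False for name in split_names}
    if len(split_names) == 1:
        # With a single split, the split flag equals the config flag.
        return {split_names[0]: True}
    # Partial config with several splits: ambiguous without more data
    # (for example a size comparison), so leave these entries unknown.
    return {name: None for name in split_names}
```

Entries left as `None` would still need another signal, such as a size comparison, or a recompute.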
For example, https://datasets-server.huggingface.co/size?dataset=c4 only provides a global `partial: true` field, and the response does not make explicit that the "train" split is partial while the "test" one is complete.

Every entry in `configs` and `splits` should also include its own `partial` field, to be able to show this information in the viewer (selects).

Endpoints where we want these extra fields: