huggingface / dataset-viewer

Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
https://huggingface.co/docs/datasets-server
Apache License 2.0

[Config-parquet-and-info] Compute estimated dataset info #2906

Closed lhoestq closed 2 weeks ago

lhoestq commented 3 weeks ago

This will be useful to show the estimated number of rows of datasets that are only partially converted to Parquet.

I added estimated_dataset_info to the parquet-and-info response. It contains estimations of:

Then we'll be able to propagate this info to the size jobs and then to hub-cache.

I'll run it on some datasets to check that it works, and once it does I'll re-run the jobs for the datasets whose number of rows we need to estimate.

TODO:

lhoestq commented 3 weeks ago

Actually, I still have to compute the sizes of unread files.
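
To illustrate why those sizes matter, here is a minimal sketch of one way a row count could be extrapolated, assuming rows are spread roughly in proportion to file size. The helper name `estimate_num_rows` and its parameters are hypothetical and are not taken from this PR's code.

```python
def estimate_num_rows(rows_read: int, bytes_read: int, total_bytes: int) -> int:
    """Extrapolate a total row count from a partial read (hypothetical helper).

    Assumes the unread files have about the same bytes-per-row ratio
    as the files that were already converted.
    """
    if bytes_read <= 0:
        raise ValueError("bytes_read must be positive")
    if total_bytes < bytes_read:
        raise ValueError("total_bytes must be at least bytes_read")
    # Integer arithmetic avoids float rounding on very large datasets.
    return rows_read * total_bytes // bytes_read


# Example: 1M rows were read from 5 GB out of 20 GB in total.
estimated = estimate_num_rows(
    rows_read=1_000_000,
    bytes_read=5 * 10**9,
    total_bytes=20 * 10**9,
)
print(estimated)
```

Under this assumption, the total size has to include the files that were never read, otherwise the ratio cannot be computed.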

lhoestq commented 2 weeks ago

It was a bit of a rabbit hole to do this in a clean way, but it's all good now! Let me know if you have comments :)

I'll probably enable it only for allenai/c4 for now, to try it out.

lhoestq commented 2 weeks ago

I added some comments, enabled it on allenai/c4 and the datasets-maintainers org, and added a migration that adds estimated_dataset_info to existing cache entries.

severo commented 2 weeks ago

Nice, thanks. Feel free to merge and deploy!