lhoestq closed this 2 weeks ago
Actually, I still have to compute the sizes of un-read files
It was a bit of a rabbit hole to do this in a clean way, but it's all good now! Let me know if you have comments :)
I'll probably enable it only for allenai/c4 for now, to try it out
I added some comments, enabled it on allenai/c4 and the datasets-maintainers org, and added a migration to add `estimated_dataset_info` to existing cache entries
Nice, thanks. Feel free to merge and deploy!
This will be useful to show the estimated number of rows of datasets that are partially converted to Parquet
I added `estimated_dataset_info` to the `parquet-and-info` response. It contains estimations of:

Then we'll be able to propagate this info to the `size` jobs and then to `hub-cache`.

I'll run it on some datasets to check if it works fine, and when it's OK I'll re-run the jobs for datasets for which we need to estimate the number of rows.
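For context, a rough way to estimate the total row count of a partially converted dataset is to extrapolate from the fraction of source bytes already read. This is a hypothetical sketch (the function name and signature are illustrative, not the PR's actual implementation), assuming rows are roughly uniform in size across the source files:

```python
def estimate_num_rows(num_rows_read: int, num_bytes_read: int, total_num_bytes: int) -> int:
    """Extrapolate the total number of rows from the portion already converted.

    Hypothetical helper: assumes the bytes-per-row ratio observed so far
    holds for the un-read files as well.
    """
    if num_bytes_read <= 0:
        raise ValueError("cannot estimate without any converted data")
    return int(num_rows_read * total_num_bytes / num_bytes_read)


# e.g. 1,000 rows read out of 10 kB, with 100 kB of source data in total
print(estimate_num_rows(1_000, 10_000, 100_000))  # → 10000
```

The estimate gets more accurate as more files are converted, and is exact once everything has been read.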
TODO: