huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.11k stars, 2.66k forks

Dataset sizes are in MiB instead of MB in dataset cards #5708

Closed: albertvillanova closed this issue 9 months ago

albertvillanova commented 1 year ago

As @severo reported in an internal discussion (https://github.com/huggingface/moon-landing/issues/5929):

Now we show the dataset size:

But even when the underlying size is the same, the numbers do not match: the viewer shows MB, while the README generally reports values computed in MiB (even though they are labeled MB; see https://huggingface.co/datasets/blimp/blob/main/README.md?code=true#L1932)

[Screenshot, 2023-04-04 10:16:01]

TODO: Fix the values in "Size of downloaded dataset files:", "Size of the generated dataset:" and "Total amount of disk used:".
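To make the mismatch concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical and not part of `datasets`. It only assumes the usual definitions: 1 MiB = 2**20 bytes and 1 MB = 10**6 bytes.

```python
MIB = 2**20  # bytes in one mebibyte
MB = 10**6   # bytes in one megabyte


def bytes_to_mib(num_bytes: int) -> float:
    """Size in MiB, which is what the cards showed under an "MB" label."""
    return num_bytes / MIB


def bytes_to_mb(num_bytes: int) -> float:
    """Size in decimal MB, which is what the viewer shows."""
    return num_bytes / MB


def mib_to_mb(value_in_mib: float) -> float:
    """Convert a value computed in MiB into decimal MB."""
    return value_in_mib * MIB / MB


# The same 50,000,000 bytes gives two different numbers:
print(f"{bytes_to_mib(50_000_000):.2f}")  # 47.68
print(f"{bytes_to_mb(50_000_000):.2f}")   # 50.00
```

The two values differ by a constant factor of 1.048576, so a card value computed in MiB can be corrected without knowing the original byte count.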

albertvillanova commented 1 year ago

Example of bulk edit: https://huggingface.co/datasets/aeslc/discussions/5

julien-c commented 1 year ago

looks great!

Do you record the fact that a dataset has already been converted (so as not to convert it twice), or do you rely on the info contained in dataset_info?

albertvillanova commented 1 year ago

I am only looping through the dataset cards, assuming that all of them were created with MiB.

I agree we should only run the bulk edit once for all canonical datasets: I'm using a for-loop over canonical datasets.
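As an illustration of what one step of such a loop could look like, here is a rough sketch that rewrites the three size fields in a card's text. It is not the script actually used. The function name and the exact line shape matched by the regular expression are assumptions. Only the three field labels come from the TODO above.

```python
import re

MIB = 2**20  # bytes in one mebibyte
MB = 10**6   # bytes in one megabyte

# Assumed shape: "<label>" + optional markdown bold markers + "<number> MB".
SIZE_LINE = re.compile(
    r"(?P<label>Size of downloaded dataset files:|"
    r"Size of the generated dataset:|"
    r"Total amount of disk used:)"
    r"(?P<sep>\**\s*)(?P<value>\d+(?:\.\d+)?) MB"
)


def convert_card_sizes(readme_text: str) -> str:
    """Rewrite size values that were computed in MiB as decimal MB."""

    def _fix(match: re.Match) -> str:
        mb_value = float(match.group("value")) * MIB / MB
        return f"{match.group('label')}{match.group('sep')}{mb_value:.2f} MB"

    return SIZE_LINE.sub(_fix, readme_text)


card = "- **Size of downloaded dataset files:** 100.00 MB"
print(convert_card_sizes(card))
# - **Size of downloaded dataset files:** 104.86 MB
```

A rewrite like this is not idempotent: applying it to an already converted card inflates the value again. That is why the edit should be run only once per card unless the conversion is recorded somewhere.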

julien-c commented 1 year ago

yes, worst case, we have this in structured data:

[image]

albertvillanova commented 1 year ago

I have also just added the conversion from MB to GB where necessary. See:

severo commented 1 year ago

Nice. Is it another loop? Because in https://huggingface.co/datasets/amazon_us_reviews/discussions/2/files we have 32377.29 MB for example

albertvillanova commented 1 year ago

First, I tested some batches to check the changes made. Then I incorporated the MB to GB conversion. Now I'm running the rest.
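For illustration, the MB to GB step could look like the sketch below. The function name, the 1000 MB threshold and the two-decimal formatting are assumptions, not taken from the actual script.

```python
def format_size(size_in_mb: float) -> str:
    """Format a decimal-MB value, switching to GB from 1000 MB upward."""
    if size_in_mb >= 1000:
        return f"{size_in_mb / 1000:.2f} GB"
    return f"{size_in_mb:.2f} MB"


# The value mentioned above for amazon_us_reviews:
print(format_size(32377.29))  # 32.38 GB
print(format_size(999.5))     # 999.50 MB
```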

albertvillanova commented 1 year ago

The bulk edit parsed 751 canonical datasets and updated 166.

severo commented 1 year ago

Thanks a lot!

The sizes now match as expected!

[Screenshot, 2023-04-05 16:10:15]

albertvillanova commented 1 year ago

I made another bulk edit of former canonical datasets that were moved to community organizations. I parsed 11 datasets and opened a PR on 3 of them:

severo commented 1 year ago

should we force merge the PR and close this issue?

albertvillanova commented 9 months ago

I merged the PRs for "scicite" and "scifact".