Dataset sizes are in MiB instead of MB in dataset cards

huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

https://huggingface.co/docs/datasets

Apache License 2.0


Dataset sizes are in MiB instead of MB in dataset cards #5708

Closed by albertvillanova 9 months ago

albertvillanova commented 1 year ago

As @severo reported in an internal discussion (https://github.com/huggingface/moon-landing/issues/5929):

Now we show the dataset size:

- from the dataset card (in the side column)
- from the datasets-server (in the viewer)

But, even if the size is the same, we see a mismatch because the viewer shows MB, while the info from the README generally shows MiB (even if it's written MB -> https://huggingface.co/datasets/blimp/blob/main/README.md?code=true#L1932)
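To make the mismatch concrete, here is a minimal sketch of the arithmetic, assuming the usual definitions (1 MiB = 2^20 bytes, 1 MB = 10^6 bytes). The helper names and the 100 MiB file are hypothetical and not taken from the library or the Hub.

```python
MIB = 2**20  # bytes in one mebibyte (binary unit)
MB = 10**6   # bytes in one megabyte (decimal unit)


def bytes_to_mib(num_bytes: int) -> float:
    """Express a byte count in MiB."""
    return num_bytes / MIB


def bytes_to_mb(num_bytes: int) -> float:
    """Express a byte count in MB."""
    return num_bytes / MB


def mib_to_mb(value_mib: float) -> float:
    """Convert a value measured in MiB to MB."""
    return value_mib * MIB / MB


size = 100 * MIB  # hypothetical file of exactly 100 MiB
print(f"{bytes_to_mib(size):.2f} MiB")  # 100.00 MiB
print(f"{bytes_to_mb(size):.2f} MB")    # 104.86 MB
```

The same byte count therefore reads about 4.9% larger in MB than in MiB, which is the gap between the two displays when one of them is mislabelled.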

TODO: Values to be fixed in the fields "Size of downloaded dataset files:", "Size of the generated dataset:" and "Total amount of disk used:"

[x] Bulk edit on the Hub to fix this in all canonical datasets
[x] Bulk PR on the Hub to fix ancient canonical datasets that were moved to organizations

albertvillanova commented 1 year ago

Example of bulk edit: https://huggingface.co/datasets/aeslc/discussions/5

julien-c commented 1 year ago

looks great!

Do you record the fact that you've already converted a dataset (so as not to convert it twice), or do you rely on the info contained in dataset_info?

albertvillanova commented 1 year ago

I am only looping through the dataset cards, assuming that all of them were created with MiB.

I agree we should only run the bulk edit once for all canonical datasets: I'm using a for-loop over canonical datasets.
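The actual bulk-edit script is not shown in this thread. As a rough sketch of what one pass over a card's text could look like, the snippet below rewrites the three size fields under the same assumption (every value labelled "MB" is really MiB). The regular expression, the function name `fix_card_sizes`, and the expected line layout are all assumptions for illustration.

```python
import re

MIB_TO_MB = 2**20 / 10**6  # 1 MiB = 1.048576 MB

# The three field labels come from the TODO in this issue; the surrounding
# layout (optional bold markers, two-decimal values) is an assumption.
SIZE_PATTERN = re.compile(
    r"(Size of downloaded dataset files:"
    r"|Size of the generated dataset:"
    r"|Total amount of disk used:)"
    r"(\**\s*)(\d+(?:\.\d+)?) MB"
)


def fix_card_sizes(card_text: str) -> str:
    """Reinterpret 'MB' values as MiB and rewrite them as true MB."""

    def _convert(match: re.Match) -> str:
        label, separator, value = match.groups()
        return f"{label}{separator}{float(value) * MIB_TO_MB:.2f} MB"

    return SIZE_PATTERN.sub(_convert, card_text)


card = "- **Size of downloaded dataset files:** 100.00 MB"
print(fix_card_sizes(card))
# - **Size of downloaded dataset files:** 104.86 MB
```

A rewrite like this is not idempotent: applying it twice to the same card would inflate the values a second time, which is why running the loop exactly once matters.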

julien-c commented 1 year ago

yes, worst case, we have this in structured data:
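The structured data referred to here is not reproduced in the thread. As a hedged sketch of how a byte count from structured metadata could be used to tell whether a card has already been converted, the hypothetical helper below compares the displayed value against both interpretations. The function name, the tolerance, and the idea that a byte count is available for each size are assumptions.

```python
def displayed_unit(card_value: float, num_bytes: int, tol: float = 0.01) -> str:
    """Guess whether a card value labelled 'MB' is really MB or MiB.

    card_value: the number shown in the card (rounded to two decimals).
    num_bytes:  the corresponding size in bytes from structured metadata.
    tol:        allowed rounding difference (an assumed value).
    """
    as_mb = num_bytes / 10**6
    as_mib = num_bytes / 2**20
    if abs(card_value - as_mb) <= tol:
        return "MB"
    if abs(card_value - as_mib) <= tol:
        return "MiB"
    return "unknown"


num_bytes = 104_857_600  # hypothetical size: exactly 100 MiB
print(displayed_unit(100.00, num_bytes))  # MiB
print(displayed_unit(104.86, num_bytes))  # MB
```

For very small sizes the two interpretations differ by less than the tolerance, so the check cannot distinguish them; in that range the conversion changes the displayed value by at most a rounding step anyway.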

albertvillanova commented 1 year ago

I have just included as well the conversion from MB to GB if necessary. See:

severo commented 1 year ago

Nice. Is it another loop? Because in https://huggingface.co/datasets/amazon_us_reviews/discussions/2/files we have 32377.29 MB for example

albertvillanova commented 1 year ago

First, I tested some batches to check the changes made. Then I incorporated the MB to GB conversion. Now I'm running the rest.
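The MB to GB step can be sketched as a small formatting helper. This assumes decimal units (1 GB = 1000 MB) and a switch-over threshold of 1000 MB; the threshold, the two-decimal rounding, and the function name are assumptions, since the actual script is not shown here.

```python
def format_size_mb(value_mb: float) -> str:
    """Format a size given in MB, switching to GB for large values."""
    if value_mb >= 1000:  # assumed threshold for switching units
        return f"{value_mb / 1000:.2f} GB"
    return f"{value_mb:.2f} MB"


print(format_size_mb(32377.29))  # 32.38 GB
print(format_size_mb(999.0))     # 999.00 MB
```

With the value quoted above for amazon_us_reviews, 32377.29 MB would be displayed as 32.38 GB.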

albertvillanova commented 1 year ago

The bulk edit parsed 751 canonical datasets and updated 166.

severo commented 1 year ago

Thanks a lot!

The sizes now match as expected!

albertvillanova commented 1 year ago

I made another bulk edit of ancient canonical datasets that were moved to community organizations. I parsed 11 datasets and opened a PR on 3 of them:

[x] "allenai/scicite": https://huggingface.co/datasets/allenai/scicite/discussions/3
[x] "allenai/scifact": https://huggingface.co/datasets/allenai/scifact/discussions/2
[x] "dair-ai/emotion": https://huggingface.co/datasets/dair-ai/emotion/discussions/6

severo commented 1 year ago

should we force merge the PR and close this issue?

albertvillanova commented 9 months ago

I merged the PRs for "scicite" and "scifact".