Closed by albertvillanova 9 months ago
Example of bulk edit: https://huggingface.co/datasets/aeslc/discussions/5
looks great!
Do you record the fact that you've already converted a dataset (to avoid converting it twice), or do you rely on the info contained in dataset_info?
I am only looping through the dataset cards, assuming that all of them were created with MiB.
I agree we should only run the bulk edit once for all canonical datasets: I'm using a for-loop over the canonical datasets.
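For context, a minimal sketch of what such a loop could look like with `huggingface_hub` (the actual script isn't shown in this thread; the no-namespace check for canonical datasets and the field being probed are assumptions):

```python
from huggingface_hub import HfApi, DatasetCard

api = HfApi()

# Canonical datasets are the ones without a namespace (no "/" in the repo id).
for info in api.list_datasets():
    if "/" in info.id:
        continue  # skip community datasets
    card = DatasetCard.load(info.id)
    # card.text is the README body, where the sizes were computed in MiB
    print(info.id, "Size of downloaded dataset files" in card.text)
```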
yes, worst case, we have this in structured data:
I have also just included the conversion from MB to GB where necessary. See:
Nice. Is it another loop? Because in https://huggingface.co/datasets/amazon_us_reviews/discussions/2/files we have 32377.29 MB, for example.
First, I tested some batches to check the changes made. Then I incorporated the MB-to-GB conversion. Now I'm running the rest.
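For illustration, a minimal sketch of that conversion, assuming a 1000 MB threshold for switching to GB and two-decimal rounding (the actual bulk-edit script isn't shown here):

```python
def fix_size(value_mib: float) -> str:
    """Relabel a size that was computed in MiB but written as 'MB',
    switching to GB when the decimal value gets large."""
    mb = value_mib * 1024**2 / 1000**2  # 1 MiB = 1.048576 MB
    if mb >= 1000:
        return f"{mb / 1000:.2f} GB"
    return f"{mb:.2f} MB"

# The amazon_us_reviews example above: 32377.29 "MB" is really MiB.
print(fix_size(32377.29))  # -> "33.95 GB"
```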
The bulk edit parsed 751 canonical datasets and updated 166.
Thanks a lot!
The sizes now match as expected!
I made another bulk edit of ancient canonical datasets that were moved to community organizations. I parsed 11 datasets and opened a PR on 3 of them:
Should we force-merge the PRs and close this issue?
I merged the PRs for "scicite" and "scifact".
As @severo reported in an internal discussion (https://github.com/huggingface/moon-landing/issues/5929):
Now we show the dataset size:
But even if the size is the same, we see a mismatch, because the viewer shows MB while the info from the README generally shows MiB (even if it's written as MB -> https://huggingface.co/datasets/blimp/blob/main/README.md?code=true#L1932)
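To make the mismatch concrete, the same byte count diverges under the two conventions (the byte count below is hypothetical):

```python
size_bytes = 29_387_414  # hypothetical download size in bytes

print(f"{size_bytes / 10**6:.2f} MB")   # 29.39 MB  -- what the viewer shows
print(f"{size_bytes / 2**20:.2f} MiB")  # 28.03 MiB -- the README value, labeled "MB"
```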
TODO: Values to be fixed in:
- Size of downloaded dataset files:
- Size of the generated dataset:
- Total amount of disk used:
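A sketch of how those fields could be matched and relabeled, assuming they appear as bold list items in the README (as in typical dataset cards); `SIZE_FIELD` and `fix_sizes` are hypothetical names, not the actual bulk-edit code:

```python
import re

# Matches the three fields above as they typically appear in dataset cards,
# e.g. "- **Size of downloaded dataset files:** 28.03 MB"
SIZE_FIELD = re.compile(
    r"\*\*(Size of downloaded dataset files|Size of the generated dataset|"
    r"Total amount of disk used):\*\*\s*([\d.]+)\s*MB"
)

def fix_sizes(readme: str) -> str:
    """Rewrite MiB values mislabeled as 'MB' into true decimal MB."""
    def repl(m: re.Match) -> str:
        mb = float(m.group(2)) * 1024**2 / 1000**2
        return f"**{m.group(1)}:** {mb:.2f} MB"
    return SIZE_FIELD.sub(repl, readme)

print(fix_sizes("- **Size of the generated dataset:** 28.03 MB"))
# -> "- **Size of the generated dataset:** 29.39 MB"
```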