tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Nonsensical DL size in progress bar when some files are already downloaded #3759

Open IRDonch opened 2 years ago

IRDonch commented 2 years ago

Short description

Running tfds build <dataset> when some of the files have already been downloaded and others are missing results in meaningless sizes being displayed in the progress bar.

Environment information

Reproduction instructions

First, run tfds build voc/2012 and wait for it to finish.

Then, remove $TFDS_DATA_DIR/downloads/pjredd.com_media_files_VOCtra_11-May-20124U92MnDPGT0LX3SxafRBV6Swxu-nCPTdD_eO5pF2O8s.tar and $TFDS_DATA_DIR/voc/2012/4.0.0.

Then run tfds build voc/2012 again. The progress bar will show something like this:

Dl Size...: 100%|█████████████████████████████████████████████▉| 1850626566/1850628467 [00:12<00:00, 146829508.09 MiB/s]

This doesn't make sense, as it implies that the missing file is less than 1% of the total size of the dataset's files, even though it's almost 2 GB in size.
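For reference, here is the arithmetic on the two numbers shown in the bar above. This is only a sketch: the two counts are copied from the output, while reading the remainder as MiB is an assumption (it matches the unit-mismatch suspicion, but nothing in the output confirms it).

```python
# Counts copied from the progress bar output in this report.
current = 1850626566
total = 1850628467

# Units still outstanding according to the bar, and the fraction it claims is done.
remaining = total - current
fraction_done = current / total

MIB = 1024 * 1024
# Assumption: the outstanding amount was counted in MiB rather than bytes.
remaining_if_mib = remaining * MIB

print(f"remaining units: {remaining}")
print(f"fraction done: {fraction_done:.6%}")
print(f"remaining if read as MiB: {remaining_if_mib} bytes")
```

Read as bytes, the outstanding amount is far too small for a file of almost 2 GB. Read as MiB, it comes out just under 2 GB, which is in line with the size of the file that was removed.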

Link to logs

N/A

Expected behavior

The progress bar should display numbers that are correctly proportioned.

Additional context

This most likely happens because the code that reports progress is inconsistent about the units it uses. Two call sites report sizes in bytes, while a third reports them in megabytes.
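To illustrate the suspected mechanism, here is a minimal simulation. It does not use TFDS code: `ProgressBar` and `simulate` are hypothetical stand-ins, the file sizes are made up, and the assumption is that already-downloaded files are added to the bar in bytes while the file still being fetched is added in MiB.

```python
MIB = 1024 * 1024


class ProgressBar:
    """Hypothetical stand-in for a progress bar: a running count and a total."""

    def __init__(self):
        self.current = 0
        self.total = 0

    def update_total(self, n):
        self.total += n

    def update(self, n):
        self.current += n

    @property
    def fraction(self):
        return self.current / self.total if self.total else 0.0


def simulate(cached_bytes, missing_bytes, downloaded_bytes, consistent):
    """Report one cached file and one partially downloaded file to a single bar."""
    bar = ProgressBar()
    # The cached file is reported in bytes and counted as fully done.
    bar.update_total(cached_bytes)
    bar.update(cached_bytes)
    # Assumption: the missing file is reported in MiB unless `consistent` is set.
    unit = 1 if consistent else MIB
    bar.update_total(missing_bytes // unit)
    bar.update(downloaded_bytes // unit)
    return bar


# Made-up sizes: 1.8 GB already cached, 2 GB missing, nothing downloaded yet.
mixed = simulate(1_800_000_000, 2_000_000_000, 0, consistent=False)
fixed = simulate(1_800_000_000, 2_000_000_000, 0, consistent=True)

print(f"mixed units: {mixed.current}/{mixed.total} ({mixed.fraction:.4%})")
print(f"bytes only:  {fixed.current}/{fixed.total} ({fixed.fraction:.4%})")
```

With mixed units the bar starts out almost full, because the missing file contributes only a few thousand units to a total measured in billions. Reporting everything in bytes gives a fraction that matches the actual share of data already on disk.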

Conchylicultor commented 2 years ago

Thanks for reporting. Indeed, https://github.com/tensorflow/datasets/blob/47baec10e957bffbd52c960cfce1ad31c60e04dd/tensorflow_datasets/core/download/downloader.py#L138 should be updated to use the correct unit. Don't hesitate to send a PR.