openvinotoolkit / datumaro

Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.
https://openvinotoolkit.github.io/datumaro/
MIT License

PascalVOC 2012 Download Error #1365

Closed by alfieroddan 7 months ago

alfieroddan commented 7 months ago

Downloading Pascal VOC 2012 fails: the downloaded data appears to be corrupted (checksum failure). See below:

System Software Overview:

System Version: macOS 13.4.1 (c) (22F770820d)
Kernel Version: Darwin 22.5.0

Steps to reproduce:

python3.9 -m venv env
source env/bin/activate
pip install geti-sdk
pip install datumaro[tf,tfds,default]
datum project create
mkdir pvoc/
datum download get -i tfds:voc/2012 -o pvoc
(env) ~/Documents/end-to-end$ datum download get -i tfds:voc/2012 -o pvoc
2024-03-20 17:53:13,376 INFO: Downloading the dataset
2024-03-20 17:53:13,862 INFO: Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: voc/2012/4.0.0
2024-03-20 17:53:15,028 INFO: Load dataset info from /var/folders/wk/k02s52zj5ls80jdjqwbpj9f40000gn/T/tmprab7ltpatfds
2024-03-20 17:53:15,038 INFO: For 'voc/2012/4.0.0': fields info.[description, config_name, config_description, citation, splits, module_name] differ on disk and in the code. Keeping the one from code.
2024-03-20 17:53:15,039 INFO: Generating dataset voc (/Users/alfie/tensorflow_datasets/voc/2012/4.0.0)
Downloading and preparing dataset 3.59 GiB (download: 3.59 GiB, generated: Unknown size, total: 3.59 GiB) to /Users/alfie/tensorflow_datasets/voc/2012/4.0.0...
2024-03-20 17:53:15,399 INFO: Downloading http://pjreddie.com/media/files/VOC2012test.tar into /Users/alfie/tensorflow_datasets/downloads/pjreddie.com_media_files_VOC2012testSc40bSzLCI9xUDzIh-tQzx9zaTEmX0PKoAD-berNkm0.tar.tmp.c9883184834b43eaa52a5a24a1e0ecd4...
2024-03-20 17:53:15,402 INFO: Downloading http://pjreddie.com/media/files/VOCtrainval_11-May-2012.tar into /Users/alfie/tensorflow_datasets/downloads/pjredd.com_media_files_VOCtra_11-May-2012_SKxIa1HVZJ7bc2YyTlJDGn_RJHSyFnsUqChVOhOMZs.tar.tmp.1bb981155ca145989a7a82080c38a690...
Extraction completed...: 0 file [10:26, ? file/s]
Dl Size...:  26%|████████████████████████▎                                                                   | 969/3671 [10:26<29:08,  1.55 MiB/s]
Dl Completed...:  50%|█████████████████████████████████████████████▌                                             | 1/2 [10:26<10:26, 626.90s/ url]
2024-03-20 18:03:42,297 ERROR: Artifact http://pjreddie.com/media/files/VOC2012test.tar, downloaded to /Users/alfie/tensorflow_datasets/downloads/pjreddie.com_media_files_VOC2012testSc40bSzLCI9xUDzIh-tQzx9zaTEmX0PKoAD-berNkm0.tar.tmp.c9883184834b43eaa52a5a24a1e0ecd4/VOC2012test.tar, has wrong checksum:
* Expected: UrlInfo(size=1.72 GiB, checksum='f08582b1935816c5eab3bbb1eb6d06201a789eaa173cdf1cf400c26f0cac2fb3', filename='VOC2012test.tar')
* Got: UrlInfo(size=500.25 MiB, checksum='95c6c7d4bca6f0bc5968cd453f932aefd937a1b952007065ec19743b9c3c2fb6', filename='VOC2012test.tar')
To debug, see: https://www.tensorflow.org/datasets/overview#fixing_nonmatchingchecksumerror
Traceback (most recent call last):
  File "/Users/alfie/Documents/end-to-end/env/bin/datum", line 8, in <module>
    sys.exit(main())
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/datumaro/cli/__main__.py", line 150, in main
    retcode = args.command(args)
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/datumaro/cli/commands/download.py", line 185, in download_command
    extractor = extractor_factory()
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/datumaro/components/extractor_tfds.py", line 573, in make_extractor
    return _TfdsExtractor(self._tfds_ds_name)
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/datumaro/components/extractor_tfds.py", line 489, in __init__
    tfds_builder.download_and_prepare()
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/core/logging/__init__.py", line 166, in __call__
    return function(*args, **kwargs)
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 691, in download_and_prepare
    self._download_and_prepare(
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1547, in _download_and_prepare
    split_generators = self._split_generators(  # pylint: disable=unexpected-keyword-arg
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/object_detection/voc.py", line 199, in _split_generators
    paths = dl_manager.download_and_extract(
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 688, in download_and_extract
    return _map_promise(self._download_extract, url_or_urls)
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 831, in _map_promise
    res = tree_utils.map_structure(
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tree/__init__.py", line 435, in map_structure
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tree/__init__.py", line 435, in <listcomp>
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 832, in <lambda>
    lambda p: p.get(), all_promises
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/promise/promise.py", line 512, in get
    return self._target_settled_value(_raise=True)
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/promise/promise.py", line 516, in _target_settled_value
    return self._target()._settled_value(_raise)
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/promise/promise.py", line 226, in _settled_value
    reraise(type(raise_val), raise_val, self._traceback)
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/promise/promise.py", line 87, in try_catch
    return (handler(*args, **kwargs), None)
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 408, in <lambda>
    lambda dl_result: self._register_or_validate_checksums(  # pylint: disable=g-long-lambda
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 465, in _register_or_validate_checksums
    _validate_checksums(
  File "/Users/alfie/Documents/end-to-end/env/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 809, in _validate_checksums
    raise NonMatchingChecksumError(msg)
tensorflow_datasets.core.download.download_manager.NonMatchingChecksumError: Artifact http://pjreddie.com/media/files/VOC2012test.tar, downloaded to /Users/alfie/tensorflow_datasets/downloads/pjreddie.com_media_files_VOC2012testSc40bSzLCI9xUDzIh-tQzx9zaTEmX0PKoAD-berNkm0.tar.tmp.c9883184834b43eaa52a5a24a1e0ecd4/VOC2012test.tar, has wrong checksum:
* Expected: UrlInfo(size=1.72 GiB, checksum='f08582b1935816c5eab3bbb1eb6d06201a789eaa173cdf1cf400c26f0cac2fb3', filename='VOC2012test.tar')
* Got: UrlInfo(size=500.25 MiB, checksum='95c6c7d4bca6f0bc5968cd453f932aefd937a1b952007065ec19743b9c3c2fb6', filename='VOC2012test.tar')
To debug, see: https://www.tensorflow.org/datasets/overview#fixing_nonmatchingchecksumerror

Any help would be greatly appreciated.
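For anyone reading the log above: the "Expected" and "Got" lines report different sizes as well as different checksums. A rough sketch of the arithmetic follows, assuming the sizes are printed in binary units (GiB, MiB) as shown. The helper `to_bytes` is a hypothetical name used only for illustration. One possible reading of the result is that the transfer stopped early, but the log alone does not confirm that.

```python
# Sizes copied from the "Expected" and "Got" lines of the error message.
# Assumption: the units are binary (1 GiB = 1024 MiB), as the suffixes suggest.
UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def to_bytes(text: str) -> float:
    """Convert a string such as '1.72 GiB' to a byte count (hypothetical helper)."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


expected = to_bytes("1.72 GiB")   # size TFDS expected for VOC2012test.tar
got = to_bytes("500.25 MiB")      # size of the file actually received
fraction = got / expected

print(f"received about {fraction:.0%} of the expected archive")
```

The received file is well under a third of the expected size, so a checksum mismatch is unavoidable regardless of whether the bytes that did arrive are intact.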

alfieroddan commented 7 months ago

This is possibly a TensorFlow Datasets issue rather than a Datumaro one; apologies if so.

alfieroddan commented 7 months ago

Looks like it is a problem with TensorFlow Datasets:

https://www.tensorflow.org/datasets/overview#fixing_nonmatchingchecksumerror

https://github.com/tensorflow/datasets/issues/5232

I'll close this issue, but it might be worth adding a warning to the docs :).
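If the archive is downloaded by hand instead, a check along these lines could confirm it is complete before handing it to any tool. This is a minimal sketch, assuming the 64-character hex string in the error message is a SHA-256 digest. The function names `sha256_of` and `matches_expected` and the file path in the usage note are hypothetical and are not part of Datumaro or TensorFlow Datasets.

```python
import hashlib

# Checksum copied from the "Expected" line of the error message.
# Assumption: it is a SHA-256 hex digest (64 hex characters).
EXPECTED_SHA256 = "f08582b1935816c5eab3bbb1eb6d06201a789eaa173cdf1cf400c26f0cac2fb3"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_sha256=EXPECTED_SHA256):
    """Return True if the file's SHA-256 digest equals the expected hex string."""
    return sha256_of(path) == expected_sha256.lower()
```

For example, `matches_expected("VOC2012test.tar")` (a hypothetical local path) would return `False` for a truncated file like the one in the log.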

wonjuleee commented 7 months ago

@alfieroddan, thank you for letting us know about this issue and for your interest in Datumaro :)