pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

MD5 checksum mismatch error: Caltech dataset #8220

Closed woongjoonchoi closed 8 months ago

woongjoonchoi commented 9 months ago

🐛 Describe the bug

I got an MD5 checksum mismatch error when downloading the Caltech dataset. I subclass the Caltech256 dataset and override its __getitem__ method:

import os

from PIL import Image
from torchvision.datasets import Caltech256


class Mydataset(Caltech256):
    def __getitem__(self, index: int):
        """
        Args:
            index (int): Index

        Returns:
            tuple: (image, target) where target is index of the target class.
        """
        img = Image.open(
            os.path.join(
                self.root,
                "256_ObjectCategories",
                self.categories[self.y[index]],
                f"{self.y[index] + 1:03d}_{self.index[index]:04d}.jpg",
            )
        )
        # print(img.shape)
        target = self.y[index]

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target

test_caltech = Mydataset(root='/content',download=True)
2436it [00:00, 17288197.20it/s]
/usr/local/lib/python3.10/dist-packages/torchvision/datasets/utils.py:260: UserWarning: We detected some HTML elements in the downloaded file. This most likely means that the download triggered an unhandled API response by GDrive. Please report this to torchvision at https://github.com/pytorch/vision/issues including the response:

<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="lkHmQisUsdKAaMG6bK8sIg">.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}sentinel{}</style><link rel="icon" href="//ssl.gstatic.com/docs/doclist/images/drive_2022q3_32dp.png"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=1r6o0pSROcV1_VwT4oSjA2FBUSCWGuxLK">256_ObjectCategories.tar</a> (1.1G)</span> is too large for Google to scan for viruses. 
Would you still like to download this file?</p><form id="download-form" action="https://drive.usercontent.google.com/download" method="get"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/><input type="hidden" name="id" value="1r6o0pSROcV1_VwT4oSjA2FBUSCWGuxLK"><input type="hidden" name="export" value="download"><input type="hidden" name="confirm" value="t"><input type="hidden" name="uuid" value="43ee465e-8ada-4333-88c1-2962c6ff4887"></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>
  warnings.warn(
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-670dc68f2b31> in <cell line: 1>()
----> 1 test_caltech = Mydataset(root='/content',download=True)

4 frames
/usr/local/lib/python3.10/dist-packages/torchvision/datasets/caltech.py in __init__(self, root, transform, target_transform, download)
    172 
    173         if download:
--> 174             self.download()
    175 
    176         if not self._check_integrity():

/usr/local/lib/python3.10/dist-packages/torchvision/datasets/caltech.py in download(self)
    230             return
    231 
--> 232         download_and_extract_archive(
    233             "https://drive.google.com/file/d/1r6o0pSROcV1_VwT4oSjA2FBUSCWGuxLK",
    234             self.root,

/usr/local/lib/python3.10/dist-packages/torchvision/datasets/utils.py in download_and_extract_archive(url, download_root, extract_root, filename, md5, remove_finished)
    432         filename = os.path.basename(url)
    433 
--> 434     download_url(url, download_root, filename, md5)
    435 
    436     archive = os.path.join(download_root, filename)

/usr/local/lib/python3.10/dist-packages/torchvision/datasets/utils.py in download_url(url, root, filename, md5, max_redirect_hops)
    137         file_id = _get_google_drive_file_id(url)
    138         if file_id is not None:
--> 139             return download_file_from_google_drive(file_id, root, filename, md5)
    140 
    141         # download the file

/usr/local/lib/python3.10/dist-packages/torchvision/datasets/utils.py in download_file_from_google_drive(file_id, root, filename, md5)
    266 
    267     if md5 and not check_md5(fpath, md5):
--> 268         raise RuntimeError(
    269             f"The MD5 checksum of the download file {fpath} does not match the one on record."
    270             f"Please delete the file and try again. "

RuntimeError: The MD5 checksum of the download file /content/caltech256/256_ObjectCategories.tar does not match the one on record.Please delete the file and try again. If the issue persists, please report this to torchvision at https://github.com/pytorch/vision/issues.

I got an MD5 checksum error when downloading the Caltech dataset.
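
One way to check what actually landed on disk, as a minimal sketch using only the Python standard library: the warning above suggests the saved file may be the Google Drive warning page rather than the archive, in which case its MD5 could not match the one on record. The helper names (md5_of, looks_like_html, diagnose) are hypothetical and not part of torchvision.

```python
import hashlib
import tarfile


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_html(path, probe_bytes=512):
    """Heuristic: does the file start with an HTML doctype or <html> tag?"""
    with open(path, "rb") as f:
        head = f.read(probe_bytes).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def diagnose(path):
    """Classify a downloaded file as 'html', 'tar', or 'unknown'."""
    if looks_like_html(path):
        return "html"
    if tarfile.is_tarfile(path):
        return "tar"
    return "unknown"
```

For the run above, calling diagnose("/content/caltech256/256_ObjectCategories.tar") and getting "html" would point to the download step rather than to the checksum on record.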

Versions

--2024-01-18 08:04:29--  https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22068 (22K) [text/plain]
Saving to: ‘collect_env.py’

collect_env.py      100%[===================>]  21.55K  --.-KB/s    in 0.03s   

2024-01-18 08:04:30 (619 KB/s) - ‘collect_env.py’ saved [22068/22068]

Collecting environment information...
PyTorch version: 2.1.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.27.9
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.58+-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             2
On-line CPU(s) list:                0,1
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family:                         6
Model:                              79
Thread(s) per core:                 2
Core(s) per socket:                 1
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           4400.43
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          32 KiB (1 instance)
L1i cache:                          32 KiB (1 instance)
L2 cache:                           256 KiB (1 instance)
L3 cache:                           55 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0,1
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable; SMT Host state unknown
Vulnerability Meltdown:             Vulnerable
Vulnerability Mmio stale data:      Vulnerable
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:           Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Vulnerable

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==2.1.0+cu121
[pip3] torchaudio==2.1.0+cu121
[pip3] torchdata==0.7.0
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.16.0
[pip3] torchvision==0.16.0+cu121
[pip3] triton==2.1.0
[conda] Could not collect

tamrobb commented 9 months ago

I saw the same error. This script reproduces it:

from torchvision import datasets
datasets.Caltech101(root='.', download=True)

NicolasHug commented 9 months ago

Thanks for the report.

It seems related to https://github.com/pytorch/vision/issues/8204#issuecomment-1891755665 and the root cause is probably not an MD5 issue but rather a change in GDrive's APIs.

We're considering options here. We will probably add gdown as an optional dependency, as we can't really afford to maintain all the different GDrive APIs within torchvision. We'll keep you posted, and sorry for the inconvenience.
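
To illustrate what handling this GDrive response by hand might involve, here is a rough sketch that reads the "Download anyway" form from the warning page quoted in the report and builds the follow-up URL from its hidden fields. It assumes the page keeps the structure shown in that report (a form with id download-form). DownloadFormParser and confirmed_download_url are hypothetical names, and this is not how torchvision or gdown implement it.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class DownloadFormParser(HTMLParser):
    """Collect the action URL and hidden inputs of the form with id="download-form"."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.params = {}
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == "download-form":
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("type") == "hidden":
            name = attrs.get("name")
            if name is not None:
                self.params[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def confirmed_download_url(warning_html):
    """Return the 'Download anyway' URL from a warning page, or None if no form is found."""
    parser = DownloadFormParser()
    parser.feed(warning_html)
    if parser.action is None:
        return None
    # Hidden fields (id, export, confirm, ...) become the query string.
    return f"{parser.action}?{urlencode(parser.params)}"
```

Fetching the resulting URL is left out on purpose, since GDrive's behaviour may change again. Relying on a maintained downloader such as gdown, as suggested above, is likely the less fragile route.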

Bhavay-2001 commented 9 months ago

Hi @NicolasHug, is this issue still open? If yes, I would be happy to help. If not, could you please point me to another issue to work on? Thanks.

Bhavay-2001 commented 8 months ago

Hi @NicolasHug, is there any beginner-friendly issue you can recommend?

braindevices commented 5 months ago

I get the same thing