pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Flaky behavior when downloading Google drive files #2992

Closed jgbradley1 closed 3 years ago

jgbradley1 commented 3 years ago

🐛 Bug

This is a rather difficult bug to diagnose because certain internet activity must be present. The issue is with torchvision.datasets.utils.download_file_from_google_drive(). It does not gracefully handle large files that have exceeded their daily download quota.

To Reproduce

The following two prerequisites must be met in order to detect this issue.

from torchvision.datasets.utils import download_file_from_google_drive
# use the WIDERFACE training data file as an example
file_id = '0B6eKvaijfFUDQUUwd21EckhUbWs'
root = 'data_folder'
filename = 'WIDER_train.zip'
md5 = '3fedf70df600953d25982bcd13d91ba2'
download_file_from_google_drive(file_id, root, filename, md5)

Running the snippet above leads to the Python session getting killed.

The python process hangs on the call to torchvision.datasets.utils._quota_exceeded(...). My best guess is the code in this function is performing a string search that is either inefficient or causing python to search the entire data payload (resulting in a timeout).

def _quota_exceeded(response: "requests.models.Response") -> bool:  # type: ignore[name-defined]
    return "Google Drive - Quota exceeded" in response.text

Expected behavior

Calling download_file_from_google_drive(...) should not kill the session when download quota thresholds have been met on large files.

Environment

PyTorch version: 1.8.0.dev20201021
Is debug build: True
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 8.4.0-3ubuntu2) 8.4.0
Clang version: 10.0.1-++20200708122807+ef32c611aa2-1~exp1~20200707223407.61 
CMake version: version 3.18.1

Python version: 3.8 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: 10.1.105
GPU models and configuration: GPU 0: GeForce GTX 1650
Nvidia driver version: 450.80.02
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.8.0.dev20201021
[pip3] torchvision==0.9.0a0+9984146
[conda] blas                      1.0                         mkl  
[conda] cpuonly                   1.0                           0    pytorch-nightly
[conda] cudatoolkit               10.1.243             h6bb024c_0  
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.2.0            py38h23d657b_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.19.1           py38hbc911f0_0  
[conda] numpy-base                1.19.1           py38hfa32c7d_0  
[conda] pytorch                   1.8.0.dev20201021     py3.8_cpu_0  [cpuonly]  pytorch-nightly
[conda] torchvision               0.8.0a0+1fbd0b7          pypi_0    pypi

cc @pmeier

pmeier commented 3 years ago

causing python to search the entire data payload (resulting in a timeout).

IMO, this could be the offender. If I remember correctly, the "Google Drive - Quota exceeded" string is at the beginning of the payload. Thus, we should switch to something like response.iter_content and check only the first chunk for the string.
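To make the idea concrete, here is a minimal sketch of a first-chunk check. It is not the torchvision implementation: FakeStreamedResponse is a hypothetical stand-in for a requests.Response opened with stream=True, and quota_exceeded_in_first_chunk and the 32 KiB chunk size are illustrative assumptions. The only requests API it relies on is the iter_content(chunk_size) method.

```python
from typing import Iterator

QUOTA_MARKER = b"Google Drive - Quota exceeded"


class FakeStreamedResponse:
    """Hypothetical stand-in for a requests.Response opened with stream=True."""

    def __init__(self, body: bytes) -> None:
        self._body = body
        self.bytes_read = 0  # how much of the payload has been pulled so far

    def iter_content(self, chunk_size: int = 1) -> Iterator[bytes]:
        # Yield the body lazily, chunk by chunk, like a streamed download.
        for start in range(0, len(self._body), chunk_size):
            chunk = self._body[start:start + chunk_size]
            self.bytes_read += len(chunk)
            yield chunk


def quota_exceeded_in_first_chunk(response, chunk_size: int = 32 * 1024) -> bool:
    """Search only the first chunk of the payload for the quota marker."""
    first_chunk = next(response.iter_content(chunk_size), b"")
    return QUOTA_MARKER in first_chunk


html_page = FakeStreamedResponse(
    b"<!DOCTYPE html><html><head><title>Google Drive - Quota exceeded</title>"
)
big_file = FakeStreamedResponse(b"\x00" * (1024 * 1024))  # 1 MiB dummy payload

print(quota_exceeded_in_first_chunk(html_page))  # True
print(quota_exceeded_in_first_chunk(big_file))   # False
print(big_file.bytes_read)                       # 32768
```

Note that this sketch consumes the first chunk, so a real fix would still have to write that chunk to disk when the quota is not exceeded.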

@jgbradley1 Right now, https://docs.google.com/uc?export=download&id=0B6eKvaijfFUDQUUwd21EckhUbWs&confirm=o7Z2 (the URL we request from your sample above) does not trigger the daily-quota-exceeded response. Could you ping me as fast as possible if you see this again, so I can take a look?

KonstantinKhabarlak commented 3 years ago

@pmeier on MiniImageNet I can reproduce this issue every time in my environment (the Google Drive daily quota shouldn't be exceeded). Code:

from torchvision.datasets.utils import download_file_from_google_drive

if __name__ == '__main__':
    print("Start")
    folder = './miniimagenet'
    gdrive_id = '16V_ZlkW4SsnNDtnGmaBRq2OoPmUOc5mY'
    gz_filename = 'mini-imagenet.tar.gz'
    gz_md5 = 'b38f1eb4251fb9459ecc8e7febf9b2eb'

    download_file_from_google_drive(gdrive_id, folder, gz_filename, md5=gz_md5)
    print("Done")

This code downloads the MiniImageNet dataset (~1 GB of data). With _quota_exceeded as currently implemented, the entire download happens inside the "Google Drive - Quota exceeded" in response.text line and no progress bar is shown. Then, while the debugger still points to that line, I seem to get a memory leak (more than 20 GB of RAM is used), so I just kill the process. The target file never appears on disk.

If I comment out _quota_exceeded, I immediately get a progress bar and the file is downloaded. The issue can be reproduced on both Windows 10 and Ubuntu 20.04 with PyTorch 1.7, TorchVision 0.8.0 and requests-2.25.0.

jgbradley1 commented 3 years ago

@pmeier I can confirm I'm seeing the same behavior as @KhabarlakKonstantin for mini imagenet. At this time, the mini imagenet file has not exceeded the threshold.

The real issue here then is how we're checking for a quota exceeded error. A string search is just not going to be a viable solution for large files (1+ GB). According to the Google Drive API docs, the http status code should be 403 when the quota threshold has truly been met.

Modifying _quota_exceeded to check the status code first could resolve the immediate issue (downloading large google drive files that have not met their download threshold). A 403 status is used for other reasons as well if you look through the documentation so we probably still want to leave the string search in for now (unless someone has a better suggestion).

def _quota_exceeded(response: "requests.models.Response") -> bool:  # type: ignore[name-defined]
    return (response.status_code == 403) and ("Google Drive - Quota exceeded" in response.text)

Using my modified function above, I am able to download mini imagenet and see the progress bar.
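The reason this helps is that Python's and operator short-circuits: when the status code is not 403, response.text is never evaluated, so the body is never read. The sketch below illustrates that with a hypothetical FakeResponse class that records whether .text was touched. It is a stand-in for illustration only, not a real requests object.

```python
class FakeResponse:
    """Hypothetical stand-in for requests.Response that records .text access."""

    def __init__(self, status_code: int, body: str) -> None:
        self.status_code = status_code
        self._body = body
        self.text_accessed = False

    @property
    def text(self) -> str:
        # With a real streamed response, this is where the full body is read.
        self.text_accessed = True
        return self._body


def quota_exceeded(response) -> bool:
    # The right-hand side only runs when the status code is 403.
    return (response.status_code == 403) and (
        "Google Drive - Quota exceeded" in response.text
    )


ok = FakeResponse(200, "binary payload placeholder")
blocked = FakeResponse(403, "<title>Google Drive - Quota exceeded</title>")

print(quota_exceeded(ok), ok.text_accessed)            # False False
print(quota_exceeded(blocked), blocked.text_accessed)  # True True
```

The trade-off is that a quota page served with a 200 status would go undetected by this check.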

jgbradley1 commented 3 years ago

The mini imagenet file has now exceeded the threshold for today when downloading programmatically. Here are some observations of what I can see.

Output of print(response.text) is

<!DOCTYPE html><html><head><title>Google Drive - Quota exceeded</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><link href=&#47;static&#47;doclist&#47;client&#47;css&#47;148676949&#45;untrustedcontent.css rel="stylesheet"><link rel="icon" href="//ssl.gstatic.com/images/branding/product/1x/drive_2020q4_32dp.png"/><style nonce="cpI7ZCLZSoij8xh5gvcKrw">#gbar,#guser{font-size:13px;padding-top:0px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}
</style><script nonce="cpI7ZCLZSoij8xh5gvcKrw"></script></head><body><div id=gbar><nobr><a target=_blank class=gb1 href="https://www.google.com/webhp?tab=ow">Search</a> <a target=_blank class=gb1 href="http://www.google.com/imghp?hl=en&tab=oi">Images</a> <a target=_blank class=gb1 href="https://maps.google.com/maps?hl=en&tab=ol">Maps</a> <a target=_blank class=gb1 href="https://play.google.com/?hl=en&tab=o8">Play</a> <a target=_blank class=gb1 href="https://www.youtube.com/?gl=US&tab=o1">YouTube</a> <a target=_blank class=gb1 href="https://news.google.com/?tab=on">News</a> <a target=_blank class=gb1 href="https://mail.google.com/mail/?tab=om">Gmail</a> <b class=gb1>Drive</b> <a target=_blank class=gb1 style="text-decoration:none" href="https://www.google.com/intl/en/about/products?tab=oh"><u>More</u> &raquo;</a></nobr></div><div id=guser width=100%><nobr><span id=gbn class=gbi></span><span id=gbf class=gbf></span><span id=gbe></span><a target="_self" href="/settings?hl=en_US" class=gb4>Settings</a> | <a target=_blank  href="//support.google.com/drive/?p=web_home&hl=en_US" class=gb4>Help</a> | <a target=_top id=gb_70 href="https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://docs.google.com/uc%3Fexport%3Ddownload%26id%3D16V_ZlkW4SsnNDtnGmaBRq2OoPmUOc5mY&service=writely&ec=GAZAMQ" class=gb4>Sign in</a></nobr></div><div class=gbh style=left:0></div><div class=gbh style=right:0></div><div class="uc-main"><div id="uc-text"><p class="uc-error-caption">Sorry, you can&#39;t view or download this file at this time.</p><p class="uc-error-subcaption">Too many users have viewed or downloaded this file recently. Please try accessing the file again later. If the file you are trying to access is particularly large or is shared with many people, it may take up to 24 hours to be able to view or download the file. 
If you still can't access a file after 24 hours, contact your domain administrator.</p></div></div><div class="uc-footer"><hr class="uc-footer-divider">&copy; 2020 Google - <a class="goog-link" href="//support.google.com/drive/?p=web_home">Help</a> - <a class="goog-link" href="//support.google.com/drive/bin/answer.py?hl=en_US&amp;answer=2450387">Privacy & Terms</a></div></body></html>

I can still manually download the file from here using a browser. Perhaps Google Drive allows more browser-initiated downloads than programmatic-initiated downloads.

The most interesting behavior here is that the reported status code switches randomly between 403 and 200. It is not always consistent with the GDrive API documentation.

Final conclusion: the function _quota_exceeded works great only after a file has exceeded the daily quota threshold.

pmeier commented 3 years ago

@jgbradley1

A string search is just not going to be a viable solution for large files (1+ GB)

True. I didn't think of this when I implemented the check. We should use response.iter_content to get the first chunk and only search in there. I'm not sure if this interferes with the saving of the file if the quota is not exceeded. I'm on it.
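One way to keep the saving step intact is to peek at the first chunk and then chain it back in front of the remaining chunks. The following is a rough sketch under that assumption, not the eventual torchvision fix: peek_first_chunk and save_unless_quota_exceeded are hypothetical helper names, and the chunk iterator is assumed to come from something like response.iter_content(32 * 1024) on a response opened with stream=True.

```python
import itertools
import os
import tempfile
from typing import Iterable, Iterator, Tuple

QUOTA_MARKER = b"Google Drive - Quota exceeded"


def peek_first_chunk(chunks: Iterable[bytes]) -> Tuple[bytes, Iterator[bytes]]:
    """Return the first chunk plus an iterator that still yields every chunk."""
    iterator = iter(chunks)
    first = next(iterator, b"")
    return first, itertools.chain([first], iterator)


def save_unless_quota_exceeded(chunks: Iterable[bytes], path: str) -> None:
    """Write the chunks to path unless the first chunk is the quota page."""
    first, all_chunks = peek_first_chunk(chunks)
    if QUOTA_MARKER in first:
        raise RuntimeError("Google Drive daily quota exceeded; try again later.")
    with open(path, "wb") as fh:
        for chunk in all_chunks:
            fh.write(chunk)


# Demo with in-memory chunks instead of a real download.
payload = [b"abc", b"def", b"ghi"]
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out.bin")
    save_unless_quota_exceeded(iter(payload), target)
    with open(target, "rb") as fh:
        print(fh.read())  # b'abcdefghi'
```

Because the first chunk is re-yielded by the chained iterator, the file on disk is complete, and the quota page is never written when the marker is found.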

the http status code should be 403 when the quota threshold has truly been met

This was the first thing I checked and I always got 200 back. I only went for checking the response content after that.

the function _quota_exceeded works great only after a file has exceeded the daily quota threshold.

True again, but since it makes normal operation impossible, this is no solution. I'm going to revert the commit for now and send a proper fix later.

ain-soph commented 3 years ago

I'm still suffering from this problem when downloading the CUB_200_2011 dataset from https://drive.google.com/file/d/1hbzc_P1FuxMkcabkgn9ZKinBwW683j45

Downloading from a browser (Chrome) works fine.

jgbradley1 commented 3 years ago

@ain-soph Have you tried the nightly version? The problematic code was temporarily disabled here

https://github.com/pytorch/vision/pull/3035

Those changes have not made it into a release yet.

ain-soph commented 3 years ago

@jgbradley1 I see, thanks for your notice!

ORippler commented 3 years ago

@pmeier want to close this since #4109 was merged? Or leave it open to track and properly fix the Google Drive download in the long term by making use of response.status_code once Google adheres to its API?

pmeier commented 3 years ago

@ORippler, this was fixed in #4109 so it should be closed. Thanks for the ping!