glowabio / hydrographr

https://glowabio.github.io/hydrographr/
GNU General Public License v3.0

get_regional_unit_id(): Download too large (for virus scan), need to manually download #45

Closed (merretbuurman closed this issue 4 months ago)

merretbuurman commented 8 months ago

When I used get_regional_unit_id(), the function did not download the regional_unit_ovr.tif file itself. Instead, it downloaded an HTML file containing this question:

Google Drive can't scan this file for viruses. regional_unit_ovr.tif (118M) is too large for Google to scan for viruses. Would you still like to download this file?

(in German: Google Drive kann keinen Virenscan für diese Datei durchführen. regional_unit_ovr.tif (118M) ist zu groß und kann von Google nicht auf Viren geprüft werden. Möchten Sie die Datei trotzdem herunterladen?)

So the function fails with the error ERROR 4: '/tmp/Rtmp8KQI/regional_unit_ovr.tif' not recognized as a supported file format.

I have not yet checked how to prevent this, but I added a check to the function, plus instructions for downloading the file manually and placing it in the right directory (see this branch: https://github.com/merretbuurman/hydrographr/tree/dev_sugar_regional_unit_id ). I'll push or open a PR as soon as I have figured out how to build the package.

Details of the returned file:

<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="eoQq2YWQ-WS0zadDfQUpVQ">.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}sentinel{}</style><link rel="icon" href="//ssl.gstatic.com/docs/doclist/images/drive_2022q3_32dp.png"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=1ykV0jRCglz-_fdc4CJDMZC87VMsxzXE4">regional_unit_ovr.tif</a> (118M)</span> is too large for Google to scan for viruses. 
Would you still like to download this file?</p><form id="download-form" action="https://drive.usercontent.google.com/download" method="get"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/><input type="hidden" name="id" value="1ykV0jRCglz-_fdc4CJDMZC87VMsxzXE4"><input type="hidden" name="export" value="download"><input type="hidden" name="confirm" value="t"><input type="hidden" name="uuid" value="ef3ce457-d831-4c5c-b3e8-e4ed01838c28"></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>
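The returned file above is a small HTML page, not a GeoTIFF, so the problem can be caught before the raster is opened. The following is a minimal sketch of such a check, not the code in the linked branch; `inspect_download()`, its `min_bytes` threshold and the matched strings are assumptions made for illustration.

```r
# Hypothetical helper (not part of hydrographr): inspect a downloaded file
# and report whether it looks like the Google Drive warning page.
inspect_download <- function(path, min_bytes = 1e6) {
  size <- file.size(path)
  # Read the first 4 kB only; the warning page in this issue is 2433 bytes.
  bytes <- readBin(path, what = "raw", n = 4096)
  bytes <- bytes[bytes != as.raw(0)]  # rawToChar() rejects embedded nuls
  head_txt <- rawToChar(bytes)
  list(
    too_small = size < min_bytes,
    is_html = grepl("<!DOCTYPE html", head_txt, fixed = TRUE, useBytes = TRUE),
    # English-only match; a localized page would not contain this string.
    asks_confirmation = grepl("Virus scan warning", head_txt,
                              fixed = TRUE, useBytes = TRUE)
  )
}

# Offline demo with a shortened copy of the page shown above.
tmp <- tempfile(fileext = ".tif")
writeLines(paste0("<!DOCTYPE html><html><head>",
                  "<title>Google Drive - Virus scan warning</title>",
                  "</head></html>"), tmp)
res <- inspect_download(tmp)
str(res)
```

All three flags are set for the demo file. The size threshold is a guess; any value well below the real file size (118M) and well above a few kilobytes would separate the two cases.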

merretbuurman commented 8 months ago

TODO: Check whether we can prevent this behaviour. Just add the file to Nimbus?

TODO: My solution may fail if the text is returned in a language other than English. We may need to specify the locale.
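One way to sidestep the language problem in the second TODO would be to check what the file is rather than what the warning page says. A sketch, assuming a hypothetical helper `is_tiff()` (not part of hydrographr): every TIFF file starts with a fixed 4-byte signature, so a localized HTML page fails the check regardless of its wording.

```r
# Hypothetical helper: TRUE if the file starts with a TIFF or BigTIFF signature.
is_tiff <- function(path) {
  sig <- readBin(path, what = "raw", n = 4)
  if (length(sig) < 4) return(FALSE)
  signatures <- list(
    as.raw(c(0x49, 0x49, 0x2a, 0x00)),  # "II", 42: classic TIFF, little-endian
    as.raw(c(0x4d, 0x4d, 0x00, 0x2a)),  # "MM", 42: classic TIFF, big-endian
    as.raw(c(0x49, 0x49, 0x2b, 0x00)),  # "II", 43: BigTIFF, little-endian
    as.raw(c(0x4d, 0x4d, 0x00, 0x2b))   # "MM", 43: BigTIFF, big-endian
  )
  any(vapply(signatures, identical, logical(1), y = sig))
}

# Offline demo: a file with a TIFF header versus an HTML page.
good <- tempfile(fileext = ".tif")
writeBin(as.raw(c(0x49, 0x49, 0x2a, 0x00, 0x08, 0x00, 0x00, 0x00)), good)
bad <- tempfile(fileext = ".tif")
writeLines("<!DOCTYPE html><html>...</html>", bad)

is_tiff(good)  # TRUE
is_tiff(bad)   # FALSE
```

This only confirms the header, not that the raster is complete, so it complements a size check rather than replacing it.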

merretbuurman commented 8 months ago

I added the download link for the file on Nimbus, so that works now.

If Nimbus is available, the function downloads the file from there and uses it:

>
> get_regional_unit_id(df, "lon", "lat")
Downloading the global regional unit file to /tmp/RtmpOZY0OU/regional_unit_ovr.tif...
trying URL 'https://public.igb-berlin.de/index.php/s/agciopgzXjWswF4/download?path=%2Fglobal&files=regional_unit_ovr.tif'
Content type 'image/tiff' length 123262363 bytes (117.6 MB)
==================================================
downloaded 117.6 MB

[1] 58
> 

If Nimbus is not available, the function tries the Google Drive URL, which will likely fail, but the user then gets a hint on how to fix this manually:

> 
> get_regional_unit_id(df, "lon", "lat")
Downloading the global regional unit file to /tmp/RtmpOZY0OU/regional_unit_ovr.tif...
trying URL 'https://pppublic.igb-berlin.de/index.php/s/agciopgzXjWswF4/download?path=%2Fglobal&files=regional_unit_ovr.tif'
Download failed, reason:  URL 'https://pppublic.igb-berlin.de/index.php/s/agciopgzXjWswF4/download?path=%2Fglobal&files=regional_unit_ovr.tif': status was 'Couldn't resolve host name'
trying URL 'https://drive.google.com/uc?export=download&id=1ykV0jRCglz-_fdc4CJDMZC87VMsxzXE4&confirm=t'
Content type 'text/html; charset=utf-8' length 2433 bytes
==================================================
downloaded 2433 bytes

The file /tmp/RtmpOZY0OU/regional_unit_ovr.tif is only 2433 bytes, maybe the download went wrong.
The file /tmp/RtmpOZY0OU/regional_unit_ovr.tif contains text asking you whether to download, so the download definitely went wrong.
Error in get_regional_unit_id(df, "lon", "lat") : 
  Downloading the file "regional_unit_ovr.tif" went wrong, as you manually need to confirm skipping the virus check.
Please download manually at https://drive.google.com/uc?export=download&id=1ykV0jRCglz-_fdc4CJDMZC87VMsxzXE4&confirm=t and store to /tmp/RtmpOZY0OU/regional_unit_ovr.tif . Stopping.
> 
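The fallback behaviour shown in the two console sessions above (try Nimbus, then Google Drive, then stop with a hint) can be sketched as follows. This is an illustration under assumptions, not the merged implementation: `download_first_valid()`, the `is_valid` callback and the injectable `fetch` argument are hypothetical names, and the demo uses a stub with placeholder URLs instead of real network access.

```r
# Hypothetical helper: try each URL in order and keep the first download
# that passes the caller-supplied validity check.
download_first_valid <- function(urls, destfile, is_valid,
                                 fetch = utils::download.file) {
  for (u in urls) {
    ok <- tryCatch({
      status <- fetch(u, destfile, mode = "wb", quiet = TRUE)
      status == 0 && is_valid(destfile)
    }, error = function(e) {
      message("Download failed, reason: ", conditionMessage(e))
      FALSE
    })
    if (isTRUE(ok)) return(invisible(u))
    # Remove a broken file so that it is not mistaken for a valid one later.
    if (file.exists(destfile)) unlink(destfile)
  }
  stop("Could not download a valid file. Please download it manually ",
       "and store it at ", destfile, call. = FALSE)
}

# Offline demo: a stub fetcher stands in for download.file().
stub_fetch <- function(url, destfile, ...) {
  if (grepl("unreachable", url, fixed = TRUE)) stop("could not resolve host name")
  # The "drive" mirror returns an HTML page, any other mirror returns data.
  body <- if (grepl("drive", url, fixed = TRUE)) "<!DOCTYPE html>" else "raster bytes"
  writeLines(body, destfile)
  0L
}
not_html <- function(path) {
  !startsWith(readLines(path, n = 1, warn = FALSE), "<!DOCTYPE html")
}

dest <- tempfile(fileext = ".tif")
used <- download_first_valid(
  c("https://unreachable.example/file.tif", "https://nimbus.example/file.tif"),
  dest, is_valid = not_html, fetch = stub_fetch
)
print(used)
```

Passing the validity check in as a function keeps the loop independent of how a bad download is recognized, so a size test, a text match or a file-signature test could be swapped in without touching the fallback logic.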

The code is now here: https://github.com/merretbuurman/hydrographr/tree/dev_get_regional_unit_id_fix_download_bug

Once I have tested and linted it all properly, I will merge it to main.

merretbuurman commented 4 months ago

This has already been merged into main; see commit https://github.com/merretbuurman/hydrographr/commit/aa7fb4b40c45ba54cbbfa57667e70d6dec86f9e2 from 5 March.