schmitt-muc / SEN12MS

Repository for SEN12MS related codes and utilities

Unable to download dataset from command line #4

Open adamjstewart opened 3 years ago

adamjstewart commented 3 years ago

Hi, I'm working on a torchvision-style dataset that automatically downloads and checksums SEN12MS. I see that the dataset is hosted on https://dataserv.ub.tum.de/s/m1474000. However, when I try to download one of the files, I get an error message:

$ wget 'https://dataserv.ub.tum.de/s/m1474000/download?files=ROIs1158_spring_lc.tar.gz'
--2021-06-10 21:01:24--  https://dataserv.ub.tum.de/s/m1474000/download?files=ROIs1158_spring_lc.tar.gz
Resolving dataserv.ub.tum.de (dataserv.ub.tum.de)... 138.246.224.34, 2001:4ca0:800::8af6:e022
Connecting to dataserv.ub.tum.de (dataserv.ub.tum.de)|138.246.224.34|:443... connected.
ERROR: cannot verify dataserv.ub.tum.de's certificate, issued by ‘CN=DFN-Verein Global Issuing CA,OU=DFN-PKI,O=Verein zur Foerderung eines Deutschen Forschungsnetzes e. V.,C=DE’:
  Unable to locally verify the issuer's authority.
To connect to dataserv.ub.tum.de insecurely, use `--no-check-certificate'.

Clicking on the download button allows me to download through the web browser, but I would like to be able to download from the command line. Is this possible (without disabling security certificate checks)?

@calebrob6
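
For the certificate question above, a minimal Python sketch of keeping verification enabled while supplying an extra CA bundle. It assumes the failure comes from an issuer certificate missing from the local trust store, which the thread does not confirm. The `ca_bundle` path and the helper names `make_verifying_context` and `download` are hypothetical.

```python
import shutil
import ssl
import urllib.request
from typing import Optional


def make_verifying_context(ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """Build an SSL context that still verifies certificates and hostnames.

    ca_bundle is a hypothetical path to a PEM file holding the missing issuer
    certificate(s); when None, only the system trust store is used.
    """
    context = ssl.create_default_context()
    if ca_bundle is not None:
        # Add the extra CA certificates on top of the system defaults.
        context.load_verify_locations(cafile=ca_bundle)
    return context


def download(url: str, dest: str, ca_bundle: Optional[str] = None) -> None:
    """Stream url to dest over HTTPS with certificate checks enabled."""
    context = make_verifying_context(ca_bundle)
    with urllib.request.urlopen(url, context=context) as response:
        with open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
```

Whether this resolves the error depends on obtaining the right issuer chain, so treat it as something to try rather than a confirmed fix.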

adamjstewart commented 3 years ago

Current workaround pointed out by @calebrob6:

$ wget "ftp://m1474000:m1474000@dataserv.ub.tum.de/ROIs1158_spring_lc.tar.gz"

schmitt-muc commented 3 years ago

Sorry for the late reply! I would prefer rsync: "The data server also offers downloads with rsync (password m1474000): rsync rsync://m1474000@dataserv.ub.tum.de/m1474000/"

adamjstewart commented 3 years ago

Hi @schmitt-muc, when I run that command it doesn't download anything.

I'm trying to write a PyTorch data loader. Torchvision is able to automatically download and checksum datasets from a URL, but the FTP and rsync URLs don't work for this.

schmitt-muc commented 3 years ago

I have just checked (running Ubuntu 20.04 LTS from inside Windows 10 Enterprise using WSL2): Running the command rsync -chavzP --stats rsync://m1474000@dataserv.ub.tum.de/m1474000/ path/to/your/local/storage/folder works. Of course you first have to enter the password m1474000, and of course retrieving the incremental file list takes ages, but it should do the job.

adamjstewart commented 3 years ago

Yes, that seems to work, although I still can't download the data from Python without calling some system rsync executable. A normal URL would be much nicer for cases where users aren't using rsync.
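
For reference, calling the system rsync from Python could look roughly like the sketch below. It reuses the URL, flags, and password quoted in this thread and relies on rsync reading the daemon password from the `RSYNC_PASSWORD` environment variable, which avoids the interactive prompt. The helper names `build_rsync_command` and `run_rsync` are made up for illustration, and an rsync executable still has to be installed.

```python
import os
import subprocess
from typing import List

RSYNC_URL = "rsync://m1474000@dataserv.ub.tum.de/m1474000/"  # from the thread
RSYNC_PASSWORD = "m1474000"  # public password quoted in the thread


def build_rsync_command(dest: str, source: str = RSYNC_URL) -> List[str]:
    """Return the rsync invocation from the thread as an argument list."""
    return ["rsync", "-chavzP", "--stats", source, dest]


def run_rsync(dest: str) -> None:
    """Run rsync non-interactively; needs an rsync executable on PATH."""
    # rsync takes the daemon password from RSYNC_PASSWORD instead of prompting.
    env = dict(os.environ, RSYNC_PASSWORD=RSYNC_PASSWORD)
    subprocess.run(build_rsync_command(dest), env=env, check=True)
```

This keeps the dependency on an external binary, which is exactly the drawback raised above, so it is only a fallback.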

schmitt-muc commented 3 years ago

Ah, now I understand. I suggest following Caleb Robinson's advice. At least for me wget -r "ftp://m1474000:m1474000@dataserv.ub.tum.de" does the job just fine and downloads the whole package automatically.

adamjstewart commented 3 years ago

Yes, that URL works with wget but not with Python's urllib for some reason. Is there a working https:// option?

schmitt-muc commented 3 years ago

I have sent an inquiry to TUM's library, which hosts the data on their media server. The response won't make you too happy: there is definitely no https:// option, because even the .zip file you get when clicking the Download button in the graphical interface is only created on the fly by an internal Nextcloud function. The only suggestion I got was to look into the Python libraries ftplib, wget and urllib2, which can handle FTP downloads.
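
Following the ftplib suggestion, a minimal sketch of a pure-Python download plus checksum, assuming the FTP host and credentials from the wget workaround earlier in this thread. The helper names `ftp_download` and `file_checksum` are hypothetical, and the thread lists no reference checksums, so the expected digest would have to come from elsewhere.

```python
import hashlib
import os
from ftplib import FTP

FTP_HOST = "dataserv.ub.tum.de"  # from the wget workaround in the thread
FTP_USER = "m1474000"
FTP_PASSWORD = "m1474000"


def ftp_download(filename: str, dest_dir: str, host: str = FTP_HOST,
                 user: str = FTP_USER, password: str = FTP_PASSWORD) -> str:
    """Fetch one file from the FTP server root into dest_dir; return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename)
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        with open(dest, "wb") as out:
            # RETR streams the remote file in binary mode to out.write.
            ftp.retrbinary(f"RETR {filename}", out.write)
    return dest


def file_checksum(path: str, algorithm: str = "sha256",
                  chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would be along the lines of `path = ftp_download("ROIs1158_spring_lc.tar.gz", "data")` followed by comparing `file_checksum(path)` against a known digest.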

schmitt-muc commented 3 years ago

There also seems to be a mirrored version on Google Cloud Storage, see https://gitlab.com/frontierdevelopmentlab/disaster-prevention/sen12ms: gsutil -m rsync -r gs://fdl_floods_2019_data/SEN12MS. Not sure whether this is of any help to you, though.