microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License

SpaceNet8 Download broken #2366

Open nilsleh opened 2 hours ago

nilsleh commented 2 hours ago

Description

  File "/opt/anaconda3/envs/torchEnv/lib/python3.10/site-packages/torchgeo/datasets/spacenet.py", line 146, in __init__
    self._verify()
  File "/opt/anaconda3/envs/torchEnv/lib/python3.10/site-packages/torchgeo/datasets/spacenet.py", line 332, in _verify
    aws('s3', 'cp', url, root)
  File "/opt/anaconda3/envs/torchEnv/lib/python3.10/site-packages/torchgeo/datasets/utils.py", line 290, in __call__
    return subprocess.run((self.name, *args), **kwargs)
  File "/opt/anaconda3/envs/torchEnv/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('/usr/local/bin/aws', 's3', 'cp', 's3://spacenet-dataset/spacenet/SN8_floods/tarballs/Germany_Training_Public.tar.gz', './SN8_floods/train')' returned non-zero exit status 1.

Steps to reproduce


from torchgeo.datasets import SpaceNet8

ds = SpaceNet8(root=".", split="train", download=True)

Or do I potentially need to configure something else? I do have aws-cli installed.

Version

0.7.0.dev0

nilsleh commented 2 hours ago

Never mind, I just need to learn how to use aws-cli properly.
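
For anyone hitting the same `CalledProcessError`: the traceback only reports the exit status, not what aws-cli itself printed. A minimal sketch for surfacing that message when rerunning the command by hand is below. `run_and_report` is a hypothetical helper written for illustration, not part of TorchGeo, and the example command in the comment is copied from the traceback.

```python
import subprocess
import sys


def run_and_report(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run a command, capture its output, and echo stderr if it fails."""
    # check=False so a non-zero exit does not raise before we can inspect stderr.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        # The tool's own message is usually more informative than the exit status.
        print(f"exit status {result.returncode}", file=sys.stderr)
        print(result.stderr, file=sys.stderr)
    return result


# Example call (not executed here), using the command from the traceback:
# run_and_report([
#     "aws", "s3", "cp",
#     "s3://spacenet-dataset/spacenet/SN8_floods/tarballs/Germany_Training_Public.tar.gz",
#     "./SN8_floods/train",
# ])
```

The returned object also carries `stdout`, so the same helper can be used to confirm what a successful run printed.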

adamjstewart commented 1 hour ago

I'm not able to reproduce the exact error message (the download "succeeds" for me), but the downloaded file is corrupted, and tar crashes instead:

> python3
>>> from torchgeo.datasets import SpaceNet8
>>> ds = SpaceNet8(root="data", split="train", download=True)
download: s3://spacenet-dataset/spacenet/SN8_floods/tarballs/Germany_Training_Public.tar.gz to data/SN8_floods/train/Germany_Training_Public.tar.gz
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Adam/torchgeo/torchgeo/datasets/spacenet.py", line 146, in __init__
    self._verify()
  File "/Users/Adam/torchgeo/torchgeo/datasets/spacenet.py", line 336, in _verify
    extract_archive(os.path.join(root, tarball), root)
  File "/Users/Adam/spack/var/spack/environments/default/.spack-env/view/lib/python3.11/site-packages/torchvision/datasets/utils.py", line 374, in extract_archive
    extractor(from_path, to_path, compression)
  File "/Users/Adam/spack/var/spack/environments/default/.spack-env/view/lib/python3.11/site-packages/torchvision/datasets/utils.py", line 220, in _extract_tar
    tar.extractall(to_path)
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 2265, in extractall
    self._extract_one(tarinfo, path, set_attrs=not tarinfo.isdir(),
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 2328, in _extract_one
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 2411, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 2465, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/tarfile.py", line 252, in copyfileobj
    buf = src.read(bufsize)
          ^^^^^^^^^^^^^^^^^
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/gzip.py", line 301, in read
    return self._buffer.read(size)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Adam/spack/opt/spack/darwin-sequoia-m2/apple-clang-16.0.0/python-3.11.9-miamin5zo2vhkrb22ej7xpjqlcjsuugs/lib/python3.11/gzip.py", line 518, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

The checksum is indeed different. However, when I download the file outside of TorchGeo, I don't see this issue:

> aws s3 cp s3://spacenet-dataset/spacenet/SN8_floods/tarballs/Germany_Training_Public.tar.gz .
> md5 Germany_Training_Public.tar.gz 
MD5 (Germany_Training_Public.tar.gz) = 5f1c9ac3ea94f2909da593d894680ea2
> tar xzf Germany_Training_Public.tar.gz 

Unclear if this is a transient issue or something else.
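
To compare the two downloads without going through `tar`, something like the following could be run against both files. It is only a sketch using the standard library: `md5sum` and `gzip_is_complete` are hypothetical helper names, and the check only detects a truncated or damaged gzip stream, which is what the `EOFError` above suggests.

```python
import gzip
import hashlib
import zlib


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def gzip_is_complete(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return False if the gzip stream is truncated or otherwise unreadable."""
    try:
        with gzip.open(path, "rb") as f:
            # Decompress to the end; a truncated file raises EOFError here.
            while f.read(chunk_size):
                pass
    except (EOFError, gzip.BadGzipFile, zlib.error):
        return False
    return True
```

For the file downloaded directly with `aws s3 cp`, `md5sum` should give the `5f1c9ac3ea94f2909da593d894680ea2` value shown above. Running both helpers on `data/SN8_floods/train/Germany_Training_Public.tar.gz` would show whether the copy fetched through TorchGeo is simply cut short.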

P.S. I think I still have SN8 (and all other versions) downloaded on our AI4EO server if you need it immediately.

adamjstewart commented 1 hour ago

Also, the lead on SN8 was Ronny Haensch from DLR. I have an email thread with him asking about the SN8 AOIs if you want me to ping him on this. But I think we need to get to the bottom of why it isn't working inside TorchGeo first.

nilsleh commented 1 hour ago

You are right, the corrupted download also happens for the "test" split. I wanted to download the dataset so that I can add a datamodule for SpaceNet6 and SpaceNet8. SpaceNet6 downloads fine with no errors.