learnables / learn2learn

A PyTorch Library for Meta-learning Research
http://learn2learn.net
MIT License

CUB200 extraction problem #405

Closed: AntreasAntoniou closed this issue 1 year ago

AntreasAntoniou commented 1 year ago

When I run:

import learn2learn as l2l
from tqdm.auto import tqdm

data = l2l.vision.datasets.CUBirds200(
    root="data/",
    mode="all",
    download=True,
    bounding_box_crop=False,
)

label_set = set()

with tqdm(total=len(data)) as pbar:
    for item in data:
        label_set.add(item[1])
        pbar.update(1)
        pbar.set_description(f"Found {len(label_set)} labels")

I get

Traceback (most recent call last):
  File "/devcode/GATE-private/playgrounds/aircraft-play.py", line 4, in <module>
    data = l2l.vision.datasets.CUBirds200(
  File "/devcode/learn2learn/learn2learn/vision/datasets/cu_birds200.py", line 342, in __init__
    self.download()
  File "/devcode/learn2learn/learn2learn/vision/datasets/cu_birds200.py", line 357, in download
    tar_file = tarfile.open(tar_path)
  File "/opt/conda/envs/main/lib/python3.10/tarfile.py", line 1639, in open
    raise ReadError(f"file could not be opened successfully:\n{error_msgs_summary}")
tarfile.ReadError: file could not be opened successfully:
- method gz: ReadError('not a gzip file')
- method bz2: ReadError('not a bzip2 file')
- method xz: ReadError('not an lzma file')
- method tar: ReadError('invalid header')

I previously made a PR that resolved this, but I was told it wasn't a problem; on my Ubuntu machine, however, this error still occurs.
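
A quick sanity check makes the failure mode clearer: read the first bytes of the downloaded file and compare them against the gzip magic number. (The tar_path below is an assumption about where the downloader places the archive; adjust it to your root= setting.)

# Sketch: check whether the downloaded file is really a gzip archive
# or (as often happens with Google Drive) an HTML error page.
# NOTE: tar_path is an assumption, not the library's fixed location.
tar_path = "data/cubirds200/CUB_200_2011.tgz"

with open(tar_path, "rb") as f:
    head = f.read(2)

if head == b"\x1f\x8b":  # gzip files start with the bytes 0x1f 0x8b
    print("Looks like a valid gzip archive.")
else:
    print(f"Not gzip (first bytes: {head!r}); likely an HTML page instead.")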

AntreasAntoniou commented 1 year ago

Turns out it's a download problem.

I know I PR'd this before and it was patched, but the issue seems to persist on my end. The solution that fixes this is

https://github.com/AntreasAntoniou/learn2learn/commit/ecd4d2226a21b0078b40f737e4e3c7baec3188c6

I know that one ideally prefers not to add more dependencies to a repository, but Google Drive downloads require token hacks that change over time, so delegating that duty to a purpose-built package like gdown may make sense.
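
For illustration, a minimal sketch of what delegating the download to gdown could look like (the file ID is a placeholder, not the dataset's real ID; the actual change is in the commit above):

import gdown

# Placeholder ID: substitute the dataset's actual Google Drive file ID.
file_id = "GOOGLE_DRIVE_FILE_ID"
output = "data/cubirds200/CUB_200_2011.tgz"

# gdown handles Drive's confirmation tokens and large-file quirks
# internally, so no manual cookie or token handling is needed.
gdown.download(f"https://drive.google.com/uc?id={file_id}", output, quiet=False)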

Either way, I understand the pros and cons here. Let me know your thoughts.

seba-1511 commented 1 year ago

Thanks for flagging this @AntreasAntoniou.

I think the more permanent fix is just to move these GDrive datasets to Zenodo -- it's also more reliable as they're pretty much guaranteed to remain accessible (as opposed to someone removing their files from Google Drive).

I had missed CUB200 (and the other ones from MetaDataset). I'll put them on Zenodo this weekend.
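
For context, Zenodo files are served over plain HTTPS, so the download logic becomes trivial. A minimal sketch (the record URL is a placeholder, not the final upload):

import urllib.request

# Placeholder record URL: substitute the actual Zenodo upload.
url = "https://zenodo.org/record/0000000/files/CUB_200_2011.tgz"
destination = "data/cubirds200/CUB_200_2011.tgz"

urllib.request.urlretrieve(url, destination)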

seba-1511 commented 1 year ago

Update: all datasets are on Zenodo (but are only downloaded from there if necessary); the fix will be merged in #406.
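
The "only if necessary" behavior could be structured as a simple fallback, roughly like the sketch below (the helper callables are hypothetical; see #406 for the actual implementation):

import os
import tarfile

def download_with_fallback(download_gdrive, download_zenodo, tar_path):
    # Try the original source first, then verify the result is a real
    # archive; fall back to Zenodo only when verification fails.
    # Both download_* arguments are hypothetical callables taking a path.
    if not os.path.exists(tar_path):
        download_gdrive(tar_path)
    if not tarfile.is_tarfile(tar_path):
        download_zenodo(tar_path)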