jramapuram / BYOL

Bootstrap Your Own Latent (BYOL) pytorch implementation using DistributedDataParallel.
MIT License

Image loader error #1

Closed jlindsey15 closed 4 years ago

jlindsey15 commented 4 years ago

Hi! Thanks for this code. I'm getting the following error when trying to run it. Any idea what might be happening? Thank you!

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/jwl2182/.conda/envs/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/share/ctn/users/jwl2182/BYOL/main.py", line 744, in run
    loader, model, grapher = build_loader_model_grapher(args)  # build the model, loader and grapher
  File "/share/ctn/users/jwl2182/BYOL/main.py", line 419, in build_loader_model_grapher
    loader = get_loader(loader_dict)
  File "/share/ctn/users/jwl2182/BYOL/datasets/loader.py", line 175, in get_loader
    **kwargs)
  File "/share/ctn/users/jwl2182/BYOL/datasets/imagefolder.py", line 132, in __init__
    train_samples_and_labels = self.train_loader.__iter__().__next__()
  File "/home/jwl2182/.conda/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/jwl2182/.conda/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/jwl2182/.conda/envs/py36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/jwl2182/.conda/envs/py36/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 68, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/jwl2182/.conda/envs/py36/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 68, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/jwl2182/.conda/envs/py36/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 43, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 479 and 140 in dimension 2 at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/TH/generic/THTensor.cpp:711

jramapuram commented 4 years ago

Hi there,

Thanks for testing this out!

What I usually do is resize all of imagenet to 256x256 and then use 224x224 augmentations for training and 224x224 center crops for testing. The error you are seeing is due to an image having a dimension of 140, which is smaller than the expected size. Your options are:

  1. Resize all images to 256x256 up front and all the rest should work (see the resize sketch after this list).
  2. Modify the transforms: https://github.com/jramapuram/BYOL/blob/master/main.py#L347-L400 [reading larger files will be slower though, so probably try 1. first]
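
For option 1, a minimal offline resize sketch (not part of this repo; the use of PIL and the paths are assumptions) that walks an ImageFolder-style tree and writes 256x256 copies could look like:

    import os
    from PIL import Image

    def resize_tree(src_root, dst_root, size=(256, 256)):
        """Walk an ImageFolder-style tree and save resized RGB copies."""
        for dirpath, _, filenames in os.walk(src_root):
            for name in filenames:
                if not name.lower().endswith((".jpg", ".jpeg", ".png")):
                    continue
                dst_dir = os.path.join(dst_root, os.path.relpath(dirpath, src_root))
                os.makedirs(dst_dir, exist_ok=True)
                with Image.open(os.path.join(dirpath, name)) as img:
                    img.convert("RGB").resize(size, Image.BILINEAR).save(os.path.join(dst_dir, name))

    # e.g. resize_tree("path_to_imagenet/train", "path_to_imagenet_256/train")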

Let me know if you have any further issues.

jlindsey15 commented 4 years ago

Thanks for the quick response! What modification to the transforms would be needed? I'm using standard ImageNet images and haven't had this difficulty with similar models (SimCLR, MOCO, etc.).

jramapuram commented 4 years ago

The official pytorch Moco works for you? The transforms are very similar; I'd suspect that if you see the error here, you would also see it in the Moco implementation.

[image: transform comparison, BYOL on left; Moco on right]

jlindsey15 commented 4 years ago

Yeah, I've used the official pytorch Moco without modification. I can pinpoint the error to the following:

running:

temp = MultiAugmentImageFolder(path="path_to_imagenet", batch_size=512)

gives me the same error as above ("Got X and Y in dimension 2" where X and Y vary from run to run)

But running:

temp = torchvision.datasets.ImageFolder("path_to_imagenet")

works fine.

Is it possible to replace the MultiAugmentImageFolder class in your code with the standard Pytorch ImageFolder?

jramapuram commented 4 years ago

MultiAugmentImageDataset is about as barebones an implementation as you can get for doing multiple augmentations; it already inherits from torchvision.datasets.ImageFolder.

MultiAugmentImageFolder simply builds the torchvision dataset, adds the transforms, and wraps it in a torch DataLoader. If you complete your example, you get the same error with plain torch on non-resized imagenet:

pytorch = torch.utils.data.DataLoader(
    torchvision.datasets.ImageFolder("path_to_imagenet", transform=torchvision.transforms.ToTensor()),
    batch_size=32, num_workers=4)
pytorch.__iter__().__next__()

This results in the following in pytorch 1.5.1 on py37

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/jramapuram/.venv/envs/pytorch1.5-py37/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/jramapuram/.venv/envs/pytorch1.5-py37/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/jramapuram/.venv/envs/pytorch1.5-py37/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/jramapuram/.venv/envs/pytorch1.5-py37/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/jramapuram/.venv/envs/pytorch1.5-py37/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [3, 250, 250] at entry 0 and [3, 150, 200] at entry 1
jlindsey15 commented 4 years ago

Yeah, that's true -- but including

transforms.RandomResizedCrop(224, scale=(0.08, 1.))

in the dataset transforms fixes the issue when you use torchvision.datasets.ImageFolder, whereas it doesn't seem to help with MultiAugmentImageFolder.
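
For reference, a sketch of the plain-torchvision pipeline that works here ("path_to_imagenet" is a placeholder), where the RandomResizedCrop forces every sample to the same shape before collation:

    import torch
    import torchvision
    import torchvision.transforms as transforms

    transform = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.08, 1.)),  # every sample becomes 3x224x224
        transforms.ToTensor(),
    ])
    dataset = torchvision.datasets.ImageFolder("path_to_imagenet", transform=transform)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=4)
    images, labels = next(iter(loader))  # stacks cleanly since all crops share a size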

I don't want to cause a hassle for you, I will do my best to figure it out!

jramapuram commented 4 years ago

> I don't want to cause a hassle for you, I will do my best to figure it out!

No worries at all; glad to have someone test it out :)

> transforms.RandomResizedCrop(224, scale=(0.08, 1.))

Yup, same applies for MultiAugmentImageFolder:

In [17]: temp = MultiAugmentImageFolder(path="path_to_imagenet_root", batch_size=32, train_transform=[torchvision.transforms.RandomResizedCrop((224,224)), torchvision.transforms.ToTensor()])
    ...:
dataset loader:  {'num_workers': 2, 'pin_memory': True, 'worker_init_fn': None, 'timeout': 0, 'drop_last': True}
train = 1281167 | test = 50000 | valid = 0
derived image shape =  [3, 224, 224]
derived output size =  1000

Didn't error out for me (this was non-resized imagenet).

jlindsey15 commented 4 years ago

Weird, when I run exactly the same code

temp = MultiAugmentImageFolder(path="path_to_imagenet_root", batch_size=32, train_transform=[torchvision.transforms.RandomResizedCrop((224,224)), torchvision.transforms.ToTensor()])

I get the error. Do you think the pytorch / torchvision versions could be relevant? I'm using PyTorch 1.1.0 and TorchVision 0.3.0

EDIT: I just replicated the error on PyTorch 1.5.0 and torchvision 0.6.0.

jramapuram commented 4 years ago

Interesting, might be worth a shot in a fresh conda env (I have tested with py37 on pytorch 1.5 and pytorch 1.5.1), but before you do that can you verify that you followed the README.md and have a 'train' and 'test' folder in your imagenet directory? You can also just create a symlink from 'val' to 'test'. I doubt it's the latter issue because the error appears on a concatenation, but just want to be sure :)
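
If only the standard 'val' split is present, one way to set up that layout (a sketch, not from the repo; the root path is a placeholder) is:

    import os

    imagenet_root = "path_to_imagenet_root"  # placeholder
    val_dir = os.path.join(imagenet_root, "val")
    test_dir = os.path.join(imagenet_root, "test")

    # Point 'test' at the existing 'val' split so the loader finds both expected folders.
    if not os.path.exists(test_dir):
        os.symlink(val_dir, test_dir)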

jlindsey15 commented 4 years ago

Yeah I made a symlink with "train" and "test." Fresh conda env with py35 and pytorch 1.5.1 has the same issue, alas :(

(though the wording of the error message is a bit different: "RuntimeError: stack expects each tensor to be equal size, but got [3, 375, 500] at entry 0 and [3, 342, 500] at entry 2")

jlindsey15 commented 4 years ago

I think I've solved the issue! I have a follow-up question if you don't mind answering. It arose from the __getitem__ method:

    def __getitem__(self, index):
        """Label is the same for index, so just run augmentations again."""
        sample0, target = self.__getitem_non_transformed__(index)
        samples = [sample0] + [super(MultiAugmentImageDataset, self).__getitem__(index)[0]
                               for _ in range(self.num_augments)]
        return samples + [target]

self.__getitem_non_transformed__ was not resizing images to 224x224, hence the stacking errors. Setting the non_augment_transform resolved the issue.
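
For anyone hitting the same thing, a rough sketch of what that fix could look like; the non_augment_transform keyword is an assumption based on the attribute named above, so the actual argument name in the datasets submodule may differ:

    import torchvision.transforms as transforms

    # Hypothetical keyword: make the un-augmented sample match the 3x224x224 augmented views.
    temp = MultiAugmentImageFolder(
        path="path_to_imagenet_root",
        batch_size=32,
        train_transform=[transforms.RandomResizedCrop((224, 224)), transforms.ToTensor()],
        non_augment_transform=[transforms.Resize((224, 224)), transforms.ToTensor()],
    )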

However, this produced another error downstream in the code. The block above returns a list like [unaugmented, augmentation1, augmentation2, label]. But the rest of the code is set up to receive a list of [augmentation1, augmentation2, label], leading to a "too many values to unpack (expected 3)" error when you start iterating through the train_loader. I can fix this issue by changing the line above:

samples = [sample0] + [super(MultiAugmentImageDataset, self).__getitem__(index)[0]
                               for _ in range(self.num_augments)]

to

samples = [super(MultiAugmentImageDataset, self).__getitem__(index)[0]
                               for _ in range(self.num_augments)]

Is this the correct thing to do, or am I missing something?

jramapuram commented 4 years ago

If you clone via git clone --recursive git+ssh://git@github.com/jramapuram/BYOL.git as per the README.md you won't have this error. The entire point of git submodules is to tightly couple dependencies, so BYOL is coupled with commit 7c5d0d9 from datasets.

Could you add a note here as to what fixed your original bug? Would be useful for tracking. Feel free to open another issue if you have problems.

jlindsey15 commented 4 years ago

Thanks, this (downloading the correct versions of the dependency repos) fixed the issue!