Closed nbansal90 closed 3 years ago
Unfortunately, you need a GPU with more memory to train the model with multi-task learning. We used AWS p3.2xlarge spot instances (Nvidia V100) for the experiments. Rewriting the code to avoid BatchNorm will probably lead to different results. I would recommend either using AWS or using the model with transfer learning but without multi-task learning, which should fit into your GPU memory. If you check Table 2, rows 3 and 4, of https://arxiv.org/pdf/2012.10782v2.pdf, you'll see that this only slightly decreases the performance of the model.
To run the experiment with only transfer learning, you can use experiment 210. For that purpose, please comment out all ablations except "sel_{pres_method}_transfer_dcompgt{dc_m}{dc_ft}" in experiments.py#L176 and run
python run_experiments.py --machine ws --exp 210
Hey @lhoyer,
I was looking to run
python run_experiments.py --machine ws --exp 212
to replicate the results in Table 1. However, running on a single GPU with the default batch size of 2 results in a CUDA out-of-memory error on an NVIDIA GPU with about 11 GB of memory. When I tried the workaround of changing the batch size to 1, it gave me the following error:
  File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torchvision/models/segmentation/deeplabv3.py", line 61, in forward
    x = mod(x)
  File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
    self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
  File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torch/nn/functional.py", line 2012, in batch_norm
    _verify_batch_size(input.size())
  File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torch/nn/functional.py", line 1995, in _verify_batch_size
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1, 1])
This error is raised in the deeplabv3 part of the code, which is taken from the torch library (torchvision). One solution would be to rewrite the whole deeplabv3 model without batch norm, but can you suggest a more elegant way to solve this issue, in case you encountered the same problem?