lhoyer / improving_segmentation_with_selfsupervised_depth

[CVPR21] Implementation of our work "Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation"

Memory Error while Running Exp id: 212 #4

Closed nbansal90 closed 3 years ago

nbansal90 commented 3 years ago

Hey @lhoyer,

I was looking to run python run_experiments.py --machine ws --exp 212 to replicate the results in Table 1. However, running on a single GPU with the default batch size of 2 results in a CUDA out-of-memory error on an NVIDIA GPU with a memory limit of ~11 GB.

When I went with a workaround of changing the batch size to 1, it gave me the following error:

File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torchvision/models/segmentation/deeplabv3.py", line 61, in forward
    x = mod(x)
File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
    self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torch/nn/functional.py", line 2012, in batch_norm
    _verify_batch_size(input.size())
File "/data/home/us000146/anaconda3/envs/selfsupervised/lib/python3.6/site-packages/torch/nn/functional.py", line 1995, in _verify_batch_size
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1, 1])

This error occurs in the DeepLabv3 part of the code, which is taken from the torchvision library. One solution would be to rewrite the whole DeepLabv3 model without batch norm, but do you suggest a more elegant way to solve this issue, in case you ran into the same problem?
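For reference, the input size torch.Size([1, 256, 1, 1]) points to the global-pooling branch of the torchvision DeepLabv3 ASPP head, which averages each channel down to a single spatial value; with batch size 1 there is then only one value per channel for BatchNorm to normalize in training mode. A minimal standalone snippet (independent of this repository) that reproduces the same ValueError:

```python
import torch
import torch.nn as nn

# BatchNorm2d in training mode needs more than one value per channel.
bn = nn.BatchNorm2d(256)
bn.train()

# Batch size 1 with 1x1 spatial resolution, as produced by the ASPP
# global-pooling branch, leaves exactly one value per channel.
x = torch.randn(1, 256, 1, 1)

bn(x)  # ValueError: Expected more than 1 value per channel when training
```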

lhoyer commented 3 years ago

Unfortunately, you need a GPU with more memory to train the model with multi-task learning. We used AWS p3.2xlarge spot instances (Nvidia V100) for the experiments. Rewriting the code to avoid BatchNorm will probably lead to different results. I would recommend either using AWS or training the model with transfer learning but without multi-task learning, which should fit into your GPU memory. If you check Table 2, rows 3 and 4, of https://arxiv.org/pdf/2012.10782v2.pdf, you'll see that this only slightly decreases the performance of the model.
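If you nevertheless want to experiment with removing BatchNorm, a minimal sketch of such a rewrite could look like the following. Note that this swaps in GroupNorm, which is not the configuration used in the paper, so results will likely differ:

```python
import torch.nn as nn

def replace_batchnorm_with_groupnorm(module, num_groups=16):
    # Illustrative only: recursively swap BatchNorm2d for GroupNorm so that
    # training with batch size 1 becomes possible. num_groups must divide
    # the channel count; results will differ from the BatchNorm-based models.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            replace_batchnorm_with_groupnorm(child, num_groups)
    return module
```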

To run the experiment with only transfer learning, you can use experiment 210. For that purpose, please comment out all ablations except "sel_{pres_method}_transfer_dcompgt{dc_m}{dc_ft}" in experiments.py#L176 and run

python run_experiments.py --machine ws --exp 210