Closed abhinavkaul95 closed 3 years ago
@abhinavkaul95 I am the author @hbb1, not @piaozhx. Please don't bother him 🤣.
According to logs_training.txt, the error is caused by CUDA running out of memory:
result = self.forward(*input, **kwargs)
File "/media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, channel_last, self.fuse_relu)
File "/media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/optimized_sync_batchnorm_kernel.py", line 64, in forward
out = syncbn.batchnorm_forward(input, mean, inv_std, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 7.80 GiB total capacity; 3.80 GiB already allocated; 17.50 MiB free; 3.82 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
Make sure you have sufficient GPU memory, or reduce the batch size. In my configuration, I use two TITAN X GPUs with a batch size of 64.
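As a quick sanity check, you can print how much memory each GPU actually has available before training. This is a minimal sketch using standard torch.cuda calls (not part of the LBYLNet code), assuming a reasonably recent PyTorch:

```python
import torch

# Minimal sketch: report per-GPU memory usage so you can gauge how much
# headroom is left before choosing a batch size.
# Note: memory_reserved() requires PyTorch >= 1.4; older versions expose
# the same number as memory_cached().
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    gib = 1024 ** 3
    print(f"GPU {i} ({props.name}): "
          f"{props.total_memory / gib:.2f} GiB total, "
          f"{torch.cuda.memory_allocated(i) / gib:.2f} GiB allocated, "
          f"{torch.cuda.memory_reserved(i) / gib:.2f} GiB reserved by PyTorch")
```

If the total capacity minus what other processes already hold is far below what a batch of 64 needs, lowering the batch size (or freeing the GPU) is the straightforward fix.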
Sure @hbb1, I think I was just blindly commenting on the issues instead of checking who was actually handling them. :rofl: :rofl: :rofl:
I have reduced the batch size, and that seems to have solved the issue. I will let you know if it comes up again, which is unlikely. Thanks. :smiley: :+1:
Hi @piaozhx,
While trying to train the model, the training stops after some time and I get the following error at the end:
There are some other errors in between as well, but this was the last one. The complete training logs are attached to this issue: logs_training.txt