pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

RuntimeError: CUDA out of memory. (But I have 24G memory available) #4149

Open · Coldfire93 opened this issue 3 years ago

Coldfire93 commented 3 years ago

Hi, I ran the train.py script from the detection reference code.

But I got the CUDA out-of-memory error shown below:

Start training
Epoch: [0]  [   0/2507]  eta: 0:15:11  lr: 0.000040  loss: 3.9314 (3.9314)  loss_classifier: 3.1079 (3.1079)  loss_box_reg: 0.0669 (0.0669)  loss_objectness: 0.6984 (0.6984)  loss_rpn_box_reg: 0.0583 (0.0583)  time: 0.3635  data: 0.0372  max mem: 2224
Epoch: [0]  [  20/2507]  eta: 0:13:41  lr: 0.000440  loss: 3.4063 (3.2239)  loss_classifier: 2.6814 (2.4672)  loss_box_reg: 0.0116 (0.0211)  loss_objectness: 0.6958 (0.6942)  loss_rpn_box_reg: 0.0332 (0.0413)  time: 0.3285  data: 0.0173  max mem: 2657
Epoch: [0]  [  40/2507]  eta: 0:13:35  lr: 0.000839  loss: 1.2476 (2.1992)  loss_classifier: 0.5336 (1.5398)  loss_box_reg: 0.1127 (0.0682)  loss_objectness: 0.3905 (0.5478)  loss_rpn_box_reg: 0.0425 (0.0434)  time: 0.3309  data: 0.0168  max mem: 2657
Epoch: [0]  [  60/2507]  eta: 0:13:24  lr: 0.001239  loss: 0.5988 (1.7346)  loss_classifier: 0.3086 (1.1685)  loss_box_reg: 0.1608 (0.1021)  loss_objectness: 0.1539 (0.4219)  loss_rpn_box_reg: 0.0258 (0.0421)  time: 0.3249  data: 0.0168  max mem: 2821
Epoch: [0]  [  80/2507]  eta: 0:13:17  lr: 0.001638  loss: 0.5835 (1.4710)  loss_classifier: 0.3268 (0.9616)  loss_box_reg: 0.1620 (0.1204)  loss_objectness: 0.0840 (0.3487)  loss_rpn_box_reg: 0.0230 (0.0402)  time: 0.3278  data: 0.0178  max mem: 2821
Epoch: [0]  [ 100/2507]  eta: 0:13:09  lr: 0.002038  loss: 0.4755 (1.2898)  loss_classifier: 0.2139 (0.8241)  loss_box_reg: 0.1554 (0.1294)  loss_objectness: 0.0611 (0.2994)  loss_rpn_box_reg: 0.0190 (0.0369)  time: 0.3258  data: 0.0158  max mem: 2821
Traceback (most recent call last):
  File "train.py", line 235, in <module>
    main(args)
  File "train.py", line 208, in main
    train_one_epoch(model, optimizer, data_loader, device, epoch, args.print_freq)
  File "/home/songhongguang/lwh/vision-master/references/detection/engine.py", line 30, in train_one_epoch
    loss_dict = model(images, targets)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 78, in forward
    images, targets = self.transform(images, targets)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 104, in forward
    image, target_index = self.resize(image, target_index)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 151, in resize
    image, target = _resize_image_and_masks(image, size, float(self.max_size), target)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 52, in _resize_image_and_masks
    mask = F.interpolate(mask[:, None].float(), scale_factor=scale_factor, recompute_scale_factor=True)[:, 0].byte()
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torch/nn/functional.py", line 3532, in interpolate
    return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)
RuntimeError: CUDA out of memory. Tried to allocate 20.51 GiB (GPU 0; 23.87 GiB total capacity; 5.62 GiB already allocated; 17.22 GiB free; 5.87 GiB reserved in total by PyTorch)
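From the traceback, the allocation happens in _resize_image_and_masks: the whole (N, H, W) uint8 mask stack of one sample is cast to float32 and interpolated in a single call, so the temporary buffers scale with N * H * W. A back-of-the-envelope sketch of that cost, using purely hypothetical sample dimensions chosen only to show how one sample can demand ~20 GiB:

# Hypothetical numbers, NOT taken from the issue: a sample whose
# annotation carries thousands of instance masks.
num_masks, h, w = 7000, 500, 400

# torchvision detection defaults (min_size=800, max_size=1333)
min_size, max_size = 800.0, 1333.0
scale = min(min_size / min(h, w), max_size / max(h, w))
out_h, out_w = int(h * scale), int(w * scale)

float_copy = num_masks * h * w * 4          # mask[:, None].float()
interp_out = num_masks * out_h * out_w * 4  # upsample_nearest2d output
print(f"float copy: {float_copy / 2**30:.2f} GiB")  # ~5.22 GiB
print(f"interp out: {interp_out / 2**30:.2f} GiB")  # ~20.86 GiB

With numbers in that range, the interpolation output alone lands near the ~20 GiB of the failed allocation, so a single pathological sample is enough to exhaust the card regardless of batch size.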

I'm confused, because I have 24 GB of GPU memory available:

[screenshot: nvidia-smi output before training]

I have tried setting batch_size=2, num_workers=0, and pin_memory=False.
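If a smaller batch size doesn't help, one pathological sample (for example, an annotation with thousands of instance masks) could be responsible. A minimal sketch of a CPU-side scan, assuming a torchvision-style detection dataset where target["masks"] is an (N, H, W) uint8 tensor and `dataset` stands in for whatever train.py builds:

# Sketch: flag samples whose mask stacks would need a large float32
# buffer in the model's resize transform (which can grow further when
# the image is upscaled toward min_size=800).
suspects = []
for idx in range(len(dataset)):
    _, target = dataset[idx]
    masks = target.get("masks")
    if masks is None:
        continue
    gib = masks.numel() * 4 / 2**30  # float32 copy made during resize
    if gib > 1.0:                    # arbitrary reporting threshold
        suspects.append((idx, tuple(masks.shape), round(gib, 2)))

for idx, shape, gib in sorted(suspects, key=lambda t: -t[2])[:10]:
    print(f"sample {idx}: masks {shape} -> ~{gib} GiB as float32")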

Could you please tell me the reason? Thank you!

NicolasHug commented 3 years ago

Could you indicate the exact command line that you're using, and on which dataset? It seems strange that the training would suddenly try to allocate 20 GiB of memory around the 100th batch.
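A minimal sketch of how the offending batch could be pinpointed by wrapping the forward call in references/detection/engine.py; the helper name is made up, and the string check targets the plain RuntimeError that PyTorch raises on CUDA OOM:

# Hypothetical helper: run the forward pass and, on CUDA OOM, print the
# shapes of the images and mask stacks in the offending batch.
def forward_with_oom_report(model, images, targets):
    try:
        return model(images, targets)
    except RuntimeError as e:
        if "out of memory" in str(e):
            for img, tgt in zip(images, targets):
                masks = tgt.get("masks")
                print("image:", tuple(img.shape),
                      "masks:", tuple(masks.shape) if masks is not None else None)
        raise

Calling this in place of loss_dict = model(images, targets) in train_one_epoch would at least reveal whether one oversized sample is to blame.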