pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

RuntimeError: CUDA out of memory. (But I have 24G memory available) #4149

Open · Coldfire93 opened this issue 3 years ago

Coldfire93 commented 3 years ago

Hi, I ran the train.py script from the detection reference code.

But I got the CUDA out-of-memory error shown below:

Start training
Epoch: [0]  [   0/2507]  eta: 0:15:11  lr: 0.000040  loss: 3.9314 (3.9314)  loss_classifier: 3.1079 (3.1079)  loss_box_reg: 0.0669 (0.0669)  loss_objectness: 0.6984 (0.6984)  loss_rpn_box_reg: 0.0583 (0.0583)  time: 0.3635  data: 0.0372  max mem: 2224
Epoch: [0]  [  20/2507]  eta: 0:13:41  lr: 0.000440  loss: 3.4063 (3.2239)  loss_classifier: 2.6814 (2.4672)  loss_box_reg: 0.0116 (0.0211)  loss_objectness: 0.6958 (0.6942)  loss_rpn_box_reg: 0.0332 (0.0413)  time: 0.3285  data: 0.0173  max mem: 2657
Epoch: [0]  [  40/2507]  eta: 0:13:35  lr: 0.000839  loss: 1.2476 (2.1992)  loss_classifier: 0.5336 (1.5398)  loss_box_reg: 0.1127 (0.0682)  loss_objectness: 0.3905 (0.5478)  loss_rpn_box_reg: 0.0425 (0.0434)  time: 0.3309  data: 0.0168  max mem: 2657
Epoch: [0]  [  60/2507]  eta: 0:13:24  lr: 0.001239  loss: 0.5988 (1.7346)  loss_classifier: 0.3086 (1.1685)  loss_box_reg: 0.1608 (0.1021)  loss_objectness: 0.1539 (0.4219)  loss_rpn_box_reg: 0.0258 (0.0421)  time: 0.3249  data: 0.0168  max mem: 2821
Epoch: [0]  [  80/2507]  eta: 0:13:17  lr: 0.001638  loss: 0.5835 (1.4710)  loss_classifier: 0.3268 (0.9616)  loss_box_reg: 0.1620 (0.1204)  loss_objectness: 0.0840 (0.3487)  loss_rpn_box_reg: 0.0230 (0.0402)  time: 0.3278  data: 0.0178  max mem: 2821
Epoch: [0]  [ 100/2507]  eta: 0:13:09  lr: 0.002038  loss: 0.4755 (1.2898)  loss_classifier: 0.2139 (0.8241)  loss_box_reg: 0.1554 (0.1294)  loss_objectness: 0.0611 (0.2994)  loss_rpn_box_reg: 0.0190 (0.0369)  time: 0.3258  data: 0.0158  max mem: 2821
Traceback (most recent call last):
  File "train.py", line 235, in <module>
    main(args)
  File "train.py", line 208, in main
    train_one_epoch(model, optimizer, data_loader, device, epoch, args.print_freq)
  File "/home/songhongguang/lwh/vision-master/references/detection/engine.py", line 30, in train_one_epoch
    loss_dict = model(images, targets)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 78, in forward
    images, targets = self.transform(images, targets)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 104, in forward
    image, target_index = self.resize(image, target_index)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 151, in resize
    image, target = _resize_image_and_masks(image, size, float(self.max_size), target)
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 52, in _resize_image_and_masks
    mask = F.interpolate(mask[:, None].float(), scale_factor=scale_factor, recompute_scale_factor=True)[:, 0].byte()
  File "/home/songhongguang/anaconda3/envs/torchdistill/lib/python3.7/site-packages/torch/nn/functional.py", line 3532, in interpolate
    return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)
RuntimeError: CUDA out of memory. Tried to allocate 20.51 GiB (GPU 0; 23.87 GiB total capacity; 5.62 GiB already allocated; 17.22 GiB free; 5.87 GiB reserved in total by PyTorch)
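From the traceback, the allocation happens in _resize_image_and_masks: the whole (N, H, W) uint8 mask stack of one sample is cast to float32 and interpolated in a single call, so the temporary buffers scale with N * H * W. A back-of-the-envelope sketch of that cost, using purely hypothetical sample dimensions chosen only to show how one sample can demand ~20 GiB:

# Hypothetical numbers, NOT taken from the issue: a sample whose
# annotation carries thousands of instance masks.
num_masks, h, w = 7000, 500, 400

# torchvision detection defaults (min_size=800, max_size=1333)
min_size, max_size = 800.0, 1333.0
scale = min(min_size / min(h, w), max_size / max(h, w))
out_h, out_w = int(h * scale), int(w * scale)

float_copy = num_masks * h * w * 4          # mask[:, None].float()
interp_out = num_masks * out_h * out_w * 4  # upsample_nearest2d output
print(f"float copy: {float_copy / 2**30:.2f} GiB")  # ~5.22 GiB
print(f"interp out: {interp_out / 2**30:.2f} GiB")  # ~20.86 GiB

With numbers in that range, the interpolation output alone lands near the ~20 GiB of the failed allocation, so a single pathological sample is enough to exhaust the card regardless of batch size.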

I'm confused, because I have 24 GB of GPU memory available:

[screenshot: nvidia-smi output before training]

I have tried setting batch_size=2, num_workers=0, and pin_memory=False.
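If a smaller batch size doesn't help, one pathological sample (for example, an annotation with thousands of instance masks) could be responsible. A minimal sketch of a CPU-side scan, assuming a torchvision-style detection dataset where target["masks"] is an (N, H, W) uint8 tensor and `dataset` stands in for whatever train.py builds:

# Sketch: flag samples whose mask stacks would need a large float32
# buffer in the model's resize transform (which can grow further when
# the image is upscaled toward min_size=800).
suspects = []
for idx in range(len(dataset)):
    _, target = dataset[idx]
    masks = target.get("masks")
    if masks is None:
        continue
    gib = masks.numel() * 4 / 2**30  # float32 copy made during resize
    if gib > 1.0:                    # arbitrary reporting threshold
        suspects.append((idx, tuple(masks.shape), round(gib, 2)))

for idx, shape, gib in sorted(suspects, key=lambda t: -t[2])[:10]:
    print(f"sample {idx}: masks {shape} -> ~{gib} GiB as float32")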

Could you please tell me the reason? Thank you!

NicolasHug commented 3 years ago

Could you indicate the exact command line that you're using, and on which dataset? It seems strange that the training would suddenly try to allocate 20 GiB of memory around the 100th batch.
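A minimal sketch of how the offending batch could be pinpointed by wrapping the forward call in references/detection/engine.py; the helper name is made up, and the string check targets the plain RuntimeError that PyTorch raises on CUDA OOM:

# Hypothetical helper: run the forward pass and, on CUDA OOM, print the
# shapes of the images and mask stacks in the offending batch.
def forward_with_oom_report(model, images, targets):
    try:
        return model(images, targets)
    except RuntimeError as e:
        if "out of memory" in str(e):
            for img, tgt in zip(images, targets):
                masks = tgt.get("masks")
                print("image:", tuple(img.shape),
                      "masks:", tuple(masks.shape) if masks is not None else None)
        raise

Calling this in place of loss_dict = model(images, targets) in train_one_epoch would at least reveal whether one oversized sample is to blame.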