Closed by AmitYadu 4 years ago
2020-01-16 16:31:09,554 - root - INFO - Use Cuda.
2020-01-16 16:31:09,554 - root - INFO - Namespace(base_net_lr=None, batch_size=60, cache_path=None, checkpoint_folder='models/', datasets='./datasets/ILSVRC2015/', debug_steps=100, freeze_net=False, gamma=0.1, lr=0.0003, milestones='80,100', momentum=0.9, num_epochs=30, num_workers=1, pretrained=None, resume=None, scheduler='multi-step', sequence_length=10, ssd_lr=None, t_max=120, use_cuda=True, validation_epochs=5, weight_decay=0.0005, width_mult=1.0)
2020-01-16 16:31:09,576 - root - INFO - Prepare training datasets.
class
2020-01-16 16:31:09,851 - root - INFO - using default Imagenet VID classes.
2020-01-16 16:31:10,226 - root - INFO - gt roidb loaded from datasets/ILSVRC2015/train_VID_seq_gt_db.pkl
2020-01-16 16:31:10,458 - root - INFO - Stored labels into file models/vid-model-labels.txt.
2020-01-16 16:31:10,458 - root - INFO - Train dataset size: 60
2020-01-16 16:31:10,458 - root - INFO - Build network.
2020-01-16 16:31:10,491 - root - INFO - Initializing weights of base net
2020-01-16 16:31:10,510 - root - INFO - Initializing weights of lstm
2020-01-16 16:31:19,438 - root - INFO - Initializing weights of SSD
2020-01-16 16:31:19,460 - root - INFO - Learning rate: 0.0003, Base net learning rate: 0.0003, Extra Layers learning rate: 0.0003.
2020-01-16 16:31:19,461 - root - INFO - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train_mvod_lstm1.py", line 292, in
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch, sequence_length=args.sequence_length)
  File "train_mvod_lstm1.py", line 132, in train
    loss.backward(retain_graph=True)
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 188.00 MiB (GPU 0; 15.90 GiB total capacity; 14.55 GiB already allocated; 138.88 MiB free; 179.10 MiB cached)
As the error states, you don't have enough VRAM on your GPU. Did you free the GPU from all other training runs / CUDA scripts?
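To make the numbers in that error concrete, here is a small sketch that pulls the sizes out of the message and compares the requested allocation with what was free. It uses only the standard library; `parse_oom_message` is a hypothetical helper written for this thread, not part of PyTorch, and the regular expressions assume the exact message wording shown in the log above.

```python
import re

# The final line of the traceback, copied from the log in this thread.
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 188.00 MiB "
    "(GPU 0; 15.90 GiB total capacity; 14.55 GiB already allocated; "
    "138.88 MiB free; 179.10 MiB cached)"
)

# Conversion factors so every size can be compared in MiB.
_UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def _to_mib(value, unit):
    """Convert a (number, unit) pair from the message into MiB."""
    return float(value) * _UNIT_TO_MIB[unit]


def parse_oom_message(message):
    """Hypothetical helper: return the sizes named in the message, in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
        "total": r"([\d.]+) (KiB|MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
        "free": r"([\d.]+) (KiB|MiB|GiB) free",
        "cached": r"([\d.]+) (KiB|MiB|GiB) cached",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {name!r} in message")
        sizes[name] = _to_mib(*match.groups())
    return sizes


sizes = parse_oom_message(OOM_MESSAGE)
# How far the request exceeds the free memory reported by the driver.
shortfall = sizes["requested"] - sizes["free"]
# Share of the card that tensors already occupy.
used_fraction = sizes["allocated"] / sizes["total"]

print(f"requested {sizes['requested']:.2f} MiB, free {sizes['free']:.2f} MiB")
print(f"shortfall {shortfall:.2f} MiB, {used_fraction:.1%} of the card already allocated")
```

If `nvidia-smi` shows no other process holding memory on GPU 0, the remaining levers are most likely the `batch_size=60` and `sequence_length=10` values in the Namespace line. Lowering either would normally reduce the memory needed per step, though how much depends on how the training script builds its batches.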