When training, CUDA runs out of memory - How can I reduce the batch size?

jamalknight commented 4 years ago

When I ran the train command I got an error that CUDA is out of memory. Could this be a batch size issue?

Is this where I can change the batch size?

File: human36m_vol_softmax.yaml

Lines 17 and 18:

batch_size: 5
val_batch_size: 10

What would be a good batch size to try?

Command:

python3 train.py --config experiments/human36m/train/human36m_vol_softmax.yaml --logdir ./logs

Error:

args: Namespace(config='experiments/human36m/train/human36m_vol_softmax.yaml', eval=False, eval_dataset='val', local_rank=None, logdir='./logs', seed=42)
Number of available GPUs: 1
Loading pretrained weights from: ./data/pretrained/human36m/pose_resnet_4.5_pixels_human36m.pth
Reiniting final layer filters: module.final_layer.weight
Reiniting final layer biases: module.final_layer.bias
Successfully loaded pretrained weights for backbone
Loading data...
Experiment name: human36m_vol_softmax_VolumetricTriangulationNet@25.06.2020-17:58:32
Traceback (most recent call last):
  File "train.py", line 483, in <module>
    main(args)
  File "train.py", line 462, in main
    n_iters_total_train = one_epoch(model, criterion, opt, config, train_dataloader, device, epoch, n_iters_total=n_iters_total_train, is_train=True, master=master, experiment_dir=experiment_dir, writer=writer)
  File "train.py", line 191, in one_epoch
    keypoints_3d_pred, heatmaps_pred, volumes_pred, confidences_pred, cuboids_pred, coord_volumes_pred, base_points_pred = model(images_batch, proj_matricies_batch, batch)
  File "/home/jamal/anaconda3/envs/learnable_triangulation_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/jamal/jknight_3TB/projects/learnable-triangulation-pytorch/mvn/models/triangulation.py", line 253, in forward
    heatmaps, features, _, vol_confidences = self.backbone(images)
  File "/home/jamal/anaconda3/envs/learnable_triangulation_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/jamal/jknight_3TB/projects/learnable-triangulation-pytorch/mvn/models/pose_resnet.py", line 301, in forward
    x = self.layer3(x)
  File "/home/jamal/anaconda3/envs/learnable_triangulation_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jamal/anaconda3/envs/learnable_triangulation_1/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/jamal/anaconda3/envs/learnable_triangulation_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/jamal/jknight_3TB/projects/learnable-triangulation-pytorch/mvn/models/pose_resnet.py", line 79, in forward
    out = self.bn1(out)
  File "/home/jamal/anaconda3/envs/learnable_triangulation_1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jamal/anaconda3/envs/learnable_triangulation_1/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
    exponential_average_factor, self.eps)
  File "/home/jamal/anaconda3/envs/learnable_triangulation_1/lib/python3.6/site-packages/torch/nn/functional.py", line 1623, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 11.25 MiB (GPU 0; 7.92 GiB total capacity; 6.35 GiB already allocated; 8.56 MiB free; 444.50 KiB cached)

shrubb commented 4 years ago

> Is this where I can change the batch size?

Yes.

> What would be a good batch size to try?

The largest that fits on your GPU (though I'm afraid that might be as low as 1).
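
One way to find a batch size that fits is to start from the configured value and halve it each time the out-of-memory error appears. The sketch below is only an illustration of that idea, not part of this repository: `find_fitting_batch_size` and `try_step` are hypothetical names, and `try_step` stands in for whatever runs a single training iteration at a given batch size. Halving returns a size that fits, which is not necessarily the largest one.

```python
def find_fitting_batch_size(try_step, start, min_size=1):
    """Halve the batch size until try_step(batch_size) no longer runs out of memory.

    try_step is a hypothetical callable supplied by the caller. It should run
    one training iteration at the given batch size and let any RuntimeError
    propagate.
    """
    batch_size = start
    while batch_size >= min_size:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # The traceback in this thread shows the OOM condition surfacing as
            # a RuntimeError whose message contains "out of memory".
            if "out of memory" not in str(err):
                raise  # unrelated error: do not hide it
            batch_size //= 2
    raise RuntimeError("batch size %d does not fit either" % min_size)


# Stand-in for a real training step: pretend anything above 2 exhausts the GPU.
def fake_step(batch_size):
    if batch_size > 2:
        raise RuntimeError("CUDA out of memory.")


print(find_fitting_batch_size(fake_step, start=5))  # prints 2
```

With the real training loop, the value found this way would go into `batch_size` in the config file. `val_batch_size` may need lowering as well if validation also runs out of memory.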

jamalknight commented 4 years ago

Thanks

karfly / learnable-triangulation-pytorch

When training, CUDA runs out of memory - How can I reduce the batch size? #89