jwyang / graph-rcnn.pytorch

[ECCV 2018] Official code for "Graph R-CNN for Scene Graph Generation"

CUDA out of memory while trying to run inference #70

Closed sangminwoo closed 3 years ago

sangminwoo commented 4 years ago

Hi. I was trying to evaluate the model, which was trained in a step-wise (detector, then SGG) manner.

But I ran into an OOM (out-of-memory) problem.

At first I tried with a single GPU, and then even with 2 GPUs, but the OOM still occurred. (Both GPUs are RTX 2080 Ti.)

Has anybody else run into the same trouble?

And is this a problem that can happen even with two GPUs?


```
(graph) woo@IRRLab:~/graph-rcnn.pytorch$ python -m torch.distributed.launch --nproc_per_node=2 main.py --config-file configs/sgg_res101_step.yaml --inference --resume 15000 --visualize
2019-10-29 20:09:42,210 scene_graph_generation INFO: Namespace(algorithm='sg_baseline', config_file='configs/sgg_res101_step.yaml', distributed=True, inference=True, instance=-1, local_rank=0, resume=15000, use_freq_prior=False, visualize=True)
2019-10-29 20:09:42,210 scene_graph_generation INFO: Loaded configuration file configs/sgg_res101_step.yaml
2019-10-29 20:09:42,210 scene_graph_generation INFO: Saving config into: logs/config.yml
images_per_batch: 8, num_gpus: 2
images_per_batch: 1, num_gpus: 2
2019-10-29 20:09:56,317 scene_graph_generation.trainer INFO: Train data size: 56224
2019-10-29 20:09:56,317 scene_graph_generation.trainer INFO: Test data size: 26446
2019-10-29 20:09:57,122 scene_graph_generation.checkpointer INFO: Loading checkpoint from checkpoints/vg_benchmark_object/R-101-C4/faster_rcnn/BatchSize_6/Base_LR_0.005/checkpoint_0099999.pth
2019-10-29 20:09:57,408 scene_graph_generation.inference INFO: Start evaluating
2019-10-29 20:09:57,580 scene_graph_generation.inference INFO: inference on batch 0/13223...
2019-10-29 20:10:01,040 scene_graph_generation.inference INFO: inference on batch 10/13223...
Traceback (most recent call last):
  File "main.py", line 127, in <module>
    main()
  File "main.py", line 124, in main
    test(cfg, args)
  File "main.py", line 80, in test
    model.test(visualize=args.visualize)
  File "/home/woo/graph-rcnn.pytorch/lib/model.py", line 232, in test
    output = self.scene_parser(imgs)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/woo/graph-rcnn.pytorch/lib/scene_parser/parser.py", line 133, in forward
    x_pairs, detection_pairs, rel_heads_loss = self.rel_heads(relation_features, detections, targets)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/woo/graph-rcnn.pytorch/lib/scene_parser/rcnn/modeling/relation_heads/relation_heads.py", line 139, in forward
    self.rel_predictor(features, proposals, proposal_pairs)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/woo/graph-rcnn.pytorch/lib/scene_parser/rcnn/modeling/relation_heads/baseline/baseline.py", line 26, in forward
    x, rel_inds = self.pred_feature_extractor(features, proposals, proposal_pairs)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/woo/graph-rcnn.pytorch/lib/scene_parser/rcnn/modeling/relation_heads/roi_relation_feature_extractors.py", line 61, in forward
    x = self._union_box_feats(x, proposal_pairs)
  File "/home/woo/graph-rcnn.pytorch/lib/scene_parser/rcnn/modeling/relation_heads/roi_relation_feature_extractors.py", line 46, in _union_box_feats
    x = self.head(x_union)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/woo/graph-rcnn.pytorch/lib/scene_parser/rcnn/modeling/backbone/resnet.py", line 203, in forward
    x = getattr(self, stage)(x)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/woo/graph-rcnn.pytorch/lib/scene_parser/rcnn/modeling/backbone/resnet.py", line 339, in forward
    identity = self.downsample(x)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/woo/.conda/envs/graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/woo/graph-rcnn.pytorch/lib/scene_parser/rcnn/layers/batch_norm.py", line 31, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 1.51 GiB (GPU 0; 10.76 GiB total capacity; 7.96 GiB already allocated; 1007.31 MiB free; 456.64 MiB cached)

Traceback (most recent call last):
  [the second distributed process printed an identical stack, ending with:]
RuntimeError: CUDA out of memory. Tried to allocate 1.51 GiB (GPU 1; 10.76 GiB total capacity; 7.90 GiB already allocated; 1.32 GiB free; 524.12 MiB cached)
```


digbose92 commented 4 years ago

I am using the command `python main.py --config-file configs/sgg_res101_step.yaml --inference --resume 39999`. I am also getting the OOM issue during inference: RuntimeError: CUDA out of memory. Tried to allocate 1.51 GiB (GPU 2; 10.73 GiB total capacity; 7.96 GiB already allocated; 1.46 GiB free; 460.89 MiB cached). Any solutions?

jwyang commented 4 years ago

@sangminwoo @digbose92 which sg_algorithm did you use? During inference, relationship detection is computed over all object pairs, which is why memory usage can exceed the maximum. One quick solution is to constrain the number of object proposals to be less than a certain number, say 50; that will probably resolve this issue.
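For intuition, here is a minimal sketch (plain Python arithmetic, nothing repo-specific) of why capping proposals helps: the number of candidate relation pairs grows quadratically with the number of object proposals kept per image. The cap itself would presumably be applied through a config key such as `ROI_HEADS.DETECTIONS_PER_IMG`, which is mentioned later in this thread; the exact key is an assumption to verify against your checkout.

```python
# Rough arithmetic only: the relation head scores every ordered
# subject-object pair, so pair count grows quadratically with the
# number of object proposals kept per image.
for n in (50, 100, 300):
    pairs = n * (n - 1)  # ordered pairs, excluding self-pairs
    print(f"{n} proposals -> {pairs} candidate relation pairs")

# 50 proposals -> 2450 candidate relation pairs
# 100 proposals -> 9900 candidate relation pairs
# 300 proposals -> 89700 candidate relation pairs
```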

sangminwoo commented 4 years ago

Thank you for the kind reply.

I tried sg_baseline and trained it in a step-wise manner. The issue was resolved when I reduced the image size to almost 1/10.

Using sg_grcnn and training jointly seems fine in the test phase.
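For anyone who wants to try the same image-size fix: below is a minimal sketch of how the test-time resolution might be reduced through the yacs config. The `INPUT.MIN_SIZE_TEST`/`INPUT.MAX_SIZE_TEST` keys, the default values, and the `lib.config` import path are assumptions based on the maskrcnn-benchmark layout this repo follows, so verify them against your checkout.

```python
# A sketch, not a verified fix: shrink the test-time image resolution to cut
# activation memory. Keys, defaults, and import path are assumptions.
from lib.config import cfg  # import path assumed from the repo layout

cfg.merge_from_file("configs/sgg_res101_step.yaml")
cfg.merge_from_list([
    "INPUT.MIN_SIZE_TEST", 400,  # e.g. down from a typical default of 800
    "INPUT.MAX_SIZE_TEST", 667,  # e.g. down from a typical default of 1333
])
```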

L6-hong commented 3 years ago

@sangminwoo Hello, I have encountered the same problem as you, and I only have one GPU. Which parts of the code did you modify to make the program run? Please help me solve this problem, thank you.

L6-hong commented 3 years ago

I have encountered the same memory problem. How should I solve it? I have now reduced the dataset to 10 images, but I still get the same error. I very much hope to solve this. Could you give me a suggestion? Thank you.

sangminwoo commented 3 years ago

@L6-hong GPU memory usage depends on the per-batch computation rather than on the dataset size, so maybe you should first try reducing the batch size and image size.
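To confirm whether a change actually lowers memory pressure, PyTorch's built-in CUDA memory counters can be probed around the failing step. This is a generic sketch using standard `torch.cuda` calls; the model handle in the usage comment is hypothetical:

```python
# A generic probe using standard PyTorch CUDA memory counters.
import torch

def report(tag, device=0):
    mb = 1024 ** 2
    print(f"[{tag}] allocated: {torch.cuda.memory_allocated(device) / mb:.0f} MiB, "
          f"peak: {torch.cuda.max_memory_allocated(device) / mb:.0f} MiB")

# Usage sketch around the suspect step:
# report("before forward")
# output = scene_parser(imgs)  # hypothetical handle to the model
# report("after forward")
```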

L6-hong commented 3 years ago

@sangminwoo First of all, thank you very much for your reply and help. I modified things as you suggested: I changed the batch size to 1 and the image size to 128 or even 64, but the same problem still occurred. Since I've changed the corresponding parameters and the problem persists, I guess there is something wrong with my parameter settings. Do you have any other suggestions? In the meantime, could you share your revised code?

sangminwoo commented 3 years ago

@L6-hong I solved the OOM problem a long time ago, but unfortunately I can't remember what I modified. All I can do to help is point out what you might have missed. What is your setup? And please show your error log.

L6-hong commented 3 years ago

@sangminwoo Hello, first of all, thank you very much for your reply. In the code, I set the batch size to 1 and the image size to 64. I also modified ROI_HEADS.DETECTIONS_PER_IMG to 1 and changed RPN.BATCH_SIZE_PER_IMAGE from 256 to 1. However, the same problem still appears. The following is the result when I run `python main.py --config-file configs/faster_rcnn_res101.yaml`. Do you have any solutions? Thank you.

```
2020-12-18 09:21:50,268 scene_graph_generation INFO: Namespace(algorithm='sg_baseline', batchsize=1, config_file='configs/faster_rcnn_res101.yaml', distributed=False, inference=False, instance=-1, local_rank=0, resume=0, session=0, use_freq_prior=False, visualize=False)
2020-12-18 09:21:50,269 scene_graph_generation INFO: Loaded configuration file configs/faster_rcnn_res101.yaml
2020-12-18 09:21:50,269 scene_graph_generation INFO: Saving config into: logs/config.yml
images_per_batch: 1, num_gpus: 1
images_per_batch: 1, num_gpus: 1
2020-12-18 09:21:50,300 scene_graph_generation.trainer INFO: Train data size: 1
2020-12-18 09:21:50,300 scene_graph_generation.trainer INFO: Test data size: 1
2020-12-18 09:21:50,300 scene_graph_generation.trainer INFO: Computing frequency prior matrix...
processing 0/1
2020-12-18 09:21:54,839 scene_graph_generation.checkpointer INFO: Loading checkpoint from catalog://ImageNetPretrained/MSRA/R-101
2020-12-18 09:21:54,840 scene_graph_generation.checkpointer INFO: catalog://ImageNetPretrained/MSRA/R-101 points to https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/MSRA/R-101.pkl
2020-12-18 09:21:54,840 scene_graph_generation.checkpointer INFO: url https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/MSRA/R-101.pkl cached in /home/wh/.torch/models/R-101.pkl
2020-12-18 09:21:55,023 scene_graph_generation.checkpointer INFO: missed keys: ['backbone.body.layer1.0.bn1.running_mean', 'backbone.body.layer1.0.bn1.running_var', 'backbone.body.layer1.0.bn2.running_mean', 'backbone.body.layer1.0.bn2.running_var', 'backbone.body.layer1.0.bn3.running_mean', 'backbone.body.layer1.0.bn3.running_var', 'backbone.body.layer1.0.downsample.1.running_mean', 'backbone.body.layer1.0.downsample.1.running_var', 'backbone.body.layer1.1.bn1.running_mean', 'backbone.body.layer1.1.bn1.running_var', 'backbone.body.layer1.1.bn2.running_mean', 'backbone.body.layer1.1.bn2.running_var', 'backbone.body.layer1.1.bn3.running_mean', 'backbone.body.layer1.1.bn3.running_var', 'backbone.body.layer1.2.bn1.running_mean', 'backbone.body.layer1.2.bn1.running_var', 'backbone.body.layer1.2.bn2.running_mean', 'backbone.body.layer1.2.bn2.running_var', 'backbone.body.layer1.2.bn3.running_mean', 'backbone.body.layer1.2.bn3.running_var', 'backbone.body.layer2.0.bn1.running_mean', 'backbone.body.layer2.0.bn1.running_var', 'backbone.body.layer2.0.bn2.running_mean', 'backbone.body.layer2.0.bn2.running_var', 'backbone.body.layer2.0.bn3.running_mean', 'backbone.body.layer2.0.bn3.running_var', 'backbone.body.layer2.0.downsample.1.running_mean', 'backbone.body.layer2.0.downsample.1.running_var', 'backbone.body.layer2.1.bn1.running_mean', 'backbone.body.layer2.1.bn1.running_var', 'backbone.body.layer2.1.bn2.running_mean', 'backbone.body.layer2.1.bn2.running_var', 'backbone.body.layer2.1.bn3.running_mean', 'backbone.body.layer2.1.bn3.running_var', 'backbone.body.layer2.2.bn1.running_mean', 'backbone.body.layer2.2.bn1.running_var', 'backbone.body.layer2.2.bn2.running_mean', 'backbone.body.layer2.2.bn2.running_var', 'backbone.body.layer2.2.bn3.running_mean', 'backbone.body.layer2.2.bn3.running_var', 'backbone.body.layer2.3.bn1.running_mean', 'backbone.body.layer2.3.bn1.running_var', 'backbone.body.layer2.3.bn2.running_mean', 'backbone.body.layer2.3.bn2.running_var', 'backbone.body.layer2.3.bn3.running_mean', 'backbone.body.layer2.3.bn3.running_var', 'backbone.body.layer3.0.bn1.running_mean', 'backbone.body.layer3.0.bn1.running_var', 'backbone.body.layer3.0.bn2.running_mean', 
'backbone.body.layer3.0.bn2.running_var', 'backbone.body.layer3.0.bn3.running_mean', 'backbone.body.layer3.0.bn3.running_var', 'backbone.body.layer3.0.downsample.1.running_mean', 'backbone.body.layer3.0.downsample.1.running_var', 'backbone.body.layer3.1.bn1.running_mean', 'backbone.body.layer3.1.bn1.running_var', 'backbone.body.layer3.1.bn2.running_mean', 'backbone.body.layer3.1.bn2.running_var', 'backbone.body.layer3.1.bn3.running_mean', 'backbone.body.layer3.1.bn3.running_var', 'backbone.body.layer3.10.bn1.running_mean', 'backbone.body.layer3.10.bn1.running_var', 'backbone.body.layer3.10.bn2.running_mean', 'backbone.body.layer3.10.bn2.running_var', 'backbone.body.layer3.10.bn3.running_mean', 'backbone.body.layer3.10.bn3.running_var', 'backbone.body.layer3.11.bn1.running_mean', 'backbone.body.layer3.11.bn1.running_var', 'backbone.body.layer3.11.bn2.running_mean', 'backbone.body.layer3.11.bn2.running_var', 'backbone.body.layer3.11.bn3.running_mean', 'backbone.body.layer3.11.bn3.running_var', 'backbone.body.layer3.12.bn1.running_mean', 'backbone.body.layer3.12.bn1.running_var', 'backbone.body.layer3.12.bn2.running_mean', 'backbone.body.layer3.12.bn2.running_var', 'backbone.body.layer3.12.bn3.running_mean', 'backbone.body.layer3.12.bn3.running_var', 'backbone.body.layer3.13.bn1.running_mean', 'backbone.body.layer3.13.bn1.running_var', 'backbone.body.layer3.13.bn2.running_mean', 'backbone.body.layer3.13.bn2.running_var', 'backbone.body.layer3.13.bn3.running_mean', 'backbone.body.layer3.13.bn3.running_var', 'backbone.body.layer3.14.bn1.running_mean', 'backbone.body.layer3.14.bn1.running_var', 'backbone.body.layer3.14.bn2.running_mean', 'backbone.body.layer3.14.bn2.running_var', 'backbone.body.layer3.14.bn3.running_mean', 'backbone.body.layer3.14.bn3.running_var', 'backbone.body.layer3.15.bn1.running_mean', 'backbone.body.layer3.15.bn1.running_var', 'backbone.body.layer3.15.bn2.running_mean', 'backbone.body.layer3.15.bn2.running_var', 'backbone.body.layer3.15.bn3.running_mean', 'backbone.body.layer3.15.bn3.running_var', 'backbone.body.layer3.16.bn1.running_mean', 'backbone.body.layer3.16.bn1.running_var', 'backbone.body.layer3.16.bn2.running_mean', 'backbone.body.layer3.16.bn2.running_var', 'backbone.body.layer3.16.bn3.running_mean', 'backbone.body.layer3.16.bn3.running_var', 'backbone.body.layer3.17.bn1.running_mean', 'backbone.body.layer3.17.bn1.running_var', 'backbone.body.layer3.17.bn2.running_mean', 'backbone.body.layer3.17.bn2.running_var', 'backbone.body.layer3.17.bn3.running_mean', 'backbone.body.layer3.17.bn3.running_var', 'backbone.body.layer3.18.bn1.running_mean', 'backbone.body.layer3.18.bn1.running_var', 'backbone.body.layer3.18.bn2.running_mean', 'backbone.body.layer3.18.bn2.running_var', 'backbone.body.layer3.18.bn3.running_mean', 'backbone.body.layer3.18.bn3.running_var', 'backbone.body.layer3.19.bn1.running_mean', 'backbone.body.layer3.19.bn1.running_var', 'backbone.body.layer3.19.bn2.running_mean', 'backbone.body.layer3.19.bn2.running_var', 'backbone.body.layer3.19.bn3.running_mean', 'backbone.body.layer3.19.bn3.running_var', 'backbone.body.layer3.2.bn1.running_mean', 'backbone.body.layer3.2.bn1.running_var', 'backbone.body.layer3.2.bn2.running_mean', 'backbone.body.layer3.2.bn2.running_var', 'backbone.body.layer3.2.bn3.running_mean', 'backbone.body.layer3.2.bn3.running_var', 'backbone.body.layer3.20.bn1.running_mean', 'backbone.body.layer3.20.bn1.running_var', 'backbone.body.layer3.20.bn2.running_mean', 'backbone.body.layer3.20.bn2.running_var', 
'backbone.body.layer3.20.bn3.running_mean', 'backbone.body.layer3.20.bn3.running_var', 'backbone.body.layer3.21.bn1.running_mean', 'backbone.body.layer3.21.bn1.running_var', 'backbone.body.layer3.21.bn2.running_mean', 'backbone.body.layer3.21.bn2.running_var', 'backbone.body.layer3.21.bn3.running_mean', 'backbone.body.layer3.21.bn3.running_var', 'backbone.body.layer3.22.bn1.running_mean', 'backbone.body.layer3.22.bn1.running_var', 'backbone.body.layer3.22.bn2.running_mean', 'backbone.body.layer3.22.bn2.running_var', 'backbone.body.layer3.22.bn3.running_mean', 'backbone.body.layer3.22.bn3.running_var', 'backbone.body.layer3.3.bn1.running_mean', 'backbone.body.layer3.3.bn1.running_var', 'backbone.body.layer3.3.bn2.running_mean', 'backbone.body.layer3.3.bn2.running_var', 'backbone.body.layer3.3.bn3.running_mean', 'backbone.body.layer3.3.bn3.running_var', 'backbone.body.layer3.4.bn1.running_mean', 'backbone.body.layer3.4.bn1.running_var', 'backbone.body.layer3.4.bn2.running_mean', 'backbone.body.layer3.4.bn2.running_var', 'backbone.body.layer3.4.bn3.running_mean', 'backbone.body.layer3.4.bn3.running_var', 'backbone.body.layer3.5.bn1.running_mean', 'backbone.body.layer3.5.bn1.running_var', 'backbone.body.layer3.5.bn2.running_mean', 'backbone.body.layer3.5.bn2.running_var', 'backbone.body.layer3.5.bn3.running_mean', 'backbone.body.layer3.5.bn3.running_var', 'backbone.body.layer3.6.bn1.running_mean', 'backbone.body.layer3.6.bn1.running_var', 'backbone.body.layer3.6.bn2.running_mean', 'backbone.body.layer3.6.bn2.running_var', 'backbone.body.layer3.6.bn3.running_mean', 'backbone.body.layer3.6.bn3.running_var', 'backbone.body.layer3.7.bn1.running_mean', 'backbone.body.layer3.7.bn1.running_var', 'backbone.body.layer3.7.bn2.running_mean', 'backbone.body.layer3.7.bn2.running_var', 'backbone.body.layer3.7.bn3.running_mean', 'backbone.body.layer3.7.bn3.running_var', 'backbone.body.layer3.8.bn1.running_mean', 'backbone.body.layer3.8.bn1.running_var', 'backbone.body.layer3.8.bn2.running_mean', 'backbone.body.layer3.8.bn2.running_var', 'backbone.body.layer3.8.bn3.running_mean', 'backbone.body.layer3.8.bn3.running_var', 'backbone.body.layer3.9.bn1.running_mean', 'backbone.body.layer3.9.bn1.running_var', 'backbone.body.layer3.9.bn2.running_mean', 'backbone.body.layer3.9.bn2.running_var', 'backbone.body.layer3.9.bn3.running_mean', 'backbone.body.layer3.9.bn3.running_var', 'backbone.body.stem.bn1.running_mean', 'backbone.body.stem.bn1.running_var', 'roi_heads.box.feature_extractor.head.layer4.0.bn1.running_mean', 'roi_heads.box.feature_extractor.head.layer4.0.bn1.running_var', 'roi_heads.box.feature_extractor.head.layer4.0.bn2.running_mean', 'roi_heads.box.feature_extractor.head.layer4.0.bn2.running_var', 'roi_heads.box.feature_extractor.head.layer4.0.bn3.running_mean', 'roi_heads.box.feature_extractor.head.layer4.0.bn3.running_var', 'roi_heads.box.feature_extractor.head.layer4.0.downsample.1.running_mean', 'roi_heads.box.feature_extractor.head.layer4.0.downsample.1.running_var', 'roi_heads.box.feature_extractor.head.layer4.1.bn1.running_mean', 'roi_heads.box.feature_extractor.head.layer4.1.bn1.running_var', 'roi_heads.box.feature_extractor.head.layer4.1.bn2.running_mean', 'roi_heads.box.feature_extractor.head.layer4.1.bn2.running_var', 'roi_heads.box.feature_extractor.head.layer4.1.bn3.running_mean', 'roi_heads.box.feature_extractor.head.layer4.1.bn3.running_var', 'roi_heads.box.feature_extractor.head.layer4.2.bn1.running_mean', 'roi_heads.box.feature_extractor.head.layer4.2.bn1.running_var', 
'roi_heads.box.feature_extractor.head.layer4.2.bn2.running_mean', 'roi_heads.box.feature_extractor.head.layer4.2.bn2.running_var', 'roi_heads.box.feature_extractor.head.layer4.2.bn3.running_mean', 'roi_heads.box.feature_extractor.head.layer4.2.bn3.running_var', 'roi_heads.box.predictor.bbox_pred.bias', 'roi_heads.box.predictor.bbox_pred.weight', 'roi_heads.box.predictor.cls_score.bias', 'roi_heads.box.predictor.cls_score.weight', 'rpn.anchor_generator.cell_anchors.0', 'rpn.head.bbox_pred.bias', 'rpn.head.bbox_pred.weight', 'rpn.head.cls_logits.bias', 'rpn.head.cls_logits.weight', 'rpn.head.conv.bias', 'rpn.head.conv.weight']
2020-12-18 09:21:55,266 scene_graph_generation.trainer INFO: Start training
Traceback (most recent call last):
  File "main.py", line 92, in <module>
    main()
  File "main.py", line 87, in main
    model = train(cfg, args)
  File "main.py", line 27, in train
    model.train()
  File "/home/wh/LJH/LJH-RUN/graph-rcnn/lib/model.py", line 131, in train
    loss_dict = self.scene_parser(imgs, targets)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/LJH/LJH-RUN/graph-rcnn/lib/scene_parser/parser.py", line 121, in forward
    x, detections, roi_heads_loss = self.roi_heads(features, proposals, targets)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/LJH/LJH-RUN/graph-rcnn/lib/scene_parser/rcnn/modeling/roi_heads/roi_heads.py", line 23, in forward
    x, detections, loss_box = self.box(features, proposals, targets)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/LJH/LJH-RUN/graph-rcnn/lib/scene_parser/rcnn/modeling/roi_heads/box_head/box_head.py", line 48, in forward
    x = self.feature_extractor(features, proposals)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/LJH/LJH-RUN/graph-rcnn/lib/scene_parser/rcnn/modeling/roi_heads/box_head/roi_box_feature_extractors.py", line 45, in forward
    x = self.head(x)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/LJH/LJH-RUN/graph-rcnn/lib/scene_parser/rcnn/modeling/backbone/resnet.py", line 203, in forward
    x = getattr(self, stage)(x)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/container.py", line 96, in forward
    input = module(input)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/LJH/LJH-RUN/graph-rcnn/lib/scene_parser/rcnn/modeling/backbone/resnet.py", line 339, in forward
    identity = self.downsample(x)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/container.py", line 96, in forward
    input = module(input)
  File "/home/wh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wh/LJH/LJH-RUN/graph-rcnn/lib/scene_parser/rcnn/layers/batch_norm.py", line 31, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 1.96 GiB total capacity; 1.16 GiB already allocated; 84.12 MiB free; 1.10 MiB cached)
```

sangminwoo commented 3 years ago

@L6-hong What GPU are you using? It seems that your GPU has only about 2 GB of memory. As SGG consumes a lot of memory, OOM is likely on your machine.
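As a quick sanity check on capacity (standard PyTorch API), the visible GPUs and their total memory can be listed directly:

```python
# List each visible GPU's name and total memory (standard PyTorch API).
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.2f} GiB")
```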

L6-hong commented 3 years ago

@sangminwoo Hello, I am very sorry for the delay in replying. My GPU memory is 4 GB, and I wanted to debug on my laptop to test whether a small dataset can run through. At present, I'm using only three images. Do three images also need a lot of GPU memory? Thank you for your reply.

sangminwoo commented 3 years ago

@L6-hong Maybe the answer can be found in #15

L6-hong commented 3 years ago

@sangminwoo Thank you very much for your reply.