jackroos / VL-BERT

Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".

RuntimeError: CUDA out of memory #49

Closed · liulijie-2020 closed this 3 years ago

liulijie-2020 commented 4 years ago

After training a fine-tuned VCR model for nearly a day, the following error occurred:

Traceback (most recent call last):
  File "vcr/train_end2end.py", line 59, in <module>
    main()
  File "vcr/train_end2end.py", line 53, in main
    rank, model = train_net(args, config)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../vcr/function/train.py", line 337, in train_net
    gradient_accumulate_steps=config.TRAIN.GRAD_ACCUMULATE_STEPS)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../common/trainer.py", line 115, in train
    outputs, loss = net(*batch)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../common/module.py", line 22, in forward
    return self.train_forward(*inputs, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../vcr/modules/resnet_vlbert_for_vcr.py", line 340, in train_forward
    output_text_and_object_separately=True)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../common/nlp/time_distributed.py", line 35, in forward
    reshaped_outputs = self._module(*reshaped_inputs, **kwargs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../common/visual_linguistic_bert.py", line 140, in forward
    output_attention_probs=output_attention_probs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../external/pytorch_pretrained_bert/modeling.py", line 410, in forward
    hidden_states = layer_module(hidden_states, attention_mask, output_attention_probs=output_attention_probs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../external/pytorch_pretrained_bert/modeling.py", line 392, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../external/pytorch_pretrained_bert/modeling.py", line 362, in forward
    hidden_states = self.dense(hidden_states)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 92, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/functional.py", line 1408, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 1; 11.91 GiB total capacity; 10.43 GiB already allocated; 12.06 MiB free; 348.09 MiB cached)

I tried training with 2 or 3 GPUs, and I tried to reduce the batch by changing LOG_FREQUENT from 100 to 2, but it didn't help: the error still happened within a day of training. I hope I can get some help with this.
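(For anyone debugging a similar OOM: a quick way to see how close the allocator is to the limit before the crash is to log CUDA memory stats alongside the normal training log. This is a minimal generic PyTorch sketch, not part of the VL-BERT codebase; note that `memory_cached()` is the name in the PyTorch 1.x versions of this era and is called `memory_reserved()` in newer releases.)

```python
import torch

def log_cuda_memory(device=None):
    """Print current/peak CUDA allocator stats for one GPU (PyTorch ~1.1 API)."""
    alloc = torch.cuda.memory_allocated(device) / 1024 ** 2
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    cached = torch.cuda.memory_cached(device) / 1024 ** 2  # memory_reserved() in newer PyTorch
    print('allocated {:.0f} MiB | peak {:.0f} MiB | cached {:.0f} MiB'.format(alloc, peak, cached))

# Hypothetical usage inside the training loop, at the same cadence as the log:
# if global_step % log_frequent == 0:
#     log_cuda_memory()
```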

jackroos commented 4 years ago

LOG_FREQUENT is just the logging frequency; you need to reduce the batch size by changing the option "BATCH_IMAGES".
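(Related: the traceback above shows the trainer already takes `config.TRAIN.GRAD_ACCUMULATE_STEPS`, so besides lowering `BATCH_IMAGES`, gradient accumulation can preserve the effective batch size while keeping fewer samples on the GPU at once. Below is a minimal generic sketch of that pattern, with a toy model and data rather than VL-BERT's actual trainer.)

```python
import torch
import torch.nn as nn

# Toy stand-ins; in VL-BERT these would be the real network and data loader.
net = nn.Linear(10, 1)
loader = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
criterion = nn.MSELoss()
accumulate_steps = 4  # hypothetical value, analogous to TRAIN.GRAD_ACCUMULATE_STEPS

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = criterion(net(x), y)
    # Scale so the accumulated gradient matches the large-batch average,
    # while only a small batch is resident in memory at any one time.
    (loss / accumulate_steps).backward()
    if (i + 1) % accumulate_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```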

liulijie-2020 commented 4 years ago

> LOG_FREQUENT is just the logging frequency; you need to reduce the batch size by changing the option "BATCH_IMAGES".

Thank you very much. It worked. I changed "BATCH_IMAGES" to 2 and used 4 GPUs. But during training, GPU 0's memory usage still rose from 7692MiB / 12212MiB to 10082MiB / 12210MiB over several hours, and I was worried it would run out of memory. Unfortunately, what I was worried about happened: GPU 0's memory usage reached 11362MiB / 12210MiB after another 3 hours.
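(A steady climb over hours, rather than an immediate OOM, is often caused by holding references to tensors that are still attached to the autograd graph, for example accumulating the raw loss for logging. Whether that is what happens inside VL-BERT's trainer is only a guess; the sketch below shows the common leak and its usual fix with a toy loss.)

```python
import torch

# Toy stand-in for a training step's loss; hypothetical, for illustration only.
w = torch.randn(1000, 1000, requires_grad=True)
def compute_loss():
    return (w @ torch.randn(1000, 1)).pow(2).mean()

running_loss = 0.0
for step in range(100):
    loss = compute_loss()
    # Leaky pattern: `running_loss += loss` would keep every step's autograd
    # graph alive, so memory climbs steadily. Using .item() (or .detach())
    # stores a plain Python float and lets each step's graph be freed.
    running_loss += loss.item()
```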