jackroos / VL-BERT

Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
MIT License

CUDA Illegal Memory Access In FastRCNN with ROIAlign #72

Open tkreiman opened 3 years ago

tkreiman commented 3 years ago

The problem starts after line 162 of fast_rcnn.py, which runs the following code:

    roi_align_res = self.roi_align(img_feats['body4'], rois).type(images.dtype)

Printing the bounding boxes tensor right after that line raises the CUDA illegal memory access error shown below. Without the print, the same error occurs later in the code, on the first access to the variable boxes, but I traced its onset to line 162:

Traceback (most recent call last):
  File "vqa/mytrain_end2end.py", line 65, in <module>
    main()
  File "vqa/mytrain_end2end.py", line 57, in main
    rank, model = train_net(args, config)
  File "/home/gabriel/Desktop/Toby/VL-BERT/vqa/../vqa/function/mytrain.py", line 336, in train_net
    gradient_accumulate_steps=config.TRAIN.GRAD_ACCUMULATE_STEPS)
  File "/home/gabriel/Desktop/Toby/VL-BERT/vqa/../common/trainer.py", line 115, in train
    outputs, loss = net(*batch)
  File "/home/gabriel/anaconda3/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/gabriel/Desktop/Toby/VL-BERT/vqa/../common/module.py", line 22, in forward
    return self.train_forward(*inputs, **kwargs)
  File "/home/gabriel/Desktop/Toby/VL-BERT/vqa/../vqa/modules/myresnet_vlbert_for_vqa.py", line 203, in train_forward
    segms=None)
  File "/home/gabriel/anaconda3/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/gabriel/Desktop/Toby/VL-BERT/vqa/../common/fast_rcnn.py", line 163, in forward
    print("3", boxes)
  File "/home/gabriel/anaconda3/envs/vl-bert/lib/python3.6/site-packages/torch/tensor.py", line 179, in __repr__
    return torch._tensor_str._str(self)
  File "/home/gabriel/anaconda3/envs/vl-bert/lib/python3.6/site-packages/torch/_tensor_str.py", line 372, in _str
    return _str_intern(self)
  File "/home/gabriel/anaconda3/envs/vl-bert/lib/python3.6/site-packages/torch/_tensor_str.py", line 352, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/gabriel/anaconda3/envs/vl-bert/lib/python3.6/site-packages/torch/_tensor_str.py", line 241, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/gabriel/anaconda3/envs/vl-bert/lib/python3.6/site-packages/torch/_tensor_str.py", line 89, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: CUDA error: an illegal memory access was encountered
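
A note on reading a traceback like the one above: CUDA kernels are launched asynchronously, so the Python frame that raises (here the `print`) is often only the first place the host synchronizes with the GPU, not the place where the fault happened. A minimal sketch for forcing synchronous launches, so that the error is reported at the offending call, might look like this. The helper name `cuda_launch_blocking_enabled` is hypothetical and exists only for illustration; `CUDA_LAUNCH_BLOCKING` is the standard CUDA environment variable.

```python
import os

# Set this before CUDA is initialized (safest: before `import torch`),
# otherwise the setting may not take effect for the current process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_launch_blocking_enabled() -> bool:
    """Hypothetical helper: report whether synchronous launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING enabled:", cuda_launch_blocking_enabled())
```

The same effect can be had from the shell by prefixing the training command with `CUDA_LAUNCH_BLOCKING=1`. Training runs slower in this mode, so it is meant for debugging only.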

My environment is as follows:

torchvision 0.8.2
pytorch 1.7.1
cudatoolkit 11.0.221

Is there a different ROIAlign implementation I could use instead, or some other way to work around this issue? Thanks for the help.
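
One possible cause worth ruling out before swapping implementations is malformed input to the ROIAlign call: a batch index that points past the feature map's batch dimension, or NaN/inf coordinates, can make an RoI kernel read out of bounds. The sketch below is a pure-Python check under stated assumptions, not the repository's code. It assumes `rois` uses the common (K, 5) layout `[batch_index, x1, y1, x2, y2]`, which is not confirmed by the post, and the function name `find_bad_rois` is hypothetical.

```python
import math
from typing import List, Sequence


def find_bad_rois(rois: Sequence[Sequence[float]], batch_size: int) -> List[str]:
    """Return one message per suspicious row.

    Assumes each row is [batch_index, x1, y1, x2, y2] (an assumption,
    not something the original post states).
    """
    problems = []
    for i, row in enumerate(rois):
        if len(row) != 5:
            problems.append(f"row {i}: expected 5 values, got {len(row)}")
            continue
        if not all(math.isfinite(v) for v in row):
            problems.append(f"row {i}: non-finite value")
            continue
        batch_idx, x1, y1, x2, y2 = row
        # An index outside [0, batch_size) would address a nonexistent image.
        if batch_idx != int(batch_idx) or not 0 <= batch_idx < batch_size:
            problems.append(
                f"row {i}: batch index {batch_idx} outside [0, {batch_size})"
            )
        # Inverted boxes are not necessarily fatal, but usually signal bad data.
        if x2 < x1 or y2 < y1:
            problems.append(f"row {i}: inverted box")
    return problems
```

The check has to run before the ROIAlign call, for example on `rois.detach().cpu().tolist()` with `batch_size` taken from the first dimension of the feature map, because once the illegal access has happened any further read of a CUDA tensor fails with the same error. If the inputs turn out to be clean, `torchvision.ops.roi_align` is one alternative implementation that accepts boxes in this (K, 5) format.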