I am trying to train VL-BERT for RefCOCO+ (python refcoco/train_end2end.py --cfg cfgs/refcoco/base_detected_regions_4x16G.yaml). However, I am getting the following CUDA-related error.
THCudaCheck FAIL file=/project/ocean/tsriniva/VL-BERT/common/lib/roi_pooling/cuda/ROIAlign_cuda.cu line=297 error=98 : invalid device function
Traceback (most recent call last):
  File "refcoco/train_end2end.py", line 60, in <module>
    main()
  File "refcoco/train_end2end.py", line 54, in main
    rank, model = train_net(args, config)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../refcoco/function/train.py", line 323, in train_net
    gradient_accumulate_steps=config.TRAIN.GRAD_ACCUMULATE_STEPS)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/trainer.py", line 115, in train
    outputs, loss = net(*batch)
  File "/home/tsriniva/anaconda2/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/module.py", line 22, in forward
    return self.train_forward(*inputs, **kwargs)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../refcoco/modules/resnet_vlbert_for_refcoco.py", line 96, in train_forward
    segms=None)
  File "/home/tsriniva/anaconda2/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/fast_rcnn.py", line 149, in forward
    roi_align_res = self.roi_align(img_feats['body4'], rois).type(images.dtype)
  File "/home/tsriniva/anaconda2/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/lib/roi_pooling/roi_align.py", line 69, in forward
    input.float(), rois.float(), self.output_size, self.spatial_scale, self.sampling_ratio
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/lib/roi_pooling/roi_align.py", line 20, in forward
    input, rois, spatial_scale, output_size[0], output_size[1], sampling_ratio
RuntimeError: cuda runtime error (98) : invalid device function at /project/ocean/tsriniva/VL-BERT/common/lib/roi_pooling/cuda/ROIAlign_cuda.cu:297
Segmentation fault (core dumped)
Is there any fix for this?