kexinyi / ns-vqa

Neural-symbolic visual question answering

RuntimeError: CUDNN_STATUS_EXECUTION_FAILED #7

Closed zaynmi closed 5 years ago

zaynmi commented 5 years ago

I set up the environment by following every step in README.md, but a cuDNN error occurred at Step 1 (object detection) with this command:

python tools/train_net_step.py \
    --dataset clevr-mini \
    --cfg configs/baselines/e2e_mask_rcnn_R-50-FPN_1x.yaml \
    --bs 8 \
    --set OUTPUT_DIR ../../data/mask_rcnn/outputs

Here is the full error:

INFO test_engine.py: 331: loading checkpoint ../../data/pretrained/object_detector.pt
Traceback (most recent call last):
  File "tools/test_net.py", line 126, in <module>
    check_expected_results=True)
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/core/test_engine.py", line 129, in run_inference
    all_results = result_getter()
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/core/test_engine.py", line 109, in result_getter
    multi_gpu=multi_gpu_testing
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/core/test_engine.py", line 159, in test_net_on_dataset
    args, dataset_name, proposal_file, output_dir, gpu_id=gpu_id
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/core/test_engine.py", line 254, in test_net
    cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/core/test.py", line 71, in im_detect_all
    model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, box_proposals)
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/core/test.py", line 152, in im_detect_bbox
    return_dict = model(**inputs)
  File "/home/tang/anaconda3/envs/ns-vqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/nn/parallel/data_parallel.py", line 108, in forward
    outputs = [self.module(*inputs[0], **kwargs[0])]
  File "/home/tang/anaconda3/envs/ns-vqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/modeling/model_builder.py", line 144, in forward
    return self._forward(data, im_info, roidb, **rpn_kwargs)
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/modeling/model_builder.py", line 155, in _forward
    blob_conv = self.Conv_Body(im_data)
  File "/home/tang/anaconda3/envs/ns-vqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tang/ns-vqa-master/scene_parse/mask_rcnn/lib/modeling/FPN.py", line 228, in forward
    conv_body_blobs = [self.conv_body.res1(x)]
  File "/home/tang/anaconda3/envs/ns-vqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tang/anaconda3/envs/ns-vqa/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/tang/anaconda3/envs/ns-vqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tang/anaconda3/envs/ns-vqa/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

The error occurs right after the checkpoint is loaded (test_engine.py: 331: loading checkpoint ../../data/pretrained/object_detector.pt). When I try the other steps, or training, the same error appears.

System information: Ubuntu 16.04, RTX 2080 Ti, CUDA 9.0.176, cuDNN 7.1.2, PyTorch 0.4.0, Python 3.6.7.

The code runs in a conda virtual environment. CUDA 9.0 and cuDNN 7.3.1 are installed in the base environment. Are they linked in by the make.sh file?
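
Since two different cuDNN versions are mentioned above, it may help to check which versions the PyTorch build inside the conda environment actually reports. The following is a minimal sketch, not part of the ns-vqa repository: `describe_torch_env` is a hypothetical helper name, and it only shows what PyTorch itself reports, so it does not answer what make.sh links against.

```python
import importlib


def describe_torch_env():
    """Collect the versions that the installed PyTorch reports (hypothetical helper)."""
    info = {"torch": None, "cuda_build": None, "cudnn": None, "gpu": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in the active environment.
        return info

    info["torch"] = torch.__version__
    # CUDA version the PyTorch binary was built with (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    # cuDNN version that PyTorch loads, reported as an integer such as 7102.
    info["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print("{}: {}".format(key, value))
```

Running this inside the activated ns-vqa environment and comparing the output with the system-wide CUDA and cuDNN versions would show whether the two sets of libraries differ.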

I have tried many solutions found through Google, e.g. changing the CUDA and cuDNN versions, but I still get the same error. I also tried another project, mac-network, and it runs fine on the GPU in the same virtual environment.

I sincerely hope for a reply. Thanks!

zaynmi commented 5 years ago

I solved this problem by reinstalling CUDA 9.2 and PyTorch 0.4.1.
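
After a reinstall like this, a quick way to confirm that GPU convolutions run at all is a tiny forward pass, which exercises the same kind of call (a `Conv2d` forward) that failed in the traceback. This is a sketch under assumptions: `cudnn_conv_smoke_test` is a hypothetical name, the layer sizes are arbitrary, and passing it does not guarantee that the full Mask R-CNN model will run.

```python
def cudnn_conv_smoke_test():
    """Run one small convolution on the GPU and report what happened."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    if not torch.cuda.is_available():
        return "no-gpu"

    # Arbitrary small layer and input; only the GPU code path matters here.
    conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
    x = torch.randn(1, 3, 32, 32).cuda()
    try:
        with torch.no_grad():
            y = conv(x)
        # Wait for the kernel to finish so asynchronous errors surface here.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return "failed: {}".format(err)
    return "ok" if tuple(y.shape) == (1, 8, 32, 32) else "unexpected-shape"


if __name__ == "__main__":
    print(cudnn_conv_smoke_test())
```

If this prints a `failed: ...` message containing a cuDNN status, the problem is probably in the PyTorch/CUDA/cuDNN installation rather than in the ns-vqa code.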