visinf / irr

Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation (CVPR 2019)
Apache License 2.0

RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR #44

Open lyq998 opened 2 years ago

lyq998 commented 2 years ago

Hi, I hit this error when running the script IRR-PWC_flyingChairs.sh. Here is my environment: PyTorch 0.4.1, CUDA 8.0, cuDNN 7.0.1.

2022-03-04 17:26:08 ==> Commandline Arguments
2022-03-04 17:26:08 batch_size: 4
2022-03-04 17:26:08 batch_size_val: 4
2022-03-04 17:26:08 checkpoint: saved_check_point/pwcnet/IRR-PWC_flyingchairsOcc/checkpoint_best.ckpt
2022-03-04 17:26:08 checkpoint_exclude_params: ['']
2022-03-04 17:26:08 checkpoint_include_params: ['*']
2022-03-04 17:26:08 checkpoint_mode: resume_from_latest
2022-03-04 17:26:08 cuda: True
2022-03-04 17:26:08 evaluation: True
2022-03-04 17:26:08 lr_scheduler: None
2022-03-04 17:26:08 model: IRR_PWC
2022-03-04 17:26:08 model_div_flow: 0.05
2022-03-04 17:26:08 name: run
2022-03-04 17:26:08 num_iters: 1
2022-03-04 17:26:08 num_workers: 4
2022-03-04 17:26:08 optimizer: Adam
2022-03-04 17:26:08 optimizer_amsgrad: False
2022-03-04 17:26:08 optimizer_betas: (0.9, 0.999)
2022-03-04 17:26:08 optimizer_eps: 1e-08
2022-03-04 17:26:08 optimizer_group: None
2022-03-04 17:26:08 optimizer_lr: 0.001
2022-03-04 17:26:08 optimizer_weight_decay: 0
2022-03-04 17:26:08 save: saved_check_point/pwcnet/eval_temp/IRR_PWC
2022-03-04 17:26:08 save_result_bidirection: False
2022-03-04 17:26:08 save_result_flo: False
2022-03-04 17:26:08 save_result_img: False
2022-03-04 17:26:08 save_result_occ: False
2022-03-04 17:26:08 save_result_path_name:
2022-03-04 17:26:08 save_result_png: False
2022-03-04 17:26:08 seed: 1
2022-03-04 17:26:08 start_epoch: 1
2022-03-04 17:26:08 total_epochs: 10
2022-03-04 17:26:08 training_augmentation: None
2022-03-04 17:26:08 training_dataset: None
2022-03-04 17:26:08 training_loss: None
2022-03-04 17:26:08 validation_augmentation: None
2022-03-04 17:26:08 validation_dataset: SintelTrainingCleanFull
2022-03-04 17:26:08 validation_dataset_photometric_augmentations: False
2022-03-04 17:26:08 validation_dataset_root: /home/liuyuqiao/MPI-Sintel-complete/
2022-03-04 17:26:08 validation_key: epe
2022-03-04 17:26:08 validation_key_minimize: True
2022-03-04 17:26:08 validation_loss: MultiScaleEPE_PWC_Bi_Occ_upsample
2022-03-04 17:26:08 ==> Random Seeds
2022-03-04 17:26:08 Python seed: 1
2022-03-04 17:26:08 Numpy seed: 2
2022-03-04 17:26:08 Torch CPU seed: 3
2022-03-04 17:26:08 Torch CUDA seed: 4
2022-03-04 17:26:08 ==> Datasets
2022-03-04 17:26:08 Validation Dataset: SintelTrainingCleanFull
2022-03-04 17:26:08 basedir: training/clean/alley_1
2022-03-04 17:26:08 input1: [3, 436, 1024]
2022-03-04 17:26:08 input2: [3, 436, 1024]
2022-03-04 17:26:08 target1: [2, 436, 1024]
2022-03-04 17:26:08 target_occ1: [1, 436, 1024]
2022-03-04 17:26:08 num_examples: 1041
2022-03-04 17:26:08 ==> Runtime Augmentations
2022-03-04 17:26:08 training_augmentation: None
2022-03-04 17:26:08 validation_augmentation: None
2022-03-04 17:26:08 ==> Model and Loss
2022-03-04 17:26:08 Initializing MSRA
/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:114: UserWarning: Found GPU0 GeForce RTX 2080 Ti which requires CUDA_VERSION >= 9000 for optimal performance and fast startup time, but your PyTorch was compiled with CUDA_VERSION 8000. Please install the correct PyTorch binary using instructions from http://pytorch.org

  warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
2022-03-04 17:33:19 Batch Size: 4
2022-03-04 17:33:19 GPGPU: Cuda
2022-03-04 17:33:19 Network: IRR_PWC
2022-03-04 17:33:19 Number of parameters: 6362092
2022-03-04 17:33:19 Validation Key: epe
2022-03-04 17:33:19 Validation Loss: MultiScaleEPE_PWC_Bi_Occ_upsample
2022-03-04 17:33:19 ==> Checkpoint
2022-03-04 17:33:19 ==> Save Directory
2022-03-04 17:33:19 Save directory: saved_check_point/pwcnet/eval_temp/IRR_PWC
2022-03-04 17:33:19 ==> Optimizer
2022-03-04 17:33:19 Adam
2022-03-04 17:33:19 amsgrad: False
2022-03-04 17:33:19 betas: (0.9, 0.999)
2022-03-04 17:33:19 eps: 1e-08
2022-03-04 17:33:19 lr: 0.001
2022-03-04 17:33:19 weight_decay: 0
2022-03-04 17:33:19 ==> Learning Rate Scheduler
2022-03-04 17:33:19 class: None
2022-03-04 17:33:19 ==> Runtime
2022-03-04 17:33:19 start_epoch: 1
2022-03-04 17:33:19 total_epochs: 1

==> Progress: 0%| | 0/1 00:00<? ?s/ep

2022-03-04 17:33:19 ==> Epoch 1/1
==> Validate: 0%| | 0/261 00:00<? ?it/s
Traceback (most recent call last):
  File "../../main.py", line 89, in <module>
    main()
  File "../../main.py", line 86, in main
    validation_augmentation=validation_augmentation)
  File "/home/liuyuqiao/irr/runtime.py", line 555, in exec_runtime
    augmentation=validation_augmentation).run()
  File "/home/liuyuqiao/irr/runtime.py", line 427, in run
    loss_dict_per_step, output_dict, batch_size = self._step(example_dict)
  File "/home/liuyuqiao/irr/runtime.py", line 384, in _step
    loss_dict, output_dict = self._model_and_loss(example_dict)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liuyuqiao/irr/configuration.py", line 49, in forward
    output_dict = self._model(example_dict)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liuyuqiao/irr/models/IRR_PWC.py", line 59, in forward
    x1_pyramid = self.feature_pyramid_extractor(x1_raw) + [x1_raw]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liuyuqiao/irr/models/pwc_modules.py", line 101, in forward
    x = conv(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR

Looking forward to your reply

hurjunhwa commented 2 years ago

Hi,

Can it be due to this warning in your log?

/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:114: UserWarning: Found GPU0 GeForce RTX 2080 Ti which requires CUDA_VERSION >= 9000 for optimal performance and fast startup time, but your PyTorch was compiled with CUDA_VERSION 8000. Please install the correct PyTorch binary using instructions from http://pytorch.org/
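One way to check this kind of mismatch is to compare the GPU's compute capability with the CUDA version the PyTorch binary was built against. The sketch below is only an illustration: the table and the helper names `parse_version` and `cuda_build_supports` are assumptions made for this example, not part of the repository or of PyTorch. The warning only states a CUDA 9 floor, but native support for Turing cards such as the RTX 2080 Ti (compute capability 7.5) arrived with CUDA 10.0, so the table uses that value.

```python
# Earliest CUDA toolkit release that supports each GPU compute capability.
# Hypothetical lookup table for illustration; extend it for other GPUs.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 0): (8, 0),   # Pascal
    (6, 1): (8, 0),   # Pascal
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing, e.g. GeForce RTX 2080 Ti
}


def parse_version(text):
    """Turn a version string such as '9.0.176' into a (major, minor) tuple."""
    return tuple(int(part) for part in text.split(".")[:2])


def cuda_build_supports(capability, cuda_version):
    """Return True if a CUDA build of this version supports the capability."""
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if required is None:
        raise KeyError("capability %r is not in the table" % (capability,))
    return parse_version(cuda_version) >= required


# The setup from this issue: RTX 2080 Ti with a CUDA 8.0 build of PyTorch.
print(cuda_build_supports((7, 5), "8.0"))   # False
print(cuda_build_supports((7, 5), "10.0"))  # True

# Optional: report the real environment when PyTorch and a GPU are present.
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    print("PyTorch built with CUDA:", torch.version.cuda)
    print("cuDNN version:", torch.backends.cudnn.version())
    print("GPU capability:", torch.cuda.get_device_capability(0))
```

If the first check prints `False` for your card, installing a PyTorch binary built for a newer CUDA version is the likely next step.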

lyq998 commented 2 years ago

That warning appears because I used torch 0.4.1. To avoid it, I tried torch 1.1.0 and 1.5.0, but then a new error occurred:

Traceback (most recent call last):
  File "../../main.py", line 5, in <module>
    import commandline
  File "/home/liuyuqiao/irr_37/irr/commandline.py", line 14, in <module>
    import models
  File "/home/liuyuqiao/irr_37/irr/models/__init__.py", line 8, in <module>
    from . import pwcnet
  File "/home/liuyuqiao/irr_37/irr/models/pwcnet.py", line 8, in <module>
    from .correlation_package.correlation import Correlation
  File "/home/liuyuqiao/irr_37/irr/models/correlation_package/correlation.py", line 4, in <module>
    import correlation_cuda
ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory

I used CUDA 10.0 this time, yet the ImportError refers to 8.0. Do you have any idea why?
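A note on reading this message: the file name in the ImportError is the CUDA runtime library that the compiled `correlation_cuda` extension was linked against. A `libcudart.so.8.0` error under CUDA 10.0 therefore suggests that the extension was built earlier against CUDA 8.0 and has not been rebuilt since. The sketch below pulls that version out of the message; the helper name `linked_cudart_version` is hypothetical and exists only for this example.

```python
import re


def linked_cudart_version(error_message):
    """Return the version in a 'libcudart.so.X.Y' error message, or None."""
    match = re.search(r"libcudart\.so\.(\d+(?:\.\d+)*)", error_message)
    return match.group(1) if match else None


# The message from the traceback above.
message = ("libcudart.so.8.0: cannot open shared object file: "
           "No such file or directory")
print(linked_cudart_version(message))  # prints 8.0
```

If the printed version differs from the CUDA version of the installed PyTorch, rebuilding the extension in the current environment is the usual remedy.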

hurjunhwa commented 2 years ago

Ah, I didn't update the PWC-Net baseline in pwcnet.py to the current version of the correlation layer. Could you edit it yourself, following this comment? https://github.com/visinf/irr/issues/43#issuecomment-957798684

lyq998 commented 2 years ago

I have tried both versions:

`# correlation
out_corr_f = Correlation(pad_size=self.search_range, kernel_size=1, max_displacement=self.search_range, stride1=1, stride2=1, corr_multiply=1)(x1, x2_warp)
out_corr_b = Correlation(pad_size=self.search_range, kernel_size=1, max_displacement=self.search_range, stride1=1, stride2=1, corr_multiply=1)(x2, x1_warp)
# out_corr_f = compute_cost_volume(x1, x2_warp, self.corr_params)
# out_corr_b = compute_cost_volume(x2, x1_warp, self.corr_params)`

and

`# correlation
# out_corr_f = Correlation(pad_size=self.search_range, kernel_size=1, max_displacement=self.search_range, stride1=1, stride2=1, corr_multiply=1)(x1, x2_warp)
# out_corr_b = Correlation(pad_size=self.search_range, kernel_size=1, max_displacement=self.search_range, stride1=1, stride2=1, corr_multiply=1)(x2, x1_warp)
out_corr_f = compute_cost_volume(x1, x2_warp, self.corr_params)
out_corr_b = compute_cost_volume(x2, x1_warp, self.corr_params)`

Neither version works; both fail with RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR. I'm using torch==0.4.1, because 1.5.0 or 1.1.0 leads to ImportError: libcudart.so.8.0.

hurjunhwa commented 2 years ago

What's the error message when you use this?

out_corr_f = compute_cost_volume(x1, x2_warp, self.corr_params) 
out_corr_b = compute_cost_volume(x2, x1_warp, self.corr_params)

I'm not sure, but the problem more likely comes from a version mismatch between PyTorch and CUDA/cuDNN.

Also, comment out this line:

from .correlation_package.correlation import Correlation

so that the source code doesn't import the correlation layer written in CUDA.
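As an alternative to commenting the import out by hand, the import can be guarded so that a missing or stale build does not stop the program at start-up. This is a minimal sketch, not code from the repository: `cuda_correlation_available` is a hypothetical helper, and `correlation_cuda` is the extension module name taken from the traceback earlier in this thread.

```python
import importlib


def cuda_correlation_available(module_name="correlation_cuda"):
    """Return True only if the compiled extension can actually be imported."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers both a missing module and a failed shared-library load,
        # such as the "libcudart.so.8.0: cannot open shared object file" case.
        return False
    return True


if cuda_correlation_available():
    print("CUDA correlation layer is importable")
else:
    print("CUDA correlation layer unavailable; use compute_cost_volume instead")
```

The check performs a real import rather than only locating the file, because a stale build can exist on disk and still fail to load.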