Closed connor-john closed 2 years ago
After checking https://github.com/Microsoft/onnxruntime/releases/tag/v1.8.1 , I noticed that ort-gpu 1.8.1
only supports cu10.1~cu11.1, while torch1.11 depends on cu11.3.
Hope this helps.
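The range check described in this comment can be sketched in a few lines of Python. This is only an illustration: the supported range (10.1 to 11.1) and the CUDA versions are the figures quoted in this thread, not values queried from onnxruntime or torch, and the helper names `parse_version` and `cuda_in_range` are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '11.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def cuda_in_range(cuda, lowest, highest):
    """Return True if `cuda` lies within [lowest, highest], compared numerically."""
    return parse_version(lowest) <= parse_version(cuda) <= parse_version(highest)


# Range quoted in this thread for ort-gpu 1.8.1 (an assumption, not read from the library).
ORT_GPU_1_8_1_CUDA = ("10.1", "11.1")

# torch 1.11 is said to depend on CUDA 11.3; the retried env used cudatoolkit 11.0.3.
torch_1_11_ok = cuda_in_range("11.3", *ORT_GPU_1_8_1_CUDA)
torch_1_7_1_ok = cuda_in_range("11.0.3", *ORT_GPU_1_8_1_CUDA)
print(torch_1_11_ok, torch_1_7_1_ok)
```

Comparing tuples of integers rather than raw strings avoids mistakes such as "10.1" sorting after "10.10".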
Thanks for spotting that, @tpoisonooo.
After creating a new environment with:
pytorch 1.7.1
cudatoolkit 11.0.3
cudnn 8.0.4
onnxruntime-gpu 1.8.1
I still get the same error
Full traceback
2022-06-03 09:51:29.663267904 [E:onnxruntime:Default, cuda_call.cc:117 CudaCall] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=connor ; expr=cudnnFindConvolutionForwardAlgorithmEx( CudnnHandle(), s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.y_tensor, s_.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), AlgoSearchWorkspaceSize);
2022-06-03 09:51:29.663309062 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running FusedConv node. Name:'Conv_6_Relu_7' Status Message: CUDNN error executing cudnnFindConvolutionForwardAlgorithmEx( CudnnHandle(), s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.y_tensor, s_.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), AlgoSearchWorkspaceSize)
2022-06-03 09:51:29.663563714 [E:onnxruntime:Default, cuda_call.cc:117 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=connor ; expr=cudaEventRecord(current_deferred_release_event, static_cast<cudaStream_t>(GetComputeStream()));
Traceback (most recent call last):
File "main.py", line 80, in <module>
main()
File "main.py", line 63, in main
bbox_xyxy, cls_conf, cls_ids = inference_model(model, img)
File "main.py", line 10, in inference_model
bbox_result = model([img])
File "/home/connor/Documents/github/mmdeploy-test-onnx/model.py", line 107, in __call__
outputs = self._forward({'input': input_img})
File "/home/connor/Documents/github/mmdeploy-test-onnx/model.py", line 43, in _forward
self.ort_session.run_with_iobinding(self.io_binding)
File "/home/connor/anaconda3/envs/onnx/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 229, in run_with_iobinding
self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while running FusedConv node. Name:'Conv_6_Relu_7' Status Message: CUDNN error executing cudnnFindConvolutionForwardAlgorithmEx( CudnnHandle(), s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.y_tensor, s_.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), AlgoSearchWorkspaceSize)
Aborted (core dumped)
I got mmdeploy test.py to run on GPU:
python tools/test.py \
    configs/mmdet/detection/detection_onnxruntime_dynamic.py \
    work_dir/faster_rcnn_r50_fpn_1x_coco.py \
    --model work_dir/end2end.onnx \
    --device cuda:0
env:
pytorch 1.7.1
cudatoolkit 11.0.3
cudnn 8.0.4
onnxruntime-gpu 1.8.1
mmdeploy 0.4.0
mmdet 2.20.0
mmcv-full 1.4.0
I had to add some of the symbolic helper functions from torch.onnx.symbolic_helper to mmdeploy/pytorch/ops/instance_norm.py since my torch version is < 1.8.0, just so mmdeploy would run:
def _is_tensor(x):
    return x.type().isSubtypeOf(torch._C.TensorType.get())


def _get_tensor_rank(x):
    if not _is_tensor(x) or x.type() is None:
        return None
    return x.type().dim()


def _get_tensor_sizes(x, allow_nonstatic=True):
    if not _is_tensor(x) or x.type() is None:
        return None
    if allow_nonstatic:
        # Each individual symbol is returned as None.
        # e.g. [1, 'a', 'b'] -> [1, None, None]
        return x.type().varyingSizes()
    # returns None, if exists any symbol in sizes.
    # e.g. [1, 'a', 'b'] -> None
    return x.type().sizes()


def _get_tensor_dim_size(x, dim):
    try:
        sizes = _get_tensor_sizes(x)
        return sizes[dim]
    except Exception:
        pass
    return None
Any idea why I always get the error from my previous comment whenever I run inference with the ONNX model on GPU outside of mmdeploy test.py?
Any insight into what solves the CuDNN runtime error is greatly appreciated, thanks.
What are your host CUDA and driver versions? A PyTorch installation comes with its own CUDA toolkit. Does it match your host?
I thought you would upgrade ort-gpu to 1.11.x ... As far as I know, mmdeploy is not fully tested with torch1.7.
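What the CUDNN error means: the call to cudnnFindConvolutionForwardAlgorithmEx was attempted but failed.
The CUDA error 700: an illegal memory access was encountered (as reported in the traceback).
Overall, it looks like a version mismatch.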
If it's convenient, share the ONNX model and I will give you an environment configuration that works properly.
My recommendation:
pytorch 1.11.0
cudatoolkit 11.3
cudnn 8.2.1
Open $WORK_DIR; there is an end2end.onnx.
Unit test ort-gpu inference with end2end.onnx.
Is running nvcc --version and nvidia-smi sufficient to get the host CUDA information? If so:
nvcc --version
Cuda compilation tools, release 10.1, V10.1.243
nvidia-smi
NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2
Is a CUDA mismatch between the conda env and the host the likely issue?
It still works with tools/test.py, though.
If that's the case, I will set up my system again with your recommended env of
pytorch 1.11.0
cudatoolkit 11.3
cudnn 8.2.1
I will provide the ONNX model after that if I am still having issues.
Thanks again, @tpoisonooo.
Quick question: can I use a higher MMCV version than the 1.4.0 recommended in the docs?
I had issues installing MMCV==1.4.0 with the higher pytorch version 1.11.0.
Or should I just compile MMCV from source instead?
Check mmcv & torch version here.
cuda11.3 needs driver >= 465
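The driver requirement above can be checked against the `nvidia-smi` and `nvcc --version` output posted earlier in the thread. The following is a rough sketch that parses those two text lines with regular expressions; the function names are hypothetical, the output formats are assumed to look like the lines quoted in this thread, and the minimum of 465 is the figure stated in this comment rather than one looked up independently.

```python
import re


def driver_version(nvidia_smi_header):
    """Extract the driver version from an nvidia-smi header line as a tuple of ints."""
    match = re.search(r"Driver Version:\s*([\d.]+)", nvidia_smi_header)
    if match is None:
        raise ValueError("no driver version found")
    return tuple(int(part) for part in match.group(1).split("."))


def nvcc_release(nvcc_output):
    """Extract the (major, minor) toolkit release from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no release found")
    return (int(match.group(1)), int(match.group(2)))


# Lines as posted in this thread.
smi = "NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2"
nvcc = "Cuda compilation tools, release 10.1, V10.1.243"

MIN_DRIVER_FOR_CUDA_11_3 = (465,)  # figure quoted in this comment

driver = driver_version(smi)
toolkit = nvcc_release(nvcc)
driver_ok = driver >= MIN_DRIVER_FOR_CUDA_11_3
print(driver, toolkit, driver_ok)
```

With the values from this thread, the 460.32.03 driver falls below the stated minimum, so the driver would need upgrading before moving to cudatoolkit 11.3.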
Thanks for your help, @tpoisonooo.
Using your recommended env did help.
I was able to make it work in my old environment after getting similar failures in the new env.
I found that the issue was that the input tensor wasn't being moved to the GPU device in my test inference code. This shouldn't affect anyone else; I'm just noting it down in case someone else has a similar issue.
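A small guard along the following lines could catch this kind of mistake before the session is run with IO binding. It is only a sketch under assumptions: it relies on each input exposing `.device.type`, as torch tensors do, the helper name `inputs_off_device` is hypothetical, and the stand-in objects below merely mimic tensors so the example runs without torch installed.

```python
from types import SimpleNamespace


def inputs_off_device(inputs, expected_type="cuda"):
    """Return the names of inputs whose `.device.type` is not `expected_type`."""
    return [name for name, tensor in inputs.items()
            if tensor.device.type != expected_type]


# Stand-ins that mimic the `.device.type` attribute of torch tensors (assumption).
cpu_tensor = SimpleNamespace(device=SimpleNamespace(type="cpu"))
gpu_tensor = SimpleNamespace(device=SimpleNamespace(type="cuda"))

bad = inputs_off_device({"input": cpu_tensor})
good = inputs_off_device({"input": gpu_tensor})
print(bad, good)
```

In real inference code the dictionary values would be torch tensors, and any name reported by the check would be moved with something like `tensor.to("cuda:0")` before its data pointer is bound.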
Created an ONNX model for the mmdet faster_rcnn_r50_fpn_1x_coco model, both as onnx_static and onnx_dynamic. Creating the model works, and testing on CPU works.
When testing the model on GPU with onnxruntime-gpu==1.8.1, both produce a CUDNN error:
The inferencing is the same as in mmdeploy
Env.
Is this an issue with how the mmdetection model was created, or an onnxruntime-gpu specific issue? Any help is appreciated, thank you.