Open L-Reichardt opened 2 years ago
I had a similar problem when running about 12 different models. I got the error only at the 8th model, and saw I'm actually out of GPU memory... I see you have 12GB of Memory, but maybe...
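If several models share one GPU, capping each session's CUDA memory arena may help. Below is a minimal sketch, assuming the `gpu_mem_limit`, `arena_extend_strategy` and `cudnn_conv_algo_search` options of the CUDA execution provider. The helper names `cuda_provider_config` and `load_sessions`, and the 2 GB default, are made up for illustration. Note that `gpu_mem_limit` only bounds the provider's arena, so total device memory use can still be higher.

```python
def cuda_provider_config(mem_limit_gb=2.0, device_id=0, conv_algo_search="EXHAUSTIVE"):
    """Build a providers list that caps the CUDA memory arena of one session.

    Hypothetical helper: the name and the default values are assumptions.
    """
    cuda_options = {
        "device_id": device_id,
        # Upper bound, in bytes, for this session's CUDA memory arena.
        "gpu_mem_limit": int(mem_limit_gb * 1024 ** 3),
        # Grow the arena by the requested amount only, rather than in larger steps.
        "arena_extend_strategy": "kSameAsRequested",
        # "EXHAUSTIVE" is the default; "HEURISTIC" and "DEFAULT" are the alternatives.
        "cudnn_conv_algo_search": conv_algo_search,
    }
    # Keep the CPU provider as a fallback for nodes CUDA cannot run.
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


def load_sessions(model_paths, mem_limit_gb=2.0):
    """Create one InferenceSession per model, each with the same memory cap."""
    import onnxruntime as ort  # imported lazily so the helper above has no dependencies

    providers = cuda_provider_config(mem_limit_gb)
    return [ort.InferenceSession(path, providers=providers) for path in model_paths]
```

The failing call in the log of this report is the exhaustive cuDNN algorithm search, so passing `conv_algo_search="HEURISTIC"` might be worth a try as well, though that is a guess and not a confirmed fix.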
I got a similar error:
RuntimeError: Error in execution: Non-zero status code returned while running FusedConv node. Name:'Conv_0' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/op_kernel_context.h:40 const T* onnxruntime::OpKernelContext::Input(int) const [with T = onnxruntime::Tensor] Missing Input: input
Turns out I was checking for CUDA availability before setting io_binding for my input. I only set the binding if CUDA was available, but used sess.run_with_iobinding(self.io_binding) in both the CUDA and CPU cases. The fix was to remove the conditional around the bind_cpu_input call, i.e.:
#### BAD ####
if ort.get_device() == "GPU":
    self.io_binding.bind_cpu_input("input", preproc_crops)
self.sess.run_with_iobinding(self.io_binding)
ort_outputs = self.io_binding.copy_outputs_to_cpu()
#############
#### GOOD ####
self.io_binding.bind_cpu_input("input", preproc_crops)
self.sess.run_with_iobinding(self.io_binding)
ort_outputs = self.io_binding.copy_outputs_to_cpu()
#############
In the docs, which I only read as a last resort :-p, it says io_binding.bind_cpu_input('input', X) will "copy the data over to the CUDA device if 'input' is consumed by nodes on the CUDA device". So if 'input' is not consumed by nodes on the CUDA device, no copy is made, but 'input' is still passed to the CPU.
The error in my original thinking was that the bind was only required if a GPU was available, but that meant 'input' was not bound to any device when no GPU was available.
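Put together, the pattern might look like the sketch below. It assumes `sess` is an `onnxruntime.InferenceSession`; the function name `run_bound` is hypothetical. The explicit `bind_output` calls are not in my snippet above; they follow the IOBinding example in the docs.

```python
def run_bound(sess, input_name, array):
    """Run `sess` through IOBinding, binding the input on every device.

    Hypothetical helper; `sess` is assumed to be an onnxruntime.InferenceSession.
    """
    binding = sess.io_binding()
    # Bind unconditionally: the data is copied to the GPU only if a CUDA node
    # consumes this input, otherwise it simply stays on the CPU.
    binding.bind_cpu_input(input_name, array)
    # Bind every model output by name; bind_output targets the CPU by default.
    for output in sess.get_outputs():
        binding.bind_output(output.name)
    sess.run_with_iobinding(binding)
    # Returns a list with one array per bound output.
    return binding.copy_outputs_to_cpu()
```

Usage would be something like `ort_outputs = run_bound(self.sess, "input", preproc_crops)`, with no check of ort.get_device() anywhere.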
Describe the bug
Inference fails with an onnxruntime-gpu/CUDA error when attempting to use an ONNX model converted from a TensorFlow .pb file. The model works fine on CPU. The TensorFlow model itself works fine on GPU. I found no related issue or solution. Models and logs can be found here: Google Drive (file format blocked by GitHub)
Urgency
Urgent: by Monday 21.03.2022
System information
To Reproduce
python -m tf2onnx.convert --input ./f32_input_face.pb --output ./model_face.onnx --inputs image_tensor:0 --outputs num_detections:0,detection_scores:0,detection_boxes:0 --output_frozen_graph ./optim_face.pb --opset 11
....
def(....):
    np_images = image.astype(np.float32, copy=False)
    np_images = np_images[np.newaxis, ...]  # Adds batch dimension
    num_boxes, scores, boxes = self.session.run(None, {self.input_name: np_images})
[E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=MEM-WAS-04 ; expr=cudnnFindConvolutionForwardAlgorithmEx( s.handle, s_.xtensor, s.xdata, s.wdesc, s.wdata, s.convdesc, s.ytensor, s.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), max_ws_size);
2022-03-16 17:42:29.548068687 [E:onnxruntime:, sequential_executor.cc:346 Execute] Non-zero status code returned while running FusedConv node. Name:'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D_FirstStageFeatureExtractor/resnet_v1_101/resnet_v1101/conv1/Relu' Status Message: CUDNN error executing cudnnFindConvolutionForwardAlgorithmEx( s.handle, s_.xtensor, s.xdata, s.wdesc, s.wdata, s.convdesc, s.ytensor, s.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), max_ws_size)
Traceback (most recent call last):
  File "anonymize.py", line 186, in <module>
    main(image_extensions=args.image_extensions,
  File "anonymize.py", line 179, in main
    anonymizer.anonymize_images(input_path=input_folder, output_path=output_folder,
  File "anonymize.py", line 144, in anonymize_images
    anonymized_image, detections = self.anonymize_image(image=image, detection_thresholds=detection_thresholds)
  File "anonymize.py", line 121, in anonymize_image
    new_boxes = detector.detect(image, detection_threshold=detection_thresholds[kind])
  File "/home/labor/Documents/Project/face_lisence_plate_anonymization/anonymizer/detection/detector.py", line 78, in detect
    num_boxes, scores, boxes = self.session.run(None, {self.input_name: np_images})
  File "/home/labor/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 192, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running FusedConv node. Name:'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/conv1/Conv2D_FirstStageFeatureExtractor/resnet_v1_101/resnet_v1101/conv1/Relu' Status Message: CUDNN error executing cudnnFindConvolutionForwardAlgorithmEx( s.handle, s_.xtensor, s.xdata, s.w_desc,