microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

OpenVINO: Encountered unknown exception in Run() #20069

Open mertalev opened 3 months ago

mertalev commented 3 months ago

Describe the issue

When using OpenVINO, the session can be created, but calling `run` fails with `RuntimeException: [ONNXRuntimeError] RUNTIME_EXCEPTION: Encountered unknown exception`. Based on reports in this issue, there seems to be a pattern with the N100 CPU in particular.

This seems to be a regression: the error only appears after upgrading to onnxruntime-openvino 1.17.1 with OpenVINO 2023.3.0. The model worked with 1.15.0 and OpenVINO 2023.1.0.

After enabling the following environment variables:

ORT_OPENVINO_ENABLE_CI_LOG=1
ORT_OPENVINO_ENABLE_DEBUG=1
OPENVINO_LOG_LEVEL=5
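
A minimal sketch of setting these from Python, assuming they are set before the session is created:

```python
import os

# Enable OpenVINO EP debug logging (same variables as above).
os.environ["ORT_OPENVINO_ENABLE_CI_LOG"] = "1"
os.environ["ORT_OPENVINO_ENABLE_DEBUG"] = "1"
os.environ["OPENVINO_LOG_LEVEL"] = "5"
```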

With these enabled, there are a few additional logs, but none seem pertinent:

In the OpenVINO EP
Model is fully supported on OpenVINO
CreateNgraphFunc

To reproduce

With onnxruntime-openvino 1.17.1 and OpenVINO 2023.3.0, create a session with the following providers:

['OpenVINOExecutionProvider', 'CPUExecutionProvider']

And the following provider options:

[{'device_type': 'GPU_FP32', 'cache_dir': '/tmp/facial-recognition/buffalo_l/openvino'}, {'arena_extend_strategy': 'kSameAsRequested'}] 

Then attempt to run inference with this model. It may or may not work depending on the CPU.
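
In code, a minimal sketch of the above (the model path is a placeholder, and the 1x3x112x112 input shape is an assumption based on typical buffalo_l recognition inputs):

```python
import numpy as np
import onnxruntime as ort

# Placeholder path to the buffalo_l recognition model referenced above.
session = ort.InferenceSession(
    "recognition.onnx",
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
    provider_options=[
        {"device_type": "GPU_FP32", "cache_dir": "/tmp/facial-recognition/buffalo_l/openvino"},
        {"arena_extend_strategy": "kSameAsRequested"},
    ],
)

# Session creation succeeds; the exception is raised on the first run() call.
input_name = session.get_inputs()[0].name
dummy = np.zeros((1, 3, 112, 112), dtype=np.float32)  # assumed input shape
session.run(None, {input_name: dummy})  # RUNTIME_EXCEPTION on affected CPUs
```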

You may use this image to get the exact software environment that produces the issue: ghcr.io/immich-app/immich-machine-learning@sha256:01799596c7f40495887d4027df1c0f4c144c7cd6ab34937ef2cc14d246470095

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

OpenVINO

Execution Provider Library Version

2023.3.0

jywu-msft commented 3 months ago

+@sfatimar, @preetha-intel

Disty0 commented 3 months ago

I'm having the same error with this model: https://huggingface.co/SmilingWolf/wd-swinv2-tagger-v3. Other variants of this model (convnext and vit) run fine, but swinv2 fails with the following logs:

Using onnxruntime-openvino==1.17.1 on Arch Linux 6.8.2 with this script: https://github.com/kohya-ss/sd-scripts/blob/dev/finetune/tag_images_by_wd14_tagger.py

Command:

ORT_OPENVINO_ENABLE_CI_LOG=1 ORT_OPENVINO_ENABLE_DEBUG=1 OPENVINO_LOG_LEVEL=5 python finetune/tag_images_by_wd14_tagger.py --model_dir "~/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --append_tags --onnx --caption_separator ", " --batch_size 1 --caption_extension ".txt" point_to_a_folder_with_images/

On CPU with OpenVINO ({'device_type': 'CPU_FP32'}):

```
2024-03-29 20:38:42,374 - __main__ - INFO - loading onnx model: /mnt/DataSSD/AI/models/wd14_tagger_model/SmilingWolf_wd-swinv2-tagger-v3/model.onnx
In the OpenVINO EP
CreateNgraphFunc
CreateNgraphFunc
CreateNgraphFunc
CreateNgraphFunc
CreateNgraphFunc
2024-03-29 20:38:43.498619390 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-03-29 20:38:43.498633230 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-03-29 20:38:44,003 - __main__ - INFO - found 151 images.
  0%|          | 0/151 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/DataSSD/AI/Apps/rocm/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py", line 448, in <module>
    main(args)
  File "/mnt/DataSSD/AI/Apps/rocm/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py", line 321, in main
    run_batch(b_imgs)
  File "/mnt/DataSSD/AI/Apps/rocm/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py", line 199, in run_batch
    probs = ort_sess.run(None, {input_name: imgs})[0]  # onnx output numpy
  File "/mnt/DataSSD/AI/Apps/ipex/kohya_ss/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Encountered unknown exception in Run()
```

On an Intel ARC A770 ({'device_type': 'GPU.0_FP32'} or {'device_type': 'GPU_FP32'}):

```
2024-03-29 20:30:00,068 - __main__ - INFO - loading onnx model: /mnt/DataSSD/AI/models/wd14_tagger_model/SmilingWolf_wd-swinv2-tagger-v3/model.onnx
In the OpenVINO EP
CreateNgraphFunc
CreateNgraphFunc
CreateNgraphFunc
CreateNgraphFunc
CreateNgraphFunc
2024-03-29 20:30:01.866096452 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-03-29 20:30:01.866111062 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-03-29 20:30:02,360 - __main__ - INFO - found 151 images.
  0%|          | 0/151 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/DataSSD/AI/Apps/rocm/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py", line 448, in <module>
    main(args)
  File "/mnt/DataSSD/AI/Apps/rocm/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py", line 321, in main
    run_batch(b_imgs)
  File "/mnt/DataSSD/AI/Apps/rocm/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py", line 199, in run_batch
    probs = ort_sess.run(None, {input_name: imgs})[0]  # onnx output numpy
  File "/mnt/DataSSD/AI/Apps/ipex/kohya_ss/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Encountered unknown exception in Run()
```

On an AMD RX 7900 XTX ({'device_type': 'GPU.1_FP32'}) (different error and more useful logs):

```
2024-03-29 20:35:50,890 - __main__ - INFO - loading onnx model: /mnt/DataSSD/AI/models/wd14_tagger_model/SmilingWolf_wd-swinv2-tagger-v3/model.onnx
In the OpenVINO EP
CreateNgraphFunc
lld: error: undefined hidden symbol: _fc_bf_tiled_kernel_default_fully_connected_gpu_bf_tiled_12472420788504233070_0__sa
>>> referenced by /tmp/comgr-3b4438/input/linked.bc.o:(fully_connected_gpu_bf_tiled_12472420788504233070_0__sa)
>>> referenced by /tmp/comgr-3b4438/input/linked.bc.o:(fully_connected_gpu_bf_tiled_12472420788504233070_0__sa)
Error: Creating the executable from LLVM IRs failed.
....
lld: error: undefined hidden symbol: _fc_bf_tiled_kernel_default_fully_connected_gpu_bf_tiled_5508805082171153497_0__sa
>>> referenced by /tmp/comgr-9d5597/input/linked.bc.o:(fully_connected_gpu_bf_tiled_5508805082171153497_0__sa)
>>> referenced by /tmp/comgr-9d5597/input/linked.bc.o:(fully_connected_gpu_bf_tiled_5508805082171153497_0__sa)
Error: Creating the executable from LLVM IRs failed.
....
2024-03-29 20:35:53.575513812 [E:onnxruntime:, inference_session.cc:1985 Initialize] Encountered unknown exception in Initialize()
Traceback (most recent call last):
  File "/mnt/DataSSD/AI/Apps/rocm/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py", line 448, in <module>
    main(args)
  File "/mnt/DataSSD/AI/Apps/rocm/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py", line 148, in main
    ort_sess = ort.InferenceSession(
  File "/mnt/DataSSD/AI/Apps/ipex/kohya_ss/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/mnt/DataSSD/AI/Apps/ipex/kohya_ss/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Encountered unknown exception in Initialize()
```

The same script runs fine on CPUExecutionProvider and ROCmExecutionProvider (from https://pypi.lsh.sh/60/onnxruntime-training/).

sfatimar commented 3 months ago

Yes, there is a regression. The binary DLL uploaded in github.com/intel/onnxruntime for onnxruntime-openvino 1.17.1 is compatible only with OpenVINO 2023.3.0. There was a change in the exception API in OpenVINO which was not handled properly, so backward compatibility was broken. But it is possible to build the 1.17.1 code with OpenVINO 2023.1.0 and execute ...

Disty0 commented 3 months ago

I tried building from the main branch (commit id: a2998e5d425d48f83f817f6503c6bd196e2cb3ae) and it runs fine now.

I had to use OpenVINO 2023.3 since building with 2024.0 segfaulted on import. 2023.3 still fails when running tests, but it runs fine for my use case, so I added --skip_tests to build.sh.

I ran into the same issue as https://github.com/intel/neural-speed/issues/188 and had to add --compile_no_warning_as_error to build.sh.

Build command:

./build.sh --config RelWithDebInfo --use_openvino GPU_FP32 --parallel --build_shared_lib --build_wheel --compile_no_warning_as_error --skip_tests

mertalev commented 3 months ago

> Yes, there is a regression. The binary DLL uploaded in github.com/intel/onnxruntime for onnxruntime-openvino 1.17.1 is compatible only with OpenVINO 2023.3.0.
>
> There was a change in the exception API in OpenVINO which was not handled properly, so backward compatibility was broken. But it is possible to build the 1.17.1 code with OpenVINO 2023.1.0 and execute ...

To clarify, the issue is occurring with 2023.3.0. The pattern I'm seeing is that it works on CPUs with Iris Xe graphics, but not on CPUs with UHD graphics.

shummo commented 2 months ago

Is it possible to downgrade to onnxruntime-openvino 1.15.0 with OpenVINO 2023.1.0? If so, how can I do that with Docker Compose? (Sorry, but I'm not an expert.) Thanks

henryruhs commented 1 month ago

I can confirm that this is an existing issue that breaks OpenVINO with Intel Arc (770) under Windows.

@shummo A downgrade to onnxruntime==1.15.0 and openvino==2023.1.0 solved it.

ankitm3k commented 1 month ago

Hi @mertalev, I have tested your model and it runs inference successfully on both Windows 11 and Ubuntu 22.04, for both CPU and GPU. I would recommend that you and the community use the latest OpenVINO Toolkit v2024.1 and OpenVINO EP v1.18.0, which will be available soon in the upcoming ONNX Runtime release. You can also build OpenVINO EP from source for the same.

mertalev commented 1 month ago

Thanks for the testing and update! We'll upgrade to 1.18.0 and 2024.1.0 once the former is available.

When you mention that it works on GPU, can you clarify whether you tested with an iGPU like UHD Graphics or a dGPU like Arc (and I believe Iris Xe is also counted as a dGPU)? iGPUs struggle, but I haven't seen anyone with a dGPU have an issue with this model.

ankitm3k commented 1 month ago

> Thanks for the testing and update! We'll upgrade to 1.18.0 and 2024.1.0 once the former is available.
>
> When you mention that it works on GPU, can you clarify whether you tested with an iGPU like UHD Graphics or a dGPU like Arc (and I believe Iris Xe is also counted as a dGPU)? iGPUs struggle, but I haven't seen anyone with a dGPU have an issue with this model.

It was tested on a Meteor Lake processor (Intel Core Ultra 7 1003H) with an iGPU (Intel Arc Graphics). I'd recommend trying your application on multiple platforms. As suggested above, you can also build OpenVINO EP from the main branch of this repository to get the latest wheels for your environment and reproduce the same.

Snuupy commented 1 month ago

Thanks for the update!

> It was tested on a Meteor Lake processor (Intel Core Ultra 7 1003H) with an iGPU (Intel Arc Graphics).

Hi, could you please add Intel UHD Graphics to your testing setup? It is currently broken on UHD Graphics but not on Xe Graphics, so even if it works on your setup, it may still be broken on UHD.

henryruhs commented 1 month ago

@ankitm3k No, onnxruntime-openvino does not work with the latest OpenVINO 2024.1 ... just release 1.18.0 finally.

ankitm3k commented 1 month ago

> Thanks for the testing and update! We'll upgrade to 1.18.0 and 2024.1.0 once the former is available.
>
> When you mention that it works on GPU, can you clarify whether you tested with an iGPU like UHD Graphics or a dGPU like Arc (and I believe Iris Xe is also counted as a dGPU)? iGPUs struggle, but I haven't seen anyone with a dGPU have an issue with this model.

Hi @mertalev @Snuupy

I have tested your model with our C++ onnxruntime_perf_test app built from source, and it runs inference successfully with the machine configurations below.

Machine 1:
- OS: Windows 11
- CPU: Raptor Lake arch
- iGPU: Intel UHD Graphics 770
- dGPU: Intel Arc A380 Graphics

Machine 2:
- OS: Windows 11
- CPU: i7-1270P
- iGPU: Intel Iris Xe Graphics

I'd recommend using either Intel's master or rel-1.18.0 branch (https://github.com/intel/onnxruntime.git), or Microsoft's master branch (https://github.com/microsoft/onnxruntime.git) or rel-1.18.0 branch (https://github.com/microsoft/onnxruntime/commits/rel-1.18.0/), to build the wheels using the command below:

[image: screenshot of the build command]

Find the wheel under the build output path and install it using pip, e.g.: pip install ${CWD}\onnxruntime\build\Windows\Debug\Debug\dist\onnxruntime_openvino-1.19.0-cp310-cp310-win_amd64.whl

Snuupy commented 4 weeks ago

@ankitm3k

> I have tested your model with our C++ onnxruntime_perf_test app built from source, and it runs inference successfully with the machine configurations below. Machine 1: OS: Windows 11; CPU: Raptor Lake arch; iGPU: Intel UHD Graphics 770; dGPU: Intel Arc A380 Graphics

Doesn't onnxruntime default to the dGPU if one is present? So it would run on the A380 (working) instead of the UHD 770 iGPU (not working).

I can try substituting 1.18 to see if that fixes anything regardless.

ankitm3k commented 3 weeks ago

> @ankitm3k
>
> > I have tested your model with our C++ onnxruntime_perf_test app built from source, and it runs inference successfully with the machine configurations below. Machine 1: OS: Windows 11; CPU: Raptor Lake arch; iGPU: Intel UHD Graphics 770; dGPU: Intel Arc A380 Graphics
>
> Doesn't onnxruntime default to the dGPU if one is present? So it would run on the A380 (working) instead of the UHD 770 iGPU (not working).
>
> I can try substituting 1.18 to see if that fixes anything regardless.

The onnxruntime default device_type for GPU is the iGPU (GPU.0); if you want to explicitly use the dGPU (GPU.1), set device_type to GPU.1 in your inference provider options.
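
For example, a minimal sketch (the model path here is a placeholder):

```python
import onnxruntime as ort

# GPU.0 is the iGPU and GPU.1 the dGPU; pin the session to the dGPU explicitly.
session = ort.InferenceSession(
    "model.onnx",  # placeholder
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"device_type": "GPU.1_FP32"}, {}],
)
```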

Please build your onnxruntime-openvino wheels from the main branch and install them in your Python virtual env so that you get the latest release changes. This should solve your issues.