microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[BUG] [OpenVino EP] Only first result in session is correct. #19975

Open debugmenot opened 5 months ago

debugmenot commented 5 months ago

Describe the issue

When running an inference session with the OpenVINO EP and ORT > 1.13.1, only the first result is correct; every subsequent result is wrong. There are no issues with ORT == 1.13.1, or with CPU/CUDA/XNNPACK on any ORT version.

The issue appears only with one model (Attention OCR) - the model structure can be found at the bottom; other models work fine. It seems some layers/functions in it were broken after the 1.13.1 build...

Description:

Ubuntu 22.04, ONNX Runtime 1.17.1, OpenVINO 2023.3, C++. Model: a sort of attention-decoder OCR, converted to ONNX from PyTorch.

Issue: I'm running inference on the same image (also tried a sequence of different images during the session). Only the FIRST result is correct. The second and subsequent results look like a partially "cropped" first result, regardless of whether the next input is new... For example, inferencing a sequence of images with text "1234567890", "ABCDEFGHJK", "7777777777" gives: "1234567890", "1200120012", "1200120012"...

Downgrading to ORT 1.13.1 solves the issue, so it seems something broke after the 1.13.1 build. All other EPs (CPU, CUDA, XNNPACK) work well with the same code.
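
A minimal Python sketch of the same check (hedged: the original code is C++, and the model path and input shape below are placeholders, not taken from this report) - run the same input several times in one session and compare each output against the first:

import numpy as np
import onnxruntime as rt

# Hypothetical model path and input shape; substitute the real ones.
sess = rt.InferenceSession("attention_ocr.onnx",
                           providers=["OpenVINOExecutionProvider"])
input_name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 100, 100).astype(np.float32)

first = sess.run(None, {input_name: x})[0]
for i in range(5):
    out = sess.run(None, {input_name: x})[0]
    # With the affected builds, runs after the first reportedly stop matching.
    print(i, "matches first run:", np.allclose(out, first))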

Found one reference to a similar issue in the OpenVINO GitHub: https://github.com/openvinotoolkit/openvino/issues/12966

I enabled verbose mode and found that node placements differ between the 1.17.1 (incorrect) and 1.13.1 (correct) inference sessions. Maybe that matters, but it doesn't explain why the first result is always correct:

Correct inference session node placements (1.13.1):

* Node placements
*Node(s) placed on [OpenVINOExecutionProvider]. Number of nodes: 11

OpenVINO-EP-subgraph_1 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_1_0)
OpenVINO-EP-subgraph_2 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_2_1)
OpenVINO-EP-subgraph_3 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_3_2)
OpenVINO-EP-subgraph_4 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_4_3)
OpenVINO-EP-subgraph_5 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_5_4)
OpenVINO-EP-subgraph_6 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_6_5)
OpenVINO-EP-subgraph_7 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_7_6)
OpenVINO-EP-subgraph_8 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_8_7)
OpenVINO-EP-subgraph_9 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_9_8)
OpenVINO-EP-subgraph_10 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_10_9)
OpenVINO-EP-subgraph_11 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_11_10)
*Node(s) placed on [CPUExecutionProvider]. Number of nodes: 167
GRU (/decoder/rnn/GRU)
LogSoftmax (/decoder/LogSoftmax)
ArgMax (/decoder/ArgMax)
Unsqueeze (/decoder/Unsqueeze)
Transpose (/decoder/Transpose_2)
Gather (/decoder/emb_1/Gather)
Expand (/decoder/attention_1/Expand)
Transpose (/decoder/attention_1/Transpose)
Concat (/decoder/attention_1/Concat)
MatMul (/decoder/attention/attn_1/MatMul)
Add (/decoder/attention/attn_1/Add)
Tanh (/decoder/attention_1/Tanh)
Softmax (/decoder/attention_1/Softmax)
MatMul (/decoder/MatMul_1)
Transpose (/decoder/Transpose_3)
Concat (/decoder/Concat_1)
GRU (/decoder/rnn_1/GRU)
LogSoftmax (/decoder/LogSoftmax_1)
ArgMax (/decoder/ArgMax_1)
Unsqueeze (/decoder/Unsqueeze_1)
Transpose (/decoder/Transpose_4)
Gather (/decoder/emb_2/Gather)
Expand (/decoder/attention_2/Expand)
Transpose (/decoder/attention_2/Transpose)
Concat (/decoder/attention_2/Concat)
MatMul (/decoder/attention/attn_2/MatMul)
Add (/decoder/attention/attn_2/Add)
Tanh (/decoder/attention_2/Tanh)
Softmax (/decoder/attention_2/Softmax)
MatMul (/decoder/MatMul_2)
Transpose (/decoder/Transpose_5)
Concat (/decoder/Concat_2)
GRU (/decoder/rnn_2/GRU)
LogSoftmax (/decoder/LogSoftmax_2)
ArgMax (/decoder/ArgMax_2)
Unsqueeze (/decoder/Unsqueeze_2)
Transpose (/decoder/Transpose_6)
Gather (/decoder/emb_3/Gather)
Expand (/decoder/attention_3/Expand)
Transpose (/decoder/attention_3/Transpose)
Concat (/decoder/attention_3/Concat)
MatMul (/decoder/attention/attn_3/MatMul)
Add (/decoder/attention/attn_3/Add)
Tanh (/decoder/attention_3/Tanh)
Softmax (/decoder/attention_3/Softmax)
MatMul (/decoder/MatMul_3)
Transpose (/decoder/Transpose_7)
Concat (/decoder/Concat_3)
GRU (/decoder/rnn_3/GRU)
LogSoftmax (/decoder/LogSoftmax_3)
ArgMax (/decoder/ArgMax_3)
Unsqueeze (/decoder/Unsqueeze_3)
Transpose (/decoder/Transpose_8)
Gather (/decoder/emb_4/Gather)
Expand (/decoder/attention_4/Expand)
Transpose (/decoder/attention_4/Transpose)
Concat (/decoder/attention_4/Concat)
MatMul (/decoder/attention/attn_4/MatMul)
Add (/decoder/attention/attn_4/Add)
Tanh (/decoder/attention_4/Tanh)
Softmax (/decoder/attention_4/Softmax)
MatMul (/decoder/MatMul_4)
Transpose (/decoder/Transpose_9)
Concat (/decoder/Concat_4)
GRU (/decoder/rnn_4/GRU)
LogSoftmax (/decoder/LogSoftmax_4)
ArgMax (/decoder/ArgMax_4)
Unsqueeze (/decoder/Unsqueeze_4)
Transpose (/decoder/Transpose_10)
Gather (/decoder/emb_5/Gather)
Expand (/decoder/attention_5/Expand)
Transpose (/decoder/attention_5/Transpose)
Concat (/decoder/attention_5/Concat)
MatMul (/decoder/attention/attn_5/MatMul)
Add (/decoder/attention/attn_5/Add)
Tanh (/decoder/attention_5/Tanh)
Softmax (/decoder/attention_5/Softmax)
MatMul (/decoder/MatMul_5)
Transpose (/decoder/Transpose_11)
Concat (/decoder/Concat_5)
GRU (/decoder/rnn_5/GRU)
LogSoftmax (/decoder/LogSoftmax_5)
ArgMax (/decoder/ArgMax_5)
Unsqueeze (/decoder/Unsqueeze_5)
Transpose (/decoder/Transpose_12)
Gather (/decoder/emb_6/Gather)
Expand (/decoder/attention_6/Expand)
Transpose (/decoder/attention_6/Transpose)
Concat (/decoder/attention_6/Concat)
MatMul (/decoder/attention/attn_6/MatMul)
Add (/decoder/attention/attn_6/Add)
Tanh (/decoder/attention_6/Tanh)
Softmax (/decoder/attention_6/Softmax)
MatMul (/decoder/MatMul_6)
Transpose (/decoder/Transpose_13)
Concat (/decoder/Concat_6)
GRU (/decoder/rnn_6/GRU)
LogSoftmax (/decoder/LogSoftmax_6)
ArgMax (/decoder/ArgMax_6)
Unsqueeze (/decoder/Unsqueeze_6)
Transpose (/decoder/Transpose_14)
Gather (/decoder/emb_7/Gather)
Expand (/decoder/attention_7/Expand)
Transpose (/decoder/attention_7/Transpose)
Concat (/decoder/attention_7/Concat)
MatMul (/decoder/attention/attn_7/MatMul)
Add (/decoder/attention/attn_7/Add)
Tanh (/decoder/attention_7/Tanh)
Softmax (/decoder/attention_7/Softmax)
MatMul (/decoder/MatMul_7)
Transpose (/decoder/Transpose_15)
Concat (/decoder/Concat_7)
GRU (/decoder/rnn_7/GRU)
LogSoftmax (/decoder/LogSoftmax_7)
ArgMax (/decoder/ArgMax_7)
Unsqueeze (/decoder/Unsqueeze_7)
Transpose (/decoder/Transpose_16)
Gather (/decoder/emb_8/Gather)
Expand (/decoder/attention_8/Expand)
Transpose (/decoder/attention_8/Transpose)
Concat (/decoder/attention_8/Concat)
MatMul (/decoder/attention/attn_8/MatMul)
Add (/decoder/attention/attn_8/Add)
Tanh (/decoder/attention_8/Tanh)
Softmax (/decoder/attention_8/Softmax)
MatMul (/decoder/MatMul_8)
Transpose (/decoder/Transpose_17)
Concat (/decoder/Concat_8)
GRU (/decoder/rnn_8/GRU)
LogSoftmax (/decoder/LogSoftmax_8)
ArgMax (/decoder/ArgMax_8)
Unsqueeze (/decoder/Unsqueeze_8)
Transpose (/decoder/Transpose_18)
Gather (/decoder/emb_9/Gather)
Expand (/decoder/attention_9/Expand)
Transpose (/decoder/attention_9/Transpose)
Concat (/decoder/attention_9/Concat)
MatMul (/decoder/attention/attn_9/MatMul)
Add (/decoder/attention/attn_9/Add)
Tanh (/decoder/attention_9/Tanh)
Softmax (/decoder/attention_9/Softmax)
MatMul (/decoder/MatMul_9)
Transpose (/decoder/Transpose_19)
Concat (/decoder/Concat_9)
GRU (/decoder/rnn_9/GRU)
LogSoftmax (/decoder/LogSoftmax_9)
Unsqueeze (/decoder/Unsqueeze_9)
Unsqueeze (/decoder/Unsqueeze_10)
Unsqueeze (/decoder/Unsqueeze_11)
Unsqueeze (/decoder/Unsqueeze_12)
Unsqueeze (/decoder/Unsqueeze_13)
Unsqueeze (/decoder/Unsqueeze_14)
Unsqueeze (/decoder/Unsqueeze_15)
Unsqueeze (/decoder/Unsqueeze_16)
Unsqueeze (/decoder/Unsqueeze_17)
Unsqueeze (/decoder/Unsqueeze_18)
Concat (/decoder/Concat_10)
Transpose (/decoder/Transpose_20)
FusedMatMul (MatMul_With_Transpose)
FusedMatMul (MatMul_With_Transpose_token_0)
FusedMatMul (MatMul_With_Transpose_token_1)
FusedMatMul (MatMul_With_Transpose_token_2)
FusedMatMul (MatMul_With_Transpose_token_3)
FusedMatMul (MatMul_With_Transpose_token_4)
FusedMatMul (MatMul_With_Transpose_token_5)
FusedMatMul (MatMul_With_Transpose_token_6)
FusedMatMul (MatMul_With_Transpose_token_7)

Incorrect inference session node placements (1.17.1):

* Node placements
*Node(s) placed on [OpenVINOExecutionProvider]. Number of nodes: 11

OpenVINO-EP-subgraph_1 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_1_0)
OpenVINO-EP-subgraph_2 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_2_1)
OpenVINO-EP-subgraph_3 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_3_2)
OpenVINO-EP-subgraph_4 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_4_3)
OpenVINO-EP-subgraph_5 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_5_4)
OpenVINO-EP-subgraph_6 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_6_5)
OpenVINO-EP-subgraph_7 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_7_6)
OpenVINO-EP-subgraph_8 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_8_7)
OpenVINO-EP-subgraph_9 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_9_8)
OpenVINO-EP-subgraph_10 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_10_9)
OpenVINO-EP-subgraph_11 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_11_10)
*Node(s) placed on [CPUExecutionProvider]. Number of nodes: 167
GRU (/decoder/rnn/GRU)
LogSoftmax (/decoder/LogSoftmax)
ArgMax (/decoder/ArgMax)
Unsqueeze (/decoder/Unsqueeze)
Transpose (/decoder/Transpose_2)
Gather (/decoder/emb_1/Gather)
Expand (/decoder/attention_1/Expand)
Transpose (/decoder/attention_1/Transpose)
Concat (/decoder/attention_1/Concat)
MatMul (/decoder/attention/attn_1/MatMul)
Add (/decoder/attention/attn_1/Add)
Tanh (/decoder/attention_1/Tanh)
Softmax (/decoder/attention_1/Softmax)
MatMul (/decoder/MatMul_1)
Transpose (/decoder/Transpose_3)
Concat (/decoder/Concat_1)
GRU (/decoder/rnn_1/GRU)
LogSoftmax (/decoder/LogSoftmax_1)
ArgMax (/decoder/ArgMax_1)
Unsqueeze (/decoder/Unsqueeze_1)
Transpose (/decoder/Transpose_4)
Gather (/decoder/emb_2/Gather)
Expand (/decoder/attention_2/Expand)
Transpose (/decoder/attention_2/Transpose)
Concat (/decoder/attention_2/Concat)
MatMul (/decoder/attention/attn_2/MatMul)
Add (/decoder/attention/attn_2/Add)
Tanh (/decoder/attention_2/Tanh)
Softmax (/decoder/attention_2/Softmax)
MatMul (/decoder/MatMul_2)
Transpose (/decoder/Transpose_5)
Concat (/decoder/Concat_2)
GRU (/decoder/rnn_2/GRU)
LogSoftmax (/decoder/LogSoftmax_2)
ArgMax (/decoder/ArgMax_2)
Unsqueeze (/decoder/Unsqueeze_2)
Transpose (/decoder/Transpose_6)
Gather (/decoder/emb_3/Gather)
Expand (/decoder/attention_3/Expand)
Transpose (/decoder/attention_3/Transpose)
Concat (/decoder/attention_3/Concat)
MatMul (/decoder/attention/attn_3/MatMul)
Add (/decoder/attention/attn_3/Add)
Tanh (/decoder/attention_3/Tanh)
Softmax (/decoder/attention_3/Softmax)
MatMul (/decoder/MatMul_3)
Transpose (/decoder/Transpose_7)
Concat (/decoder/Concat_3)
GRU (/decoder/rnn_3/GRU)
LogSoftmax (/decoder/LogSoftmax_3)
ArgMax (/decoder/ArgMax_3)
Unsqueeze (/decoder/Unsqueeze_3)
Transpose (/decoder/Transpose_8)
Gather (/decoder/emb_4/Gather)
Expand (/decoder/attention_4/Expand)
Transpose (/decoder/attention_4/Transpose)
Concat (/decoder/attention_4/Concat)
MatMul (/decoder/attention/attn_4/MatMul)
Add (/decoder/attention/attn_4/Add)
Tanh (/decoder/attention_4/Tanh)
Softmax (/decoder/attention_4/Softmax)
MatMul (/decoder/MatMul_4)
Transpose (/decoder/Transpose_9)
Concat (/decoder/Concat_4)
GRU (/decoder/rnn_4/GRU)
LogSoftmax (/decoder/LogSoftmax_4)
ArgMax (/decoder/ArgMax_4)
Unsqueeze (/decoder/Unsqueeze_4)
Transpose (/decoder/Transpose_10)
Gather (/decoder/emb_5/Gather)
Expand (/decoder/attention_5/Expand)
Transpose (/decoder/attention_5/Transpose)
Concat (/decoder/attention_5/Concat)
MatMul (/decoder/attention/attn_5/MatMul)
Add (/decoder/attention/attn_5/Add)
Tanh (/decoder/attention_5/Tanh)
Softmax (/decoder/attention_5/Softmax)
MatMul (/decoder/MatMul_5)
Transpose (/decoder/Transpose_11)
Concat (/decoder/Concat_5)
GRU (/decoder/rnn_5/GRU)
LogSoftmax (/decoder/LogSoftmax_5)
ArgMax (/decoder/ArgMax_5)
Unsqueeze (/decoder/Unsqueeze_5)
Transpose (/decoder/Transpose_12)
Gather (/decoder/emb_6/Gather)
Expand (/decoder/attention_6/Expand)
Transpose (/decoder/attention_6/Transpose)
Concat (/decoder/attention_6/Concat)
MatMul (/decoder/attention/attn_6/MatMul)
Add (/decoder/attention/attn_6/Add)
Tanh (/decoder/attention_6/Tanh)
Softmax (/decoder/attention_6/Softmax)
MatMul (/decoder/MatMul_6)
Transpose (/decoder/Transpose_13)
Concat (/decoder/Concat_6)
GRU (/decoder/rnn_6/GRU)
LogSoftmax (/decoder/LogSoftmax_6)
ArgMax (/decoder/ArgMax_6)
Unsqueeze (/decoder/Unsqueeze_6)
Transpose (/decoder/Transpose_14)
Gather (/decoder/emb_7/Gather)
Expand (/decoder/attention_7/Expand)
Transpose (/decoder/attention_7/Transpose)
Concat (/decoder/attention_7/Concat)
MatMul (/decoder/attention/attn_7/MatMul)
Add (/decoder/attention/attn_7/Add)
Tanh (/decoder/attention_7/Tanh)
Softmax (/decoder/attention_7/Softmax)
MatMul (/decoder/MatMul_7)
Transpose (/decoder/Transpose_15)
Concat (/decoder/Concat_7)
GRU (/decoder/rnn_7/GRU)
LogSoftmax (/decoder/LogSoftmax_7)
ArgMax (/decoder/ArgMax_7)
Unsqueeze (/decoder/Unsqueeze_7)
Transpose (/decoder/Transpose_16)
Gather (/decoder/emb_8/Gather)
Expand (/decoder/attention_8/Expand)
Transpose (/decoder/attention_8/Transpose)
Concat (/decoder/attention_8/Concat)
MatMul (/decoder/attention/attn_8/MatMul)
Add (/decoder/attention/attn_8/Add)
Tanh (/decoder/attention_8/Tanh)
Softmax (/decoder/attention_8/Softmax)
MatMul (/decoder/MatMul_8)
Transpose (/decoder/Transpose_17)
Concat (/decoder/Concat_8)
GRU (/decoder/rnn_8/GRU)
LogSoftmax (/decoder/LogSoftmax_8)
ArgMax (/decoder/ArgMax_8)
Unsqueeze (/decoder/Unsqueeze_8)
Transpose (/decoder/Transpose_18)
Gather (/decoder/emb_9/Gather)
Expand (/decoder/attention_9/Expand)
Transpose (/decoder/attention_9/Transpose)
Concat (/decoder/attention_9/Concat)
MatMul (/decoder/attention/attn_9/MatMul)
Add (/decoder/attention/attn_9/Add)
Tanh (/decoder/attention_9/Tanh)
Softmax (/decoder/attention_9/Softmax)
MatMul (/decoder/MatMul_9)
Transpose (/decoder/Transpose_19)
Concat (/decoder/Concat_9)
GRU (/decoder/rnn_9/GRU)
LogSoftmax (/decoder/LogSoftmax_9)
Unsqueeze (/decoder/Unsqueeze_9)
Unsqueeze (/decoder/Unsqueeze_10)
Unsqueeze (/decoder/Unsqueeze_11)
Unsqueeze (/decoder/Unsqueeze_12)
Unsqueeze (/decoder/Unsqueeze_13)
Unsqueeze (/decoder/Unsqueeze_14)
Unsqueeze (/decoder/Unsqueeze_15)
Unsqueeze (/decoder/Unsqueeze_16)
Unsqueeze (/decoder/Unsqueeze_17)
Unsqueeze (/decoder/Unsqueeze_18)
Concat (/decoder/Concat_10)
Transpose (/decoder/Transpose_20)
FusedMatMul (MatMul_With_Transpose)
FusedMatMul (MatMul_With_Transpose_token_18)
FusedMatMul (MatMul_With_Transpose_token_19)
FusedMatMul (MatMul_With_Transpose_token_20)
FusedMatMul (MatMul_With_Transpose_token_21)
FusedMatMul (MatMul_With_Transpose_token_22)
FusedMatMul (MatMul_With_Transpose_token_23)
FusedMatMul (MatMul_With_Transpose_token_24)
FusedMatMul (MatMul_With_Transpose_token_25)

As you can see, the difference is only in the last 8 lines (the FusedMatMul token IDs differ). Hope it helps...
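
For reference, the node-placement dump above comes from verbose session logging; a hedged Python sketch of enabling it (the model path is a placeholder; in C++ the equivalent is creating the Ort::Env with ORT_LOGGING_LEVEL_VERBOSE):

import onnxruntime as rt

so = rt.SessionOptions()
so.log_severity_level = 0      # VERBOSE: prints graph partitioning / node placement
so.log_verbosity_level = 1
sess = rt.InferenceSession("attention_ocr.onnx", so,
                           providers=["OpenVINOExecutionProvider"])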


To reproduce

See the description above.

Urgency

Urgent

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.17.1 release

ONNX Runtime API

C++

Architecture

X64

Execution Provider

OpenVINO

Execution Provider Library Version

2023.3

debugmenot commented 5 months ago

Just to note: the issue looks independent of the OpenVINO version - I experimented with several. I also built everything from scratch many times on different systems, with the same results.

debugmenot commented 5 months ago

Update: 1.14.1 also works, but performance is about 10-15% lower. 1.15 and higher are affected by the issue.

jywu-msft commented 5 months ago

+@sfatimar, @preetha-intel

debugmenot commented 5 months ago

any update?

sfatimar commented 5 months ago

Can we have access to the model? It seems 11 subgraphs are being formed and 167 nodes are being placed on the CPU EP, but it is hard to debug without the model.

debugmenot commented 5 months ago

@sfatimar

dumbmodel.onnx.zip - the dumb model is attached. To visualize the issue, here is a small log of a test run:

Here I'm iterating over the same image. All results except the first are broken.

f1race@build_server_nvidia:/opt/ort_dev$ ./test --image images/test/dumb100x100text.jpg
[info] Wellcome to first 0.0.1
[info] Available provider: CUDAExecutionProvider
[info] Available provider: OpenVINOExecutionProvider
[info] Available provider: XnnpackExecutionProvider
[info] Available provider: CPUExecutionProvider
[-] Selected provider: OpenVINOExecutionProvider
Input 0 : name=input.1
Output 0 : name=1389
[-] Output tensor element count: 390
[info] CHAR: A, CLASS: 13, CONF: -0.11442014
[info] CHAR: A, CLASS: 13, CONF: -0.5359584
[info] CHAR: 4, CLASS: 7, CONF: -2.073846
[info] CHAR: 6, CLASS: 9, CONF: -2.010087
[info] CHAR: 6, CLASS: 9, CONF: -1.8180711
[info] CHAR: D, CLASS: 16, CONF: -2.448421
[info] CHAR: S, CLASS: 31, CONF: -2.7345552
[info] CHAR: , CLASS: 2, CONF: -0.009441723
[info] CHAR: , CLASS: 2, CONF: -0.05160664
[info] CHAR: , CLASS: 2, CONF: -0.097647004

[-] Output tensor element count: 390
[info] CHAR: A, CLASS: 13, CONF: -0.11442014
[info] CHAR: B, CLASS: 14, CONF: -2.1106374
[info] CHAR: , CLASS: 2, CONF: -2.3829944
[info] CHAR: , CLASS: 0, CONF: -0.31160322
[info] CHAR: , CLASS: 0, CONF: -2.2568073
[info] CHAR: , CLASS: 0, CONF: -2.5611315
[info] CHAR: , CLASS: 0, CONF: -2.2948604
[info] CHAR: , CLASS: 0, CONF: -2.2516015
[info] CHAR: , CLASS: 0, CONF: -2.5611215
[info] CHAR: , CLASS: 0, CONF: -2.294854

[-] Output tensor element count: 390
[info] CHAR: A, CLASS: 13, CONF: -0.11442014
[info] CHAR: B, CLASS: 14, CONF: -2.1106374
[info] CHAR: , CLASS: 2, CONF: -2.3829944
[info] CHAR: , CLASS: 0, CONF: -0.31160322
[info] CHAR: , CLASS: 0, CONF: -2.2568073
[info] CHAR: , CLASS: 0, CONF: -2.5611315
[info] CHAR: , CLASS: 0, CONF: -2.2948604
[info] CHAR: , CLASS: 0, CONF: -2.2516015
[info] CHAR: , CLASS: 0, CONF: -2.5611215
[info] CHAR: , CLASS: 0, CONF: -2.294854

debugmenot commented 5 months ago

Once again, this happens ONLY with the OpenVINO EP, with ONNX Runtime >= 1.15 and any version of OpenVINO.

No issues with ONNX Runtime 1.13.1 and 1.14.1 (lower versions not tested).

The CPU, XNNPACK and CUDA EPs work well with this model and the same inference code on any ORT version, including the latest one.
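
Since the CPU EP stays correct, a numerical cross-check against it is one way to expose the regression; a minimal Python sketch (hedged: model path and input shape are placeholders, not from this thread):

import numpy as np
import onnxruntime as rt

MODEL = "attention_ocr.onnx"   # placeholder path
cpu = rt.InferenceSession(MODEL, providers=["CPUExecutionProvider"])
ov = rt.InferenceSession(MODEL, providers=["OpenVINOExecutionProvider"])
name = cpu.get_inputs()[0].name

for i in range(3):
    x = np.random.rand(1, 3, 100, 100).astype(np.float32)  # placeholder shape
    ref = cpu.run(None, {name: x})[0]
    got = ov.run(None, {name: x})[0]
    # A growing mismatch on the 2nd+ iteration would match the reported symptom.
    print(i, "max abs diff vs CPU EP:", np.abs(ref - got).max())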

henxing commented 5 months ago

I'm seeing a similar issue that occurs in Python with onnxruntime-openvino version 1.16.0. I am currently stuck on Python 3.8, so I cannot test 1.17, but see the following test script with three very simple models, which shows how one of them (BrokenModel) generates different results from PyTorch when using onnxruntime. If this behavior is different enough from this issue, I'm happy to open another issue to track it.

import numpy as np
import onnxruntime as rt
import torch
from torch import nn

# Per-sample mean over (C, H, W); this is the model that disagrees with PyTorch under the OpenVINO EP.
class BrokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.conv_2 = nn.Conv2d(64, 1, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        x = self.conv_1(x)
        output = self.conv_2(x)
        return output.mean(dim=(1, 2, 3))

# Same convolutions, but additionally returns the mean over the whole batch.
class BatchMeanModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.conv_2 = nn.Conv2d(64, 1, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        x = self.conv_1(x)
        output = self.conv_2(x)
        return output.mean(dim=(1, 2, 3)), output.mean()

# Same structure with only 3 intermediate channels.
class FewChannelModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Conv2d(3, 3, kernel_size=3, stride=1, padding=1)
        self.conv_2 = nn.Conv2d(3, 1, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        x = self.conv_1(x)
        output = self.conv_2(x)
        return output.mean(dim=(1, 2, 3))

def run_model_pytorch_onnxruntime(arch, path):
    model = arch()
    model.eval()
    print("=" * 80)
    print(model)

    data = torch.ones(2, 3, 224, 224)
    data[0] *= 0

    print("Torch:")
    for _ in range(2):
        result = model(data)
        print(result)
    print()

    torch.onnx.export(
        model,
        data,
        path,
        input_names=["input"],
        output_names=["output"],
        export_params=True,
        dynamic_axes={name: {0: "batch_size"} for name in ("input", "output")},
        verbose=False,
    )

    sess_options = rt.SessionOptions()
    sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_DISABLE_ALL

    print("Onnxruntime:")
    rt_sess = rt.InferenceSession(
        path, sess_options, providers=["OpenVINOExecutionProvider"], provider_options=[{"device_id": "GPU"}]
    )
    for _ in range(2):
        outputs = rt_sess.run(None, {"input": data.numpy()})
        print(outputs)
    print()

if __name__ == "__main__":
    run_model_pytorch_onnxruntime(BrokenModel, "broken_model.onnx")
    print()
    run_model_pytorch_onnxruntime(BatchMeanModel, "batch_mean_model.onnx")
    print()
    run_model_pytorch_onnxruntime(FewChannelModel, "few_channel_model.onnx")

You'll need to install torch, onnxruntime-openvino, and numpy to run this script.
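
Before running it, it may also help to confirm which build is installed and that the OpenVINO EP is actually registered (these are standard onnxruntime Python APIs):

import onnxruntime as rt

print("onnxruntime version:", rt.__version__)
print("available providers:", rt.get_available_providers())
assert "OpenVINOExecutionProvider" in rt.get_available_providers()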

debugmenot commented 4 months ago

@sfatimar, Hi! Any updates? I've uploaded the model for bug investigation.

ankitm3k commented 4 months ago

Hi @debugmenot, I have tested the script suggested by @henxing using OpenVINO Toolkit v2024.1 (w_openvino_toolkit_windows_2024.1.0.dev20240405_x86_64) and OVEP v1.18.0 (this version update is now merged and available on the latest main of the microsoft/onnxruntime repo) on a Windows machine. I ran inference for 5 iterations; the PyTorch vs. ORT OpenVINO EP results were the same for every iteration, and the OVEP results matched the torch results to about 3 decimal places. Please find the run log below:

================================================================================
BrokenModel(
  (conv_1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_2): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1))
)
Torch:
tensor([-0.1026, -0.0569], grad_fn=)
tensor([-0.1026, -0.0569], grad_fn=)

Onnxruntime:
[array([-0.1026001 , -0.05670166], dtype=float32)]
[array([-0.1026001 , -0.05670166], dtype=float32)]
[array([-0.1026001 , -0.05670166], dtype=float32)]
[array([-0.1026001 , -0.05670166], dtype=float32)]
[array([-0.1026001 , -0.05670166], dtype=float32)]

================================================================================
BatchMeanModel(
  (conv_1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_2): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1))
)
Torch:
(tensor([0.1573, 0.1438], grad_fn=), tensor(0.1506, grad_fn=))
(tensor([0.1573, 0.1438], grad_fn=), tensor(0.1506, grad_fn=))

Onnxruntime:
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]

================================================================================
FewChannelModel(
  (conv_1): Conv2d(3, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_2): Conv2d(3, 1, kernel_size=(1, 1), stride=(1, 1))
)
Torch:
tensor([-0.1036, -0.1638], grad_fn=)
tensor([-0.1036, -0.1638], grad_fn=)

Onnxruntime:
[array([-0.10357666, -0.16418457], dtype=float32)]
[array([-0.10357666, -0.16418457], dtype=float32)]
[array([-0.10357666, -0.16418457], dtype=float32)]
[array([-0.10357666, -0.16418457], dtype=float32)]
[array([-0.10357666, -0.16418457], dtype=float32)]

ankitm3k commented 4 months ago

(quoting @debugmenot's earlier comment above, with the dumbmodel.onnx.zip attachment and the test-run log)

We are investigating the issues seen when running your model with the OpenVINO EP.

debugmenot commented 4 months ago

Hi @ankitm3k! Did you confirm the bug? If so, is there an ETA for a patch?

ankitm3k commented 4 months ago

Hi @debugmenot, I have investigated the issues with your ONNX model file, i.e. dumbmodel.onnx. When performing inference with it, the graph was split into many subgraph partitions, so most of the nodes fell back to the CPU EP. This causes lower performance, as the model graph does not run entirely on the OpenVINO EP. The above fix enables the whole model to be supported on the OpenVINOExecutionProvider and improves performance for your model.

I recommend using the latest OpenVINO Toolkit v2024.1 along with the above patch. I have also checked the tensor outputs across multiple inference iterations over the same input data, and for my build they were consistent with, and as accurate as, the first inference results.
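
For completeness, a hedged Python sketch of pinning the OpenVINO EP target device explicitly when re-testing (the device_type option key follows the OpenVINO EP documentation; accepted values depend on the toolkit version):

import onnxruntime as rt

sess = rt.InferenceSession(
    "dumbmodel.onnx",
    providers=["OpenVINOExecutionProvider"],
    provider_options=[{"device_type": "CPU"}],  # or "GPU", depending on the target
)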

debugmenot commented 1 month ago

@ankitm3k Hi. Update: the issue is still not fixed... I just checked. Performance is better now, but:

ONNX Runtime 1.14.1 + OV:
[02:42:09.361] [I] [74706] [4] [car] HOMEP: T454BE199
[02:42:11.675] [I] [74706] [6] [car] HOMEP: X212EX197
[02:42:14.785] [I] [74706] [13] [car] HOMEP: O353XM199
[02:42:16.420] [I] [74706] [16] [car] HOMEP: H002XC199
[02:42:17.709] [I] [74706] [18] [car] HOMEP: P346AB197
[02:42:18.525] [I] [74706] [20] [car] HOMEP: A001OT197
[02:42:19.709] [I] [74706] [21] [car] HOMEP: E072MK199
[02:42:21.144] [I] [74706] [23] [car] HOMEP: B797HK197
[02:42:22.028] [I] [74706] [25] [car] HOMEP: O369CX177
[02:42:24.947] [I] [74706] [30] [car] HOMEP: B410KA17
[02:42:25.968] [I] [74706] [33] [car] HOMEP: K558AT197
[02:42:36.141] [I] [74706] [52] [car] HOMEP: C159XT199
[02:42:41.442] [I] [74706] [60] [car] HOMEP: O905OT190
[02:42:43.093] [I] [74706] [63] [car] HOMEP: Y902OA190
[02:42:46.568] [I] [74706] [68] [car] HOMEP: E159YY150
[02:42:47.770] [I] [74706] [71] [car] HOMEP: M181YA197

ONNX Runtime 1.18.1 + OpenVINO EP 2024.3 + your GRU op patch:
[01:58:34.495] [I] [7342] [6] [car] HOMEP: T454BE199
[01:58:36.900] [I] [7342] [10] [car] HOMEP: X2XXX22X22
[01:58:39.927] [I] [7342] [20] [car] HOMEP: O333O33O33
[01:58:41.637] [I] [7342] [23] [car] HOMEP: H000H00H00
[01:58:42.832] [I] [7342] [27] [car] HOMEP: P333P33P33
[01:58:43.725] [I] [7342] [29] [car] HOMEP: A000A00A00
[01:58:44.849] [I] [7342] [30] [car] HOMEP: E000E00E00
[01:58:46.330] [I] [7342] [34] [car] HOMEP: B777B77B77
[01:58:51.137] [I] [7342] [51] [car] HOMEP: K555K55K55
[01:59:01.337] [I] [7342] [63] [car] HOMEP: C1CCC11C11
[01:59:06.587] [I] [7342] [70] [car] HOMEP: O999O99O99
[01:59:08.301] [I] [7342] [74] [car] HOMEP: Y999Y99Y99
[01:59:11.775] [I] [7342] [80] [car] HOMEP: E1EEE11E11
[01:59:13.006] [I] [7342] [85] [car] HOMEP: M111M11M11

I can prepare a test project (source + model + image) for you. Can you share your email, please?

debugmenot commented 1 month ago

But with the patch the behaviour is slightly different - the results after the first one differ a little from the results without the patch, though they look roughly the same (still incorrect)... Is a dirty fix possible, e.g. changing the supported ops in data_ops.cc back to the 1.14.1 version, or something like that? How would I do this properly? I can't use legacy ORT versions in the new build of our software because of API incompatibility.

debugmenot commented 1 month ago

@ankitm3k I've finally found the issue, or at least WHERE it is EXACTLY. If the line

{"Unsqueeze", V_2020_4, {"CPU", "GPU"}},

is commented out in data_ops.cc, everything works as expected :) The issue needs investigation.

It's strange, because Unsqueeze is defined in exactly the same way as in the 1.14.1 and 1.13.1 versions...