openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug] Could not run model optimizer with ONNX model #13431

Closed OscarPedaVendere closed 1 year ago

OscarPedaVendere commented 1 year ago
System information (version)
Detailed description

Hello. I'm running dlstreamer/dlstreamer on a fitlet2 device (https://fit-iot.com/web/products/fitlet2/) running Ubuntu 22.04. My goal is to take my custom model, exported from TensorFlow (Keras backend) to ONNX format, and integrate it into the dlstreamer framework so I can compile custom C++ code that uses gstreamer plugins and runs inference on this custom model. The model is exported directly from TensorFlow 1.15.5 running on an Nvidia GPU after training for about 62 epochs. There is a problem when parsing a certain convolution layer: it says the window size after dilation is larger than the data shape after padding. Could you provide some help? What should I do now? Should I keep the model and change the export settings, or should I use another exporting technique such as a frozen model? Thank you in advance.

mo -w engine_file.onnx --input_shape [1,3,48,96] --input image_input --output tf_op_layer_ArgMax
Model Optimizer arguments:
Common parameters:
    - Path to the Input Model:  /home/engine_file.onnx
    - Path for generated IR:    /home/.
    - IR output name:   engine_file
    - Log level:    ERROR
    - Batch:    Not specified, inherited from the model
    - Input layers:     image_input
    - Output layers:    tf_op_layer_ArgMax
    - Input shapes:     [1,3,48,96]
    - Source layout:    Not specified
    - Target layout:    Not specified
    - Layout:   Not specified
    - Mean values:  Not specified
    - Scale values:     Not specified
    - Scale factor:     Not specified
    - Precision of IR:  FP32
    - Enable fusing:    True
    - User transformations:     Not specified
    - Reverse input channels:   False
    - Enable IR generation for fixed input shape:   False
    - Use the transformations config file:  None
Advanced parameters:
    - Force the usage of legacy Frontend of Model Optimizer for model conversion into IR:   False
    - Force the usage of new Frontend of Model Optimizer for model conversion into IR:  False
OpenVINO runtime found in:  /usr/local/lib/python3.8/dist-packages/openvino
OpenVINO runtime version:   2022.1.0-7019-cdb9bec7210-releases/2022/1
Model Optimizer version:    2022.1.0-7019-cdb9bec7210-releases/2022/1
[ ERROR ]  -------------------------------------------------
[ ERROR ]  ----------------- INTERNAL ERROR ----------------
[ ERROR ]  Unexpected exception happened.
[ ERROR ]  Please contact Model Optimizer developers and forward the following information:
[ ERROR ]  While validating ONNX node '<Node(Conv): res3a_branch2a>':
Check 'window_dilated_dim <= data_padded_dilated_dim' failed at core/shape_inference/include/convolution_shape_inference.hpp:209:
While validating node 'v1::Convolution Convolution_460 (re_lu_4/Relu:0[0]:f32{1,64,1,1}, res3a_branch2a_W_new[0]:f32{128,64,3,3}) -> (dynamic...)' with friendly_name 'Convolution_460':
Window after dilation has dimension (dim: 3) larger than the data shape after padding (dim: 2) at axis 0.

[ ERROR ]  Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/openvino/tools/mo/main.py", line 533, in main
    ret_code = driver(argv)
  File "/usr/local/lib/python3.8/dist-packages/openvino/tools/mo/main.py", line 489, in driver
    graph, ngraph_function = prepare_ir(argv)
  File "/usr/local/lib/python3.8/dist-packages/openvino/tools/mo/main.py", line 394, in prepare_ir
    ngraph_function = moc_pipeline(argv, moc_front_end)
  File "/usr/local/lib/python3.8/dist-packages/openvino/tools/mo/moc_frontend/pipeline.py", line 147, in moc_pipeline
    ngraph_function = moc_front_end.convert(input_model)
RuntimeError: While validating ONNX node '<Node(Conv): res3a_branch2a>':
Check 'window_dilated_dim <= data_padded_dilated_dim' failed at core/shape_inference/include/convolution_shape_inference.hpp:209:
While validating node 'v1::Convolution Convolution_460 (re_lu_4/Relu:0[0]:f32{1,64,1,1}, res3a_branch2a_W_new[0]:f32{128,64,3,3}) -> (dynamic...)' with friendly_name 'Convolution_460':
Window after dilation has dimension (dim: 3) larger than the data shape after padding (dim: 2) at axis 0.

[ ERROR ]  ---------------- END OF BUG REPORT --------------
[ ERROR ]  -------------------------------------------------
Steps to reproduce

Here is the link to the model: Model

Issue submission checklist
zulkifli-halim commented 1 year ago

Hi @OscarPedaVendere, I replicated the issue using your model and got the same error as yours. I also ran your ONNX model with benchmark_app and received this error:

RuntimeError: While validating ONNX node '<Node(Conv): res3a_branch2a>':
Check 'window_dilated_dim <= data_padded_dilated_dim' failed at C:\j\workspace\private-ci\ie\build-windows-vs2019@3\b\repos\openvino\src\core\shape_inference\include\convolution_shape_inference.hpp:217:
While validating node 'v1::Convolution Convolution_460 (re_lu_4/Relu:0[0]:f32{1,64,1,1}, res3a_branch2a_W_new[0]:f32{128,64,3,3}) -> (dynamic...)' with friendly_name 'Convolution_460':
Window after dilation has dimension (dim: 3) larger than the data shape after padding (dim: 2) at axis 0.

I checked your ONNX model using Netron and it looks crumpled: image

Can you share the source of the original model? We will take a look at this to see if this MO error can be solved and if this model is supported.

tomdol commented 1 year ago

@zulkifli-halim please create a jira ticket once you've received the model and assign it to me

OscarPedaVendere commented 1 year ago

Hi. Thank you in advance for your support. I don't know why the model looks crumpled to you; I can open it successfully with an ONNX viewer web app: link to the image

However, here are two of the scripts that build the network (the network structure itself and the model builder class):

base_model.py.txt model_builder.py.txt

Thank you again for your support.

hbalasu1 commented 1 year ago

Hi @tomdol, I have created a Jira ticket for this case:

Ref : 94180

OscarPedaVendere commented 1 year ago

Thank you very much. Just let me know if you have any updates on this any time soon.

OscarPedaVendere commented 1 year ago

Sorry to bother you again. Any updates on this?

mlukasze commented 1 year ago

Hey @OscarPedaVendere, sorry you have to wait. It's planned to be fixed, but unfortunately it's queued due to our current roadmap and schedule. We will get to it for sure. Stay tuned.

OscarPedaVendere commented 1 year ago

Ok thank you for your patience and your work. Looking forward to it :)

mbencer commented 1 year ago

I've tested the model using onnxruntime and it failed with:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from /home/mbencer/models/engine_file.onnx failed:This is an invalid model. In Node, ("tf_op_layer_Sum/Sum_reduce_min", ReduceSum, "", -1) : ("image_input": tensor(float),) -> ("tf_op_layer_Sum/Sum:0",) , Error Unrecognized attribute: axes for operator ReduceSum 

also onnx checker:

import onnx
path = "/home/mbencer/models/engine_file.onnx"
onnx.checker.check_model(path) 

failed for this model with:

File "/home/mbencer/venv/ov/lib/python3.8/site-packages/onnx/checker.py", line 97, in check_model
C.check_model_path(model)
onnx.onnx_cpp2py_export.checker.ValidationError: Unrecognized attribute: axes for operator ReduceSum==> Context: Bad node spec for node. Name: tf_op_layer_Sum/Sum_reduce_min OpType: ReduceSum 

The reason is that the axes are passed to ReduceSum as an attribute (which was the approach up to opset 11 - https://onnx.ai/onnx/operators/onnx__ReduceSum.html#reducesum-11), while the model is produced with opset 13 (where the ReduceSum axes are passed as an input - https://onnx.ai/onnx/operators/onnx__ReduceSum.html#reducesum-13).
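
For illustration, a minimal sketch (not taken from the thread or the model; the tensor names are made up) of how the two opsets represent the same ReduceSum, built with onnx.helper:

import onnx
from onnx import helper, TensorProto

# opset 11 and earlier: the axes travel as a node attribute
reduce_sum_v11 = helper.make_node(
    "ReduceSum", inputs=["x"], outputs=["y"], axes=[1], keepdims=1
)

# opset 13: the axes travel as a second input tensor (here a constant initializer)
axes_init = helper.make_tensor("axes", TensorProto.INT64, dims=[1], vals=[1])
reduce_sum_v13 = helper.make_node(
    "ReduceSum", inputs=["x", "axes"], outputs=["y"], keepdims=1
)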

@OscarPedaVendere Could you provide the script used for the TensorFlow-to-ONNX conversion?

mbencer commented 1 year ago

@OscarPedaVendere The second problem (based on my experiments, the last one) is the use of the HardSigmoid activation function by the LSTM: image

We can add support for the new activation function in our LSTMSequence core op (currently the supported activation functions are sigmoid, tanh, tanh), but it may not happen very quickly (it's a new feature), so please consider whether a different activation function is applicable here (at least as a temporary solution).
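
As a sketch of that workaround (assuming a standard Keras LSTM layer; the unit count is a placeholder, not taken from the user's model), the recurrent activation can be set explicitly so the exporter does not emit HardSigmoid:

from keras.layers import LSTM

# standalone Keras 2.2.x defaults recurrent_activation to 'hard_sigmoid';
# forcing 'sigmoid' keeps the exported ONNX LSTM within OpenVINO's supported set
lstm = LSTM(units=256,
            activation="tanh",
            recurrent_activation="sigmoid",
            return_sequences=True)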

OscarPedaVendere commented 1 year ago

@mbencer Thank you for the reply. I guess the problem with the LSTM can be solved one way or another, even though it could be a major issue, but I'll try to figure it out by myself.

Both problems seem to arise from the keras2onnx library, which is basically what I use to export the model. For the HardSigmoid, I don't think (but I'll check more carefully soon) that I specified a HardSigmoid function for the LSTM export. I guess it's the default behaviour?

For the ReduceSum problem it's the same story. Here's part of the code devoted to the export: onnxexport.txt

I don't know how to fix this... maybe by changing the keras2onnx library version, hoping that this doesn't cause too many dependency problems.

Am I covering everything you need with these replies? Thank you in advance.

mbencer commented 1 year ago

@OscarPedaVendere Thank you for the response. I think I have all the needed information (at least for now). I'll try to export the model from Keras on my side (also checking whether the version matters).

mbencer commented 1 year ago

@OscarPedaVendere Could you also provide me the model saved from Keras? In your model_builder.py script I don't have the parameters from the experiment_spec arguments, nor models.backbones and models.base_model.

OscarPedaVendere commented 1 year ago

@mbencer Thank you for your replies.

This model is part of a larger library that would not make sense to export as a whole. I've created a zip and ran a quick check that everything needed is there. For me it's not feasible at the moment to extract and check the whole library, but I guess this should be all right anyway. Here's the zip.

openvino_bugfix.zip

The experiment_spec is a class that you can initialize with the collections.namedtuple() function after reading the contents of the specs/arabic_spec.txt file. That should be it; let me know if you get any errors while loading the spec.
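
A hypothetical sketch of that kind of loader (the spec file layout and field names here are assumptions, not taken from the actual arabic_spec.txt):

import collections

def load_experiment_spec(path="specs/arabic_spec.txt"):
    # assume a flat "key: value" text file; build a namedtuple from its entries
    entries = {}
    with open(path) as spec_file:
        for line in spec_file:
            if ":" in line:
                key, value = line.split(":", 1)
                entries[key.strip()] = value.strip()
    ExperimentSpec = collections.namedtuple("ExperimentSpec", entries.keys())
    return ExperimentSpec(**entries)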

mbencer commented 1 year ago

Hi @OscarPedaVendere, I've reproduced the conversion on my side and I think I have a solution. When I explicitly set the target opset version to 12, like:

keras_to_onnx(eval_model, "model.onnx", target_opset=12)

everything works. I've tested it with pip install tensorflow==1.15.5 keras==2.2.4 keras2onnx==1.7.0 onnx using Python 3.7.

In that opset version the axes can be passed to ReduceSum as an attribute, and the LSTM is created with Sigmoid instead of the unsupported HardSigmoid.

Confirmed with direct inference via benchmark_app (./benchmark_app -m model.onnx --shape [1,3,48,96]) and with the Model Optimizer (mo -w model.onnx --input_shape [1,3,48,96] --input image_input --output tf_op_layer_ArgMax).
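
For reference, a minimal sketch of what that export might look like with the plain keras2onnx API (keras_to_onnx above appears to be a project-specific helper; eval_model and the output file name here are placeholders):

import keras2onnx
import onnx

# convert the in-memory Keras model at opset 12 and serialize it to disk
onnx_model = keras2onnx.convert_keras(eval_model, eval_model.name, target_opset=12)
onnx.save_model(onnx_model, "model.onnx")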

Please let me know if such solution works for you.

OscarPedaVendere commented 1 year ago

@mbencer OMG, thank you so much! It works! So all I had to do was set the target opset to 12; I didn't know that was an option. Thank you so much.

So now that I have generated the IR I can use it for inference on dlstreamer, right? Are the weights of the model included as well, as I exported it from a checkpoint? If not, how do I train the model in openvino?

brmarkus commented 1 year ago

OpenVINO and DL-Streamer can use ONNX files directly for inference - just provide the path and filename of the ONNX file (as in the example above, ./benchmark_app -m model.onnx --shape [1,3,48,96]).

Using the Model Optimizer (MO), the ONNX file can be converted to IR format, consisting of an XML file (network topology) and a BIN file (weights).

So yes, you can use the IR-format files with DL-Streamer (or the ONNX file directly) - you just need to provide a model-proc JSON file.

mbencer commented 1 year ago

So now that I have generated the IR I can use it for inference on dlstreamer, right? Are the weights of the model included as well, as I exported it from a checkpoint? If not, how do I train the model in openvino?

Yes, the weights are saved in the produced ONNX model and also in the IR format (where the *.xml contains the topology and the *.bin the weights).
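
As a quick sanity check that the IR (including its weights) loads and runs, a minimal sketch with the OpenVINO 2022 Python API (the file names and the zero-filled input are placeholders following the thread's example, not a definitive recipe):

import numpy as np
from openvino.runtime import Core

core = Core()
# reading engine_file.xml implicitly pulls the weights from engine_file.bin
model = core.read_model("engine_file.xml")
compiled = core.compile_model(model, "CPU")

request = compiled.create_infer_request()
dummy = np.zeros((1, 3, 48, 96), dtype=np.float32)
results = request.infer({"image_input": dummy})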

@OscarPedaVendere Please confirm if everything (especially this DL-Streamer part) works for you and if we can close the issue.

OscarPedaVendere commented 1 year ago

Thank you @mbencer @brmarkus for your info. Indeed the Model Optimizer part works, so I am able to correctly export an IR format from the custom ONNX model. The problem now is that DLStreamer perhaps doesn't accept it? It says Unsupported activation function; I don't know whether it has to do with the model itself or with my setup. I tried the OpenVINO samples and it all works. Should I close this issue and open one in the dlstreamer GitHub?

Output is this: log_output.txt

I am running the model via gst_launch on an image (I can't build a proper pipeline right now). The intended pipeline is: RTSP input -> decode -> detect -> detect -> classify -> multiimagesink. My test pipeline is: image input -> convert to video -> classify -> fakesink, which should also work, but it doesn't.

The classifier is my converted IR model. (I don't know whether I should use gvaclassify or gvadetect right now, but I tried both and it doesn't make a difference.)

Thank you in advance for the patience, support and help you gave me in this thread.

OscarPedaVendere commented 1 year ago

Also, on a different machine, openvino_2022.2.0.7713 C++ benchmark_app built on Ubuntu 20.04 gives me Unsupported activation function.

./benchmark_app -m ./engine_file.onnx -shape [1,3,48,96]
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] OpenVINO Runtime version ......... 2022.2.0
[ INFO ] Build ........... 2022.2.0-7713-af16ea1d79a-releases/2022/2
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] openvino_intel_cpu_plugin version ......... 2022.2.0
[ INFO ] Build ........... 2022.2.0-7713-af16ea1d79a-releases/2022/2
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to THROUGHPUT.
[Step 4/11] Reading network files
[ INFO ] Loading network files
[ INFO ] Read network took 202.52 ms
[ INFO ] Original network I/O parameters:
Network inputs:
    image_input (node: image_input) : f32 / [...]
Network outputs:
    tf_op_layer_ArgMax (node: tf_op_layer_ArgMax) : i64 / [...]
    tf_op_layer_Max (node: tf_op_layer_Max) : f32 / [...]
[Step 5/11] Resizing network to match image sizes and given batch
[ WARNING ] image_input: layout is not set explicitly, so it is defaulted to NCHW. It is STRONGLY recommended to set layout manually to avoid further issues.
[ INFO ] Reshaping network: 'image_input': {1,3,48,96}
[ INFO ] Reshape network took 5.85 ms
[Step 6/11] Configuring input of the model
[ INFO ] Network batch size: 1
Network inputs:
    image_input (node: image_input) : u8 / [N,C,H,W]
Network outputs:
    tf_op_layer_ArgMax (node: tf_op_layer_ArgMax) : i64 / [...]
    tf_op_layer_Max (node: tf_op_layer_Max) : f32 / [...]
[Step 7/11] Loading the model to the device
[ ERROR ] Unsupported activation function

mbencer commented 1 year ago

@OscarPedaVendere Are you sure you are using the updated version of the model (with target opset 12)? I've tested the model both on the current master and on the af16ea1d79 version (from your message) and it works for me. Could you upload the model from the current conversion?

OscarPedaVendere commented 1 year ago

Yes, I can confirm that the model exported with target_opset=12 is correctly processed by the Model Optimizer, but it fails with benchmark_app on both the af16ea1d79 OpenVINO build and the dlstreamer/dlstreamer:devel Docker image on the fitlet2.

I successfully generated the .xml and .bin files, but I wasn't able to use them for inference with benchmark_app or as a GVADetect gstreamer plugin in dlstreamer.

Link to the model

Python 3.6.9 keras2onnx==1.7.0 tensorflow==1.15.5 keras==2.2.4

mbencer commented 1 year ago

@OscarPedaVendere I'm confirming that your exported model uses HardSigmoid as the activation function, while my version uses Sigmoid. I believe that is the result of keras==2.2.4 being used in my case. Could you try again with this Keras version?

OscarPedaVendere commented 1 year ago

@mbencer I've indeed checked that the exported model has the HardSigmoid activation function, but I'm exporting the model with target_opset=12 and I have the same library versions as yours, except for onnx which is 1.8.0 and the Python version which is 3.6.9. I'm referring to my local export environment, which is a whole library inside a Docker container. I've tried changing the onnx, keras and tensorflow versions but couldn't fix it that way either. The model is still correctly processed by the Model Optimizer though...

I don't know why, even with target opset 12, it still exports the HardSigmoid function and I can't manage to get it exported as Sigmoid. Any suggestions on this?

Thank you in advance

mbencer commented 1 year ago

@OscarPedaVendere I've also converted your model in a Docker environment with the following config:

FROM python:3.7

ENV DEBIAN_FRONTEND=noninteractive

RUN apt update && \
    apt install -y software-properties-common

RUN apt update && apt install -y \
        git \
        build-essential \
        cmake \
        libtbb2 \
        gdb \
        python3-pip && \
        pip3 install --upgrade pip

RUN pip3 install -U tensorflow==1.15.5 keras==2.2.4 keras2onnx==1.7.0 onnx

COPY openvino_bugfix openvino_bugfix

ENV TF_KERAS=1
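# (keras2onnx checks the TF_KERAS variable to decide whether to use tf.keras instead of standalone Keras)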
RUN python3 openvino_bugfix/model_builder.py

mbencer commented 1 year ago

@OscarPedaVendere Please let me know if this configuration works for you.

OscarPedaVendere commented 1 year ago

@mbencer I'm sorry, I've tried with your config and also tried another method that exports during training of the model, but it still doesn't work... Even with target_opset=12 it keeps exporting that HardSigmoid function instead of a plain Sigmoid.

I don't know how to replicate your setup any better, but I think a Dockerfile like yours is hard to get wrong. Could you provide more details? Any chance that HardSigmoid will be supported in the future anyway?

Thanks in advance

mbencer commented 1 year ago

Hi @OscarPedaVendere, I've checked the script again and you are right, HardSigmoid is still generated with that Dockerfile (I had some mess in my environment) - sorry for that. The install line should be:

RUN pip3 install -U numpy==1.18.5 tensorflow==2.2.0 keras==2.4.0 keras2onnx==1.7.0 onnx==1.12.0

I am uploading all the code to be sure that everything is the same (be aware, it's just a draft version) - export_onnx_model.zip

Please let me know if it now works correctly for you ;)

mbencer commented 1 year ago

@OscarPedaVendere Can we close the ticket?

OscarPedaVendere commented 1 year ago

@mbencer Yes, now it works! Thanks a lot for following up with me over these months! I can confirm it works both with the Model Optimizer and benchmark_app. In this case the activation is finally Sigmoid, and the output layer is a softmax in place of [tf_op_layer_Max, tf_op_layer_ArgMax], but I'll figure out by myself how to adapt the model to the OpenVINO environment. Thank you so much.