opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0
Can't parse layer for node='layer_normalization/strided_slice' of type='StridedSlice' #23872

Open charvey2718 opened 1 year ago

charvey2718 commented 1 year ago

System Information

C++ version details
OpenCV version: 4.6.0 (but I also checked that the issue exists in the latest OpenCV Python release, version 4.7.0, as below)
Operating System / Platform: Windows 10
Compiler & compiler version: GCC 8.1.0

Python version details
OpenCV python version: 4.7.0
Operating System / Platform: Windows 10
Python version: 3.9.0

Detailed description

According to here, cv::dnn::LayerNormLayer is a supported layer.

However, loading a Tensorflow net containing a LayerNormalization layer into OpenCV using cv::dnn::readNet (C++) or cv2.dnn.readNet (Python) generates the following error:

[ERROR:0@0.218] global tf_importer.cpp:3182 cv::dnn::dnn4_v20221220::`anonymous-namespace'::TFImporter::parseNode DNN/TF: Can't parse layer for node='generator/layer_normalization/strided_slice_3' of type='StridedSlice'. Exception: OpenCV(4.7.0) D:\a\opencv-python\opencv-python\opencv\modules\dnn\src\tensorflow\tf_importer.cpp:2822: error: (-2:Unspecified error) Input layer not found: generator/layer_normalization/Shape in function 'cv::dnn::dnn4_v20221220::`anonymous-namespace'::TFImporter::connect'

Traceback (most recent call last):
  File "MWE.py", line 36, in <module>
    model = cv2.dnn.readNet(os.getcwd() + os.sep + "mwe.pb");
cv2.error: OpenCV(4.7.0) D:\a\opencv-python\opencv-python\opencv\modules\dnn\src\tensorflow\tf_importer.cpp:2822: error: (-2:Unspecified error) Input layer not found: generator/layer_normalization/Shape in function 'cv::dnn::dnn4_v20221220::`anonymous-namespace'::TFImporter::connect'

The mention of strided slice led me to this, however, I was unable to get optimize_for_inference.py to work for me. It may be related, but I'm not sure it's the same issue.

Steps to reproduce

This Python code creates a Keras model (I'm using version 2.11.0) containing a Conv2D layer and a LayerNormalization layer, freezes and saves the model, and then attempts to load it into OpenCV, which generates the error.

If you change the 'outputs' parameter to 'convolved' as per the comment in the 'create_model(...)' function, then the problem goes away, indicating that LayerNormalization is the issue.

I actually want to load my pb model into the C++ version of OpenCV, but it was easier to demonstrate the issue in a MWE in one Python code. The same issue applies to both cv::dnn::readNet (C++) and cv2.dnn.readNet (Python). It all relates to the readNet() command in either version anyway.

import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2
import os
import cv2

def create_model():
    input = tf.keras.layers.Input(shape = (512, 512, 3), name = "gen_input_image")
    convolved = tf.keras.layers.Conv2D(128, kernel_size = 4)(input)
    normalized = tf.keras.layers.LayerNormalization()(convolved) # this line is the problem
    return tf.keras.Model(inputs = input, outputs = normalized, name = "generator") # change outputs to convolved and the problem goes away

def save_weights():
    model = create_model()
    # freeze graph
    full_model = tf.function(lambda inputs: model(inputs))
    full_model = full_model.get_concrete_function(tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))
    frozen_func = convert_variables_to_constants_v2(full_model)
    frozen_func.graph.as_graph_def()
    tf.io.write_graph(graph_or_graph_def = frozen_func.graph, logdir = os.getcwd(), name = "mwe.pb", as_text = False)

if __name__ == "__main__":
    save_weights()
    model = cv2.dnn.readNet(os.getcwd() + os.sep + "mwe.pb"); # generates the error

I've also uploaded the model here in mwe.zip, so you can skip the Tensorflow / Keras part of the code above, and then the relevant MWE code is just the last line.

fengyuentau commented 1 year ago

We do support TensorFlow models, but the support is not perfect. We do support layer norm, but currently we only parse it if it comes from an ONNX model.

Your exported model breaks layer norm into several operators, as shown in the following image:

image

So the problem lies in the implementation of each of those operators rather than in layer norm itself. If you are looking for a quick fix, I suggest you try using https://github.com/onnx/tensorflow-onnx to do a tf-to-onnx conversion, then load the converted ONNX model using OpenCV.
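To illustrate why an exporter can end up emitting a chain of small operators instead of a single layer norm node, here is a minimal pure-Python sketch of layer normalization written only with primitive steps. The function name `layer_norm` and the default `epsilon=1e-3` are assumptions chosen for the illustration; this is not the exact graph that Keras or tf2onnx produce, only the arithmetic that such a graph has to carry out.

```python
import math


def layer_norm(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature vector using only primitive operations.

    Hypothetical helper for illustration; gamma, beta and epsilon
    are assumed scalar parameters.
    """
    n = len(values)
    mean = sum(values) / n                          # reduction: mean
    centered = [v - mean for v in values]           # elementwise subtract
    variance = sum(c * c for c in centered) / n     # reduction: mean of squares
    inv_std = 1.0 / math.sqrt(variance + epsilon)   # add, sqrt, divide
    # elementwise scale and shift
    return [gamma * c * inv_std + beta for c in centered]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

Each commented step can become a separate node in an exported graph, so an importer has to handle every one of them correctly even if it already has a dedicated layer norm layer.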

Will update soon once I locate the bug.

fengyuentau commented 1 year ago

Hello @charvey2718, I located the problems, and there is a chance that they can be fixed. The problems are:

  1. Shape operator and its following operator Reshape are fused in a wrong way.

https://github.com/opencv/opencv/blob/c982be39248dc01a8c1f3f79fab22292e94c6b8d/modules/dnn/src/tensorflow/tf_graph_simplifier.cpp#L625-L636

https://github.com/opencv/opencv/blob/c982be39248dc01a8c1f3f79fab22292e94c6b8d/modules/dnn/src/tensorflow/tf_graph_simplifier.cpp#L785

Commenting out the above line leads to the following issue.

  1. Operator StridedSlice has an empty begin input. See https://github.com/opencv/opencv/issues/23500#issuecomment-1514430004 for details. Once this is fixed, another issue appears.

  2. Operator Shape is not supported because the current dnn engine lacks support for dynamic input shapes (dynamic H or W). What I see in your model is that only the batch is unknown, so there may be a chance that I can support parsing Shape in this limited condition.
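The "limited condition" mentioned for Shape can be pictured with a small sketch: if every dimension except the batch is known, the shape becomes fully static once a batch size is chosen. The helper `resolve_static_shape` below is hypothetical and written only for illustration; it is not how the OpenCV importer is implemented. The example shape assumes the Conv2D output of the MWE (a 4x4 kernel with default padding on a 512x512 input).

```python
def resolve_static_shape(shape, batch_size=1):
    """Return a fully static shape, or raise if a non-batch dim is unknown.

    `shape` uses None for unknown dimensions, as Keras does.
    Hypothetical helper, not part of OpenCV.
    """
    if not shape:
        raise ValueError("shape must have at least one dimension")
    if any(dim is None for dim in shape[1:]):
        raise ValueError(f"dynamic non-batch dimension in {shape}")
    # Only the leading (batch) dimension may be filled in.
    batch = batch_size if shape[0] is None else shape[0]
    return [batch, *shape[1:]]


print(resolve_static_shape([None, 509, 509, 128]))  # prints [1, 509, 509, 128]
```

A shape with a dynamic H or W, such as `[None, None, 509, 128]`, is rejected, which mirrors the case described as unsupported above.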

charvey2718 commented 1 year ago

Thanks for the suggestion for a quick fix. I did try doing a tf2onnx conversion, but it still wouldn't load into OpenCV when LayerNormalization is used, this time giving a different error though (the one below). Just as for the MWE, skipping the LayerNormalization lets it load.

[ERROR:0@5.784] global onnx_importer.cpp:1054 cv::dnn::dnn4_v20221220::ONNXImporter::handleNode DNN/ONNX: ERROR during processing node with 1 inputs and 1 outputs: [ReduceProd]:(onnx_node!ReduceProd__64) from domain='ai.onnx'
Traceback (most recent call last):
  File "MWE.py", line 29, in <module>
    model = cv2.dnn.readNet("tfmodel.onnx")
cv2.error: OpenCV(4.7.0) D:\a\opencv-python\opencv-python\opencv\modules\dnn\src\onnx\onnx_importer.cpp:1073: error: (-2:Unspecified error) in function 'cv::dnn::dnn4_v20221220::ONNXImporter::handleNode'
> Node [ReduceProd@ai.onnx]:(onnx_node!ReduceProd__64) parse error: OpenCV(4.7.0) D:\a\opencv-python\opencv-python\opencv\modules\dnn\src\layers\reduce_layer.cpp:336: error: (-215:Assertion failed) inputs.size() > 0 in function 'cv::dnn::ReduceLayerImpl::getMemoryShapes'

As far as I can tell from the ONNX model, the LayerNormalization layer is still decomposed into sub operators, e.g. strided_slice.

I understood from another issue (sorry, I lost the link) that converting an untrained architecture to ONNX (as in my MWE) can be problematic. Interestingly, when I convert my trained model (containing LayerNormalization) to ONNX then the error changes to

[ERROR:0@0.216] global C:\opencv-4.6.0\modules\dnn\src\onnx\onnx_importer.cpp (1021) handleNode DNN/ONNX: ERROR during processing node with 2 inputs and 1 outputs: [Gather]:(onnx_node!Gather__913) from domain='ai.onnx'
OpenCV(4.6.0) C:\opencv-4.6.0\modules\dnn\src\onnx\onnx_importer.cpp:1040: error: (-2:Unspecified error) in function 'handleNode'
> Node [Gather@ai.onnx]:(onnx_node!Gather__913) parse error: OpenCV(4.6.0) C:\opencv-4.6.0\modules\dnn\src\onnx\onnx_importer.cpp:2907: error: (-215:Assertion failed) indexMat.total() == 1 in function 'parseGather'

From here, this seems to come back to the dynamic shape, i.e., not specifying a batch size. Unfortunately, specifying batch size as 1 then gives yet another error when I try to load it:

[ERROR:0@0.211] global C:\opencv-4.6.0\modules\dnn\src\onnx\onnx_importer.cpp (1021) handleNode DNN/ONNX: ERROR during processing node with 2 inputs and 1 outputs: [Squeeze]:(onnx_node!Squeeze__13) from domain='ai.onnx'
OpenCV(4.6.0) C:\opencv-4.6.0\modules\dnn\src\onnx\onnx_importer.cpp:1040: error: (-2:Unspecified error) in function 'handleNode'
> Node [Squeeze@ai.onnx]:(onnx_node!Squeeze__13) parse error: OpenCV(4.6.0) C:\opencv-4.6.0\modules\dnn\src\onnx\onnx_importer.cpp:2464: error: (-215:Assertion failed) node_proto.input_size() == 1 in function 'parseSqueeze'

Unfortunately, as much as I'd like a quick fix, using the ONNX format hasn't worked and may be a digression from the original issue.

Following your last post, am I right to think that commenting the line in opencv/modules/dnn/src/tensorflow/tf_graph_simplifier.cpp as per item 1 in your list, and changing batch size to 1 in my model, should resolve my issues? (Item 2 in your list did not seem to require any fix.)

fengyuentau commented 1 year ago
  1. Could you share the converted ONNX model or provide the command that you used for the conversion?
  2. Regarding "am I right to think that commenting the line in ...": commenting it out solves item 1, but item 2 will then be the next blocker.

Another quick thing to try on your side: once the model is converted to ONNX, use onnxsim to set a completely fixed input shape. This tool also simplifies the graph by eliminating some redundant operators.
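As a rough sketch of what setting a fixed input shape with onnxsim could look like, the snippet below only builds the command line and prints it. The `--overwrite-input-shape` flag is assumed from recent onnxsim releases (check `onnxsim --help` for your version), and the file names, the input name `gen_input_image`, and the shape `1,512,512,3` are assumptions taken from the MWE in this thread. The helper `onnxsim_command` is hypothetical.

```python
import shlex


def onnxsim_command(src, dst, input_name, shape):
    """Build an onnxsim command line that pins one input to a fixed shape.

    Hypothetical helper; the flag name is an assumption about the
    installed onnxsim version.
    """
    dims = ",".join(str(d) for d in shape)
    return ["onnxsim", src, dst,
            "--overwrite-input-shape", f"{input_name}:{dims}"]


cmd = onnxsim_command("tfmodel.onnx", "tfmodelsim.onnx",
                      "gen_input_image", [1, 512, 512, 3])
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the argument list separately keeps the shape override easy to inspect before anything is executed.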

charvey2718 commented 1 year ago

I added the following to my python MWE above, and called tf2onnx() instead of save_weights():

def tf2onnx():
    model = create_model()
    tf.saved_model.save(model, './tensorflow')
    os.system('python -m tf2onnx.convert --saved-model ./tensorflow --output tfmodel.onnx')
    os.system('onnxsim tfmodel.onnx tfmodelsim.onnx')

Running this outputs the following

# python MWE.py
WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op while saving (showing 1 of 1). These functions will not be directly callable after loading.
C:\Program Files\Python39\lib\runpy.py:127: RuntimeWarning: 'tf2onnx.convert' found in sys.modules after import of package 'tf2onnx', but prior to execution of 'tf2onnx.convert'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
2023-06-28 08:58:54,068 - WARNING - '--tag' not specified for saved_model. Using --tag serve
2023-06-28 08:58:54,175 - INFO - Signatures found in model: [serving_default].
2023-06-28 08:58:54,175 - WARNING - '--signature_def' not specified, using first signature: serving_default
2023-06-28 08:58:54,175 - INFO - Output names: ['layer_normalization']
2023-06-28 08:58:54,222 - INFO - Using tensorflow=2.12.0, onnx=1.14.0, tf2onnx=1.14.0/8f8d49
2023-06-28 08:58:54,222 - INFO - Using opset <onnx, 15>
2023-06-28 08:58:54,237 - INFO - Computed 0 values for constant folding
2023-06-28 08:58:54,253 - INFO - Optimizing ONNX model
2023-06-28 08:58:54,353 - INFO - After optimization: Cast -2 (3->1), Const -3 (9->6), Identity -2 (2->0)
2023-06-28 08:58:54,369 - INFO -
2023-06-28 08:58:54,369 - INFO - Successfully converted TensorFlow model ./tensorflow to ONNX
2023-06-28 08:58:54,369 - INFO - Model inputs: ['gen_input_image']
2023-06-28 08:58:54,369 - INFO - Model outputs: ['layer_normalization']
2023-06-28 08:58:54,369 - INFO - ONNX model is saved at tfmodel.onnx
Simplifying...
Finish! Here is the difference:
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃                    ┃ Original Model ┃ Simplified Model ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ BatchNormalization │ 1              │ 1                │
│ Cast               │ 1              │ 0                │
│ Constant           │ 6              │ 7                │
│ Conv               │ 1              │ 1                │
│ Div                │ 1              │ 1                │
│ Gather             │ 1              │ 0                │
│ ReduceMean         │ 1              │ 1                │
│ ReduceProd         │ 1              │ 0                │
│ ReduceSumSquare    │ 1              │ 1                │
│ Reshape            │ 2              │ 2                │
│ Shape              │ 1              │ 0                │
│ Squeeze            │ 1              │ 1                │
│ Sub                │ 1              │ 1                │
│ Transpose          │ 2              │ 2                │
│ Model Size         │ 2.0MiB         │ 2.0MiB           │
└────────────────────┴────────────────┴──────────────────┘
[ERROR:0@5.033] global onnx_importer.cpp:1054 cv::dnn::dnn4_v20221220::ONNXImporter::handleNode DNN/ONNX: ERROR during processing node with 5 inputs and 1 outputs: [BatchNormalization]:(onnx_node!StatefulPartitionedCall/generator/layer_normalization/FusedBatchNormV3) from domain='ai.onnx'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
cv2.error: OpenCV(4.7.0) D:\a\opencv-python\opencv-python\opencv\modules\dnn\src\onnx\onnx_importer.cpp:1073: error: (-2:Unspecified error) in function 'cv::dnn::dnn4_v20221220::ONNXImporter::handleNode'
> Node [BatchNormalization@ai.onnx]:(onnx_node!StatefulPartitionedCall/generator/layer_normalization/FusedBatchNormV3) parse error: OpenCV(4.7.0) D:\a\opencv-python\opencv-python\opencv\modules\dnn\src\onnx\onnx_importer.cpp:591: error: (-5:Bad argument) Blob Squeeze__11:0 not found in const blobs in function 'cv::dnn::dnn4_v20221220::ONNXImporter::getBlob'

As you can see, onnxsim didn't help with readNet.

I've attached a zip tfmodel.zip with tfmodel.onnx and tfmodelsim.onnx.

Thank you for the work to fix the tf importer. I will download and test it later and let you know how I get on.

fengyuentau commented 1 year ago

image

input_mean and input_var from your converted ONNX model exist but do not have values. This may be due to the conversion of an untrained model (or the TF model not being saved in eval mode), as you said above.

Another option for you to try is to either briefly fake-train your model or set it to eval mode, and then export it.

charvey2718 commented 1 year ago

I thought the same, so I also applied the ONNX conversion + onnxsim process to the trained version of my full architecture. Doing so gives the same error as before, when trying to load the untrained ONNX model containing LayerNormalization. The trained onnxsim version of the full architecture is here (obviously it's a much bigger file, as the architecture is much more complex).

charvey2718 commented 1 year ago

I was just wondering if you still think this bug is fixable, and if so, on what timescale?

Please don’t misunderstand me - I’m not being impatient, and I’m very grateful to you for the time you’ve spent on this already. I’m just trying to get an idea of whether I should wait for a fix, or work on other possible solutions, since the ONNX import workaround you suggested seems to have a similar problem.

I could for instance try the Tensorflow C++ API or the ONNX Runtime. Neither is desirable, especially as my wider project is considerably invested in OpenCV, but it may be necessary.

fengyuentau commented 1 year ago

Hello @charvey2718, sorry for the late response. I was deeply involved in other projects. The fix for your TF model may need a lot of effort, and therefore it is marked as lower priority on my side. I am not sure when I can finish the patch, so I think you can try another inference framework to get started.

charvey2718 commented 1 year ago

No worries. Thanks for your efforts. In the meantime, cppflow looks promising. But I’ll enthusiastically await OpenCV DNN developments all the same!