[Bug] Cannot convert tensorflow efficientnet-b0 trained using --data_format "channels_first"

openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference

https://docs.openvino.ai

Apache License 2.0

[Bug] Cannot convert tensorflow efficientnet-b0 trained using --data_format "channels_first" #10633

Closed KodeWorker closed 2 years ago

KodeWorker commented 2 years ago

System information

OpenVINO => 2021.4.752
Operating System / Platform => Windows 64 Bit
Compiler => Visual Studio 2022
Problem classification: Model Conversion
Framework: TensorFlow
Model name: efficientnet-b0 (channels_first)

Detailed description

I trained a custom efficientnet-b0 model using --data_format "channels_first": https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

I then converted the checkpoint to a frozen graph model: https://gist.github.com/KodeWorker/6bb37a7864d98836356b85bdea2d39eb

Next, I tried to convert the frozen graph model to OpenVINO IR. The conversion failed with the error message shown below:

Model Optimizer arguments:
Common parameters:
        - Path to the Input Model:      /opt/intel/openvino_2021.4.752/efficientnet-b0.pb
        - Path for generated IR:        /opt/intel/openvino_2021.4.752/IR
        - IR output name:       efficientnet-b0
        - Log level:    ERROR
        - Batch:        Not specified, inherited from the model
        - Input layers:         Not specified, inherited from the model
        - Output layers:        Not specified, inherited from the model
        - Input shapes:         Not specified, inherited from the model
        - Mean values:  Not specified
        - Scale values:         Not specified
        - Scale factor:         Not specified
        - Precision of IR:      FP32
        - Enable fusing:        True
        - Enable grouped convolutions fusing:   True
        - Move mean values to preprocess section:       None
        - Reverse input channels:       False
TensorFlow specific parameters:
        - Input model in text protobuf format:  False
        - Path to model dump for TensorBoard:   None
        - List of shared libraries with TensorFlow custom layers implementation:        None
        - Update the configuration file with input/output node names:   None
        - Use configuration file used to generate the model with Object Detection API:  None
        - Use the config file:  None
        - Inference Engine found in:    /opt/intel/openvino_2021.4.752/python/python3.6/openvino
Inference Engine version:       2021.4.2-3974-e2a469a3450-releases/2021/4
Model Optimizer version:        2021.4.2-3974-e2a469a3450-releases/2021/4
Progress: [..............      ]  72.91% done
/home/openvino/.local/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py:15: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
[ ERROR ]  Exception occurred during running replacer "fusing" (<class 'extensions.middle.fusings.Fusing'>): After partial shape inference were found shape collision for node efficientnet-b0/model/stem/tpu_batch_normalization/FusedBatchNormV3/beta (old shape: [  1  32 256 256], new shape: [  1  32 256  -1])

However, when I used the pretrained model, which is "channels_last", the conversion succeeded. Why does the "channels_first" model fail to convert?
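To pinpoint where the two shapes in the error message disagree, a small helper can compare them axis by axis. This is only an illustrative sketch: `differing_axes` is a hypothetical name, not part of Model Optimizer, and reading `-1` as "dimension could not be determined" is an assumption about how the tool reports shapes.

```python
def differing_axes(old_shape, new_shape):
    """Return the indices of the axes whose sizes differ between two shapes."""
    if len(old_shape) != len(new_shape):
        raise ValueError("shapes have different ranks")
    return [i for i, (a, b) in enumerate(zip(old_shape, new_shape)) if a != b]


# Shapes copied from the "shape collision" error message above.
old_shape = [1, 32, 256, 256]
new_shape = [1, 32, 256, -1]

print(differing_axes(old_shape, new_shape))  # -> [3]
```

Only the last axis differs, which is the axis that holds the channels in an NHWC layout and the width in an NCHW layout.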

Model Download Links:

Steps to reproduce

1. Train a custom efficientnet-b0 model using --data_format "channels_first"
2. Export the checkpoint to a frozen graph model
3. Convert the frozen graph model to OpenVINO IR

jgespino commented 2 years ago

Hi @KodeWorker

I'm still looking into the issue. I tested with the latest pre-release dev package and I am seeing a different error message. Did you just retrain the model with --data_format "channels_first", or did you make other changes to the model? Could you provide the model in checkpoint format?

mo --input_model custom-efficientnet-b0.pb
Model Optimizer arguments:
Common parameters:
        - Path to the Input Model:      C:\Users\user\gh10633\custom-efficientnet-b0.pb
        - Path for generated IR:        C:\Users\user\gh10633\.
        - IR output name:       custom-efficientnet-b0
        - Log level:    ERROR
        - Batch:        Not specified, inherited from the model
        - Input layers:         Not specified, inherited from the model
        - Output layers:        Not specified, inherited from the model
        - Input shapes:         Not specified, inherited from the model
        - Source layout:        Not specified
        - Target layout:        Not specified
        - Layout:       Not specified
        - Mean values:  Not specified
        - Scale values:         Not specified
        - Scale factor:         Not specified
        - Precision of IR:      FP32
        - Enable fusing:        True
        - Enable grouped convolutions fusing:   True
        - Move mean values to preprocess section:       None
        - Reverse input channels:       False
        - Use legacy API for model processing:  False
        - Use the transformations config file:  None
TensorFlow specific parameters:
        - Input model in text protobuf format:  False
        - Path to model dump for TensorBoard:   None
        - List of shared libraries with TensorFlow custom layers implementation:        None
        - Update the configuration file with input/output node names:   None
        - Use configuration file used to generate the model with Object Detection API:  None
        - Use the config file:  None
        - OpenVINO runtime found in:    c:\users\user\openvino_dev215\lib\site-packages\openvino
OpenVINO runtime version:       2022.1.0-6682-121d59aa80a
Model Optimizer version:        2022.1.0-6682-121d59aa80a
[ ERROR ]  Cannot infer shapes or values for node "efficientnet-b0/model/head/dense/MatMul".
[ ERROR ]  MatMul input shapes are incorrect. COL_INDEX_DIMs are not equal. Node: efficientnet-b0/model/head/dense/MatMul. Shapes: [masked_array(data=[   1, 1280],
             mask=False,
       fill_value=-1000000007,
            dtype=int64), masked_array(data=[320,   2],
             mask=False,
       fill_value=-1000000007,
            dtype=int64)]
[ ERROR ]
[ ERROR ]  It can happen due to bug in custom shape infer function <function MatMul.infer at 0x00000288710DAD38>.
[ ERROR ]  Or because the node inputs have incorrect values/shapes.
[ ERROR ]  Or because input shapes are incorrect (embedded to the model or passed via --input_shape).
[ ERROR ]  Run Model Optimizer with --log_level=DEBUG for more information.
[ ERROR ]  Exception occurred during running replacer "REPLACEMENT_ID" (<class 'openvino.tools.mo.middle.PartialInfer.PartialInfer'>): Stopped shape/value propagation at "efficientnet-b0/model/head/dense/MatMul" node.
 For more information please refer to Model Optimizer FAQ, question #38. (https://docs.openvino.ai/latest/openvino_docs_MO_DG_prepare_model_Model_Optimizer_FAQ.html?question=38#question-38)
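The MatMul failure in the log above comes down to the inner dimensions of the two operands not matching. A minimal sketch of that check in plain Python follows; `matmul_output_shape` is a hypothetical helper written for illustration, not the Model Optimizer shape-inference function.

```python
def matmul_output_shape(a_shape, b_shape):
    """Return the result shape of a 2-D matrix product.

    Raises ValueError when the inner dimensions do not agree.
    """
    (rows, inner_a), (inner_b, cols) = a_shape, b_shape
    if inner_a != inner_b:
        raise ValueError(f"inner dimensions differ: {inner_a} vs {inner_b}")
    return (rows, cols)


# Shapes copied from the Model Optimizer error above.
try:
    matmul_output_shape((1, 1280), (320, 2))
except ValueError as err:
    print(err)  # -> inner dimensions differ: 1280 vs 320

# For comparison, operands whose inner dimensions agree.
print(matmul_output_shape((1, 320), (320, 2)))  # -> (1, 2)
```

The dense layer weights expect 320 input features, while the tensor reaching the MatMul carries 1280, so one of the two was produced with a different idea of which axis holds the channels.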

Regards, Jesus

jgespino commented 2 years ago

Hi @KodeWorker

Looking at the model, the shapes after efficientnet-b0/model/head/conv2d/Conv2D seem to be causing an issue. Could you confirm the layer shapes are correct? Cropping the model at layer efficientnet-b0/model/head/conv2d/Conv2D converts successfully. I'm not sure whether specifying the data_format affected the layer shapes after the Conv2D layer.

mo --input_model custom-efficientnet-b0.pb --output efficientnet-b0/model/head/conv2d/Conv2D
Model Optimizer arguments:
Common parameters:
        - Path to the Input Model:      C:\Users\user\Downloads\GitHub-Support\gh10633\custom-efficientnet-b0.pb
        - Path for generated IR:        C:\Users\user\Downloads\GitHub-Support\gh10633\.
        - IR output name:       custom-efficientnet-b0
        - Log level:    ERROR
        - Batch:        Not specified, inherited from the model
        - Input layers:         Not specified, inherited from the model
        - Output layers:        efficientnet-b0/model/head/conv2d/Conv2D
        - Input shapes:         Not specified, inherited from the model
        - Source layout:        Not specified
        - Target layout:        Not specified
        - Layout:       Not specified
        - Mean values:  Not specified
        - Scale values:         Not specified
        - Scale factor:         Not specified
        - Precision of IR:      FP32
        - Enable fusing:        True
        - Enable grouped convolutions fusing:   True
        - Move mean values to preprocess section:       None
        - Reverse input channels:       False
        - Use legacy API for model processing:  False
        - Use the transformations config file:  None
TensorFlow specific parameters:
        - Input model in text protobuf format:  False
        - Path to model dump for TensorBoard:   None
        - List of shared libraries with TensorFlow custom layers implementation:        None
        - Update the configuration file with input/output node names:   None
        - Use configuration file used to generate the model with Object Detection API:  None
        - Use the config file:  None
        - OpenVINO runtime found in:    c:\users\user\openvino_dev215\lib\site-packages\openvino
OpenVINO runtime version:       2022.1.0-6682-121d59aa80a
Model Optimizer version:        2022.1.0-6682-121d59aa80a
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: C:\Users\user\Downloads\GitHub-Support\gh10633\custom-efficientnet-b0.xml
[ SUCCESS ] BIN file: C:\Users\user\Downloads\GitHub-Support\gh10633\custom-efficientnet-b0.bin
[ SUCCESS ] Total execution time: 27.14 seconds.

Regards, Jesus

KodeWorker commented 2 years ago

Hi @jgespino ,

Sorry for the late response.

Unfortunately, I cannot release the checkpoint files or the "actual" training code for NDA reasons. However, the efficientnet model is basically the same.

Your finding on the efficientnet-b0/model/head/conv2d/Conv2D layer is a breakthrough! Based on the model code on the official site (https://github.com/tensorflow/tpu/blob/04f6bb6da1502c3551ed4500a4ee396e06243561/models/official/efficientnet/efficientnet_model.py#L676), I added the following checking code to print the exact shapes at the head Conv2D:

outputs = self._conv_head(outputs)
print("*** Head Conv2D shape", outputs.shape)
outputs = self._bn1(outputs, training=training)
print("*** Head BN1 shape", outputs.shape)
outputs = self._relu_fn(outputs)
#outputs = self._relu_fn(
#    self._bn1(self._conv_head(outputs), training=training))

The results are shown:

*** Head Conv2D shape (1, 320, 16, 1280)
*** Head BN1 shape (1, 320, 16, 1280)
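How the printed shape reads depends entirely on the layout it is interpreted under. The sketch below labels each dimension for both layouts; `label_axes` is a hypothetical helper introduced only for this illustration.

```python
def label_axes(shape, layout):
    """Pair each dimension of `shape` with its axis letter from `layout`."""
    if len(shape) != len(layout):
        raise ValueError("shape rank and layout length differ")
    return dict(zip(layout, shape))


printed_shape = (1, 320, 16, 1280)  # copied from the output above

for layout in ("NCHW", "NHWC"):
    print(layout, label_axes(printed_shape, layout))
# -> NCHW {'N': 1, 'C': 320, 'H': 16, 'W': 1280}
# -> NHWC {'N': 1, 'H': 320, 'W': 16, 'C': 1280}
```

Read as NCHW the tensor has 320 channels, and read as NHWC it has 1280, which are the same two numbers that clash in the MatMul error.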

~~The results are inconsistent with those in converted frozen graph model~~

~~I am looking into my model conversion script. I will keep this issue update if I find any solution/workarounds.~~

Thanks for your advice.

B.R., Kelvin

KodeWorker commented 2 years ago

Silly me! :(

The filter shape shown on the Netron page is different from the output shape of efficientnet-b0/model/head/conv2d/Conv2D. I changed line 147 in my conversion script to:

freeze_graph.freeze_graph_with_def_protos(tf.get_default_graph().as_graph_def(add_shapes=True),

The _output_shapes can be found on netron.

[Images: Conv2D, NB]

The output shapes of Conv2D seem OK. Maybe it has something to do with FusedBatchNormV3 ...

KodeWorker commented 2 years ago

I've updated the custom frozen graph.

The error messages are similar to those in my original post:

Model Optimizer arguments:
Common parameters:
        - Path to the Input Model:      /opt/intel/openvino_2021.4.752/efficientnet-b0.pb
        - Path for generated IR:        /opt/intel/openvino_2021.4.752/IR
        - IR output name:       efficientnet-b0
        - Log level:    ERROR
        - Batch:        Not specified, inherited from the model
        - Input layers:         Not specified, inherited from the model
        - Output layers:        Not specified, inherited from the model
        - Input shapes:         Not specified, inherited from the model
        - Mean values:  Not specified
        - Scale values:         Not specified
        - Scale factor:         Not specified
        - Precision of IR:      FP32
        - Enable fusing:        True
        - Enable grouped convolutions fusing:   True
        - Move mean values to preprocess section:       None
        - Reverse input channels:       False
TensorFlow specific parameters:
        - Input model in text protobuf format:  False
        - Path to model dump for TensorBoard:   None
        - List of shared libraries with TensorFlow custom layers implementation:        None
        - Update the configuration file with input/output node names:   None
        - Use configuration file used to generate the model with Object Detection API:  None
        - Use the config file:  None
        - Inference Engine found in:    /opt/intel/openvino_2021.4.752/python/python3.6/openvino
Inference Engine version:       2021.4.2-3974-e2a469a3450-releases/2021/4
Model Optimizer version:        2021.4.2-3974-e2a469a3450-releases/2021/4
Progress: [..............      ]  72.91% done
/home/openvino/.local/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py:15: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
[ ERROR ]  Exception occurred during running replacer "fusing" (<class 'extensions.middle.fusings.Fusing'>): After partial shape inference were found shape collision for node efficientnet-b0/model/stem/tpu_batch_normalization/FusedBatchNormV3/beta (old shape: [  1  32 256 256], new shape: [  1  32 256  -1])

KodeWorker commented 2 years ago

Hi @jgespino,

The shape mismatch may be caused by an incorrect layout when converting batch normalization.

I placed the following checking code in decomposition.py to make sure that FusedBatchNormV3 is converted in the correct layout:

print("layout => ", graph.graph['layout'])

https://github.com/openvinotoolkit/openvino/blob/59cfdce73b3a97186956a7229a21b0d7aee010cc/tools/mo/openvino/tools/mo/middle/passes/fusing/decomposition.py#L61

I built OpenVINO from source (2022.1.0.dev20220218), and then tried to convert the frozen graph model. It seems that batch normalization is converted using unexpected layout "NHWC" while the data_format in frozen graph is "NCHW".

How can I fix this? Are there any pointers?
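To illustrate why the assumed layout matters for batch normalization, here is a rough sketch of how per-channel parameters such as beta would be broadcast under each layout. This is not the code in decomposition.py; the helper names are hypothetical, and the parameter length of 32 is an assumption taken from the second axis of the stem shape in the original error.

```python
def channel_axis(layout):
    """Index of the channel axis in a layout string such as "NCHW"."""
    return layout.index("C")


def bn_param_broadcast_shape(num_channels, layout):
    """Shape a per-channel vector needs in order to broadcast over a tensor."""
    shape = [1] * len(layout)
    shape[channel_axis(layout)] = num_channels
    return shape


def params_match_input(input_shape, num_params, layout):
    """True if the channel dimension of the input equals the parameter count."""
    return input_shape[channel_axis(layout)] == num_params


stem_shape = [1, 32, 256, 256]  # copied from the original error message
num_params = 32                 # assumed length of beta (one value per channel)

print(bn_param_broadcast_shape(num_params, "NCHW"))      # -> [1, 32, 1, 1]
print(bn_param_broadcast_shape(num_params, "NHWC"))      # -> [1, 1, 1, 32]
print(params_match_input(stem_shape, num_params, "NCHW"))  # -> True
print(params_match_input(stem_shape, num_params, "NHWC"))  # -> False
```

Under NHWC the 32 parameters would be lined up against the last axis of size 256, which is consistent with a shape collision on that axis.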

B.R., Kelvin

jgespino commented 2 years ago

Hi @KodeWorker

Thank you for looking into the two ops in your model. Let me reach out to the development team to see if this is a bug in MO.

Regards, Jesus

Ref. 80974

jgespino commented 2 years ago

Hi @KodeWorker

Could you double check your TensorFlow model? We tried to infer using TensorFlow and are running into the following error.

ValueError: Dimensions must be equal, but are 1280 and 320 for 'node import/efficientnet-b0/model/head/tpu_batch_normalization/FusedBatchNormV3 =
FusedBatchNormV3[T=DT_FLOAT, U=DT_FLOAT, _output_shapes=[[1,16,16,1280], [1280], [1280], [1280], [1280], <unknown>], data_format="NHWC", epsilon=0.001,
exponential_avg_factor=1, is_training=false](import/efficientnet-b0/model/head/conv2d/Conv2D, import/efficientnet-b0/model/head/tpu_batch_normalization/ReadVariableOp,
import/efficientnet-b0/model/head/tpu_batch_normalization/ReadVariableOp_1, import/efficientnet-b0/model/head/tpu_batch_normalization/FusedBatchNormV3/ReadVariableOp,
import/efficientnet-b0/model/head/tpu_batch_normalization/FusedBatchNormV3/ReadVariableOp_1)' with input shapes: [1,16,16,1280], [320], [320], [320], [320].

Regards, Jesus

KodeWorker commented 2 years ago

Hi @jgespino,

I did not get the error message using TF 2.5, but it is bound to happen when using TF 1.15. BTW, converting the frozen model to ONNX also failed with TF 1.15, while it did not with TF 2.5. Does this have something to do with TensorFlow?

B.R., Kelvin

jgespino commented 2 years ago

Hi @KodeWorker

The development team has created a pull request to address this issue. Please take a look: https://github.com/openvinotoolkit/openvino/pull/11084

Regards, Jesus

KodeWorker commented 2 years ago

Hi @jgespino ,

Thank you for the help. I am looking forward to this update.

As a temporary fix, the workaround is to convert the .pb model to .onnx using TF 2.5 + tf2onnx, and then convert the .onnx model to IR.
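The two steps of this workaround could be scripted roughly as below. It is only a sketch: the file names and the tensor names "input:0" and "output:0" are placeholders that must be replaced with the real ones from the frozen graph, and `build_workaround_commands` is a hypothetical helper that only assembles the command lines without running them.

```python
def build_workaround_commands(pb_path, onnx_path, inputs, outputs):
    """Assemble the tf2onnx and Model Optimizer command lines as argv lists."""
    tf2onnx_cmd = [
        "python", "-m", "tf2onnx.convert",
        "--graphdef", pb_path,            # frozen TensorFlow graph
        "--output", onnx_path,            # ONNX file to write
        "--inputs", ",".join(inputs),     # input tensor names
        "--outputs", ",".join(outputs),   # output tensor names
    ]
    mo_cmd = ["mo", "--input_model", onnx_path]
    return tf2onnx_cmd, mo_cmd


# Placeholder paths and tensor names, for illustration only.
tf2onnx_cmd, mo_cmd = build_workaround_commands(
    "efficientnet-b0.pb", "efficientnet-b0.onnx", ["input:0"], ["output:0"]
)
print(" ".join(tf2onnx_cmd))
print(" ".join(mo_cmd))
```

Each list can then be passed to subprocess.run(cmd, check=True) in an environment where TF 2.5, tf2onnx, and the OpenVINO dev tools are installed.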

B.R., Kelvin