openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: GroupNormalization crashes openvino model benchmark on NPU #24462

Open EmbeddedPaul166 opened 4 months ago

EmbeddedPaul166 commented 4 months ago

OpenVINO Version

2024.1.0

Operating System

Windows System

Device used for inference

NPU

Framework

Keras (TensorFlow 2)

Model used

Custom

Issue description

Consider the following workflow:

  1. A custom convolutional model with a dynamic input shape and a GroupNormalization layer is created in Keras (TensorFlow) and saved in the saved_model format.
  2. Using Model Optimizer, its input shape is made static and the saved_model.bin and .xml files are generated.
  3. Using benchmark_app, the model is run on NPU.

Problem: adding the GroupNormalization layer makes benchmark_app crash on NPU.

Tests were performed on a laptop with an Intel Core Ultra 7 155H CPU. The TensorFlow version was 2.14.0.

Step-by-step reproduction

Step 1: Model creation in TensorFlow

import tensorflow as tf

inp = tf.keras.Input((None, None, 1), dtype=tf.float32)
y = tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu')(inp)
y = tf.keras.layers.GroupNormalization(4)(y)
model = tf.keras.Model(inputs=inp, outputs=y)
tf.keras.models.save_model(model, 'test_model')
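
Before converting, the SavedModel can be checked to confirm that dynamic spatial dimensions work in TensorFlow itself (a minimal sketch; the dummy input size is arbitrary):

import tensorflow as tf

# reload the SavedModel and run a dummy inference at an arbitrary spatial size
model = tf.keras.models.load_model('test_model')
out = model(tf.zeros([1, 64, 64, 1], dtype=tf.float32))
print(out.shape)  # expected: (1, 64, 64, 32)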

Step 2: Model conversion to openvino format

mo.exe --saved_model_dir .\test_model --input input_1 --input_shape [1,256,256,1]
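
For reference, the same conversion can be done from Python with the openvino.convert_model API (a sketch, not part of the original report; the input name input_1 is the Keras default from Step 1):

from openvino import convert_model, save_model

# freeze the dynamic input to a static [1, 256, 256, 1] shape, as mo.exe does above
ov_model = convert_model('test_model', input=("input_1", [1, 256, 256, 1]))
save_model(ov_model, 'saved_model.xml')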

Step 3: Performing benchmark on NPU

benchmark_app.exe -m saved_model.xml -hint ctput -data_shape "[1, 256, 256, 1]" -inference_only -report_type detailed_counters -d NPU
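
The failure can also be reproduced without benchmark_app by compiling the IR directly from Python (a minimal sketch):

from openvino import Core

core = Core()
model = core.read_model('saved_model.xml')
# on affected versions this raises the RuntimeError shown in the log below
compiled_model = core.compile_model(model, 'NPU')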

Relevant log output

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.1.0-15008-f4afc983258-releases/2024/1
[ INFO ]
[ INFO ] Device info:
[ INFO ] NPU
[ INFO ] Build ................................. 2024.1.0-15008-f4afc983258-releases/2024/1
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for NPU device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 9.01 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_1 (node: input_1) : f32 / [...] / [1,256,256,1]
[ INFO ] Model outputs:
[ INFO ]     group_normalization (node: model/group_normalization/Reshape_4) : f32 / [...] / [1,256,256,32]
[Step 5/11] Resizing model to match image sizes and given batch
[ WARNING ] Input 'input_1' has static shape. Provided data shapes for this input will be ignored.
[ INFO ] Model batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     input_1 (node: input_1) : f32 / [N,H,W,C] / [1,256,256,1]
[ INFO ] Model outputs:
[ INFO ]     group_normalization (node: model/group_normalization/Reshape_4) : f32 / [...] / [1,256,256,32]
[Step 7/11] Loading the model to the device
loc(fused["model/group_normalization/batchnorm/add_1", "t_Add"]): error: Reshape has incompatible output shape as clustering: in type = !VPUIP.DistributedBuffer<1x4x8x1xf16, affine_map<(d0, d1, d2, d3) -> (d0, d2, d3, d1)>, @CMX_NN, {mode = "SEGMENTED", num_tiles = [1, 1, 2, 1], num_clusters = 2 : i64}>, out type = !VPUIP.DistributedBuffer<1x1x4x8xf16, affine_map<(d0, d1, d2, d3) -> (d0, d1, d3, d2)>, @CMX_NN, {mode = "SEGMENTED", num_tiles = [1, 1, 2, 1], num_clusters = 2 : i64}>
[ ERROR ] Exception from src\inference\src\cpp\core.cpp:109:
Exception from src\inference\src\dev\plugin.cpp:54:
Exception from src\plugins\intel_npu\src\plugin\src\plugin.cpp:513:
Check 'result == ZE_RESULT_SUCCESS' failed at src\plugins\intel_npu\src\compiler\src\zero_compiler_in_driver.cpp:753:
Failed to compile network. L0 createGraph result: ZE_RESULT_ERROR_INVALID_ARGUMENT, code 0x78000004. Compilation failed
Failed to create executable

Traceback (most recent call last):
  File "C:\Users\user\micromamba\envs\opvtest\lib\site-packages\openvino\tools\benchmark\main.py", line 408, in main
    compiled_model = benchmark.core.compile_model(model, benchmark.device, device_config)
  File "C:\Users\user\micromamba\envs\opvtest\lib\site-packages\openvino\runtime\ie_api.py", line 521, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
RuntimeError: Exception from src\inference\src\cpp\core.cpp:109:
Exception from src\inference\src\dev\plugin.cpp:54:
Exception from src\plugins\intel_npu\src\plugin\src\plugin.cpp:513:
Check 'result == ZE_RESULT_SUCCESS' failed at src\plugins\intel_npu\src\compiler\src\zero_compiler_in_driver.cpp:753:
Failed to compile network. L0 createGraph result: ZE_RESULT_ERROR_INVALID_ARGUMENT, code 0x78000004. Compilation failed
Failed to create executable

[ INFO ] Statistics report is stored to benchmark_report.csv

EmbeddedPaul166 commented 3 months ago

Any updates on this?

dziulek commented 2 months ago

I have the same issue with GroupNormalization. I can't go further with my research without this fix. I hope you will find a solution.

avitial commented 1 month ago

Ref. 149211

avitial commented 3 weeks ago

@EmbeddedPaul166 I performed a quick test with the provided steps on an MTL (Meteor Lake) system with NPU (Intel Core Ultra 7 155H) and the issue is not observed. Please try the latest OpenVINO version 2024.3 and the latest NPU driver and see if the issue is fixed on your end. You can refer to the model conversion to OpenVINO IR in the code snippet below. Hope this helps.

@dziulek, please also try on your end with the latest OpenVINO runtime and NPU driver. If the issue persists, please share a sample reproducer (model definition, conversion steps, application code).
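
Before retesting, the installed runtime build and NPU visibility can be confirmed from Python (a quick sketch, assuming the openvino package is installed):

import openvino as ov

print(ov.get_version())             # expect a 2024.3.x build string
print(ov.Core().available_devices)  # 'NPU' should appear in the list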

# tf model definition
import tensorflow as tf

inp = tf.keras.Input((None, None, 1), dtype=tf.float32)
y = tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu')(inp)
y = tf.keras.layers.GroupNormalization(4)(y)
model = tf.keras.Model(inputs=inp, outputs=y)
tf.keras.models.save_model(model, 'test_model.keras')

# model conversion to OpenVINO IR
import tensorflow as tf
from openvino import convert_model, save_model

model_1_path = "test_model.keras"
model = tf.keras.models.load_model(model_1_path)
model.export('test_model')
ov_model = convert_model('test_model', input=("input_layer", [1,256,256,1]))
save_model(ov_model, 'test_model.xml')

$ benchmark_app -m test_model.xml -d NPU -t 5
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.3.0-16041-1e3b88e4e3f-releases/2024/3
[ INFO ]
[ INFO ] Device info:
[ INFO ] NPU
[ INFO ] Build ................................. 2024.3.0-16041-1e3b88e4e3f-releases/2024/3
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(NPU) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 2.54 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_layer (node: input_layer) : f32 / [...] / [1,256,256,1]
[ INFO ] Model outputs:
[ INFO ]     output_0 (node: functional_1/group_normalization_1/Reshape_3) : f32 / [...] / [1,256,256,32]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     input_layer (node: input_layer) : f32 / [N,H,W,C] / [1,256,256,1]
[ INFO ] Model outputs:
[ INFO ]     output_0 (node: functional_1/group_normalization_1/Reshape_3) : f32 / [...] / [1,256,256,32]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 468.60 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   DEVICE_ID:
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   EXECUTION_DEVICES: NPU
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   INFERENCE_PRECISION_HINT: <Type: 'float16'>
[ INFO ]   LOADED_FROM_CACHE: False
[ INFO ]   MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]   NETWORK_NAME: TensorFlow_Frontend_IR
[ INFO ]   NPU_COMPILATION_MODE_PARAMS:
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 1
[ INFO ]   PERF_COUNT: False
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input_layer'!. This input will be filled with random values!
[ INFO ] Fill input 'input_layer' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 5000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 150.87 ms
[Step 11/11] Dumping statistics report
[ INFO ] Execution Devices:NPU
[ INFO ] Count:            44 iterations
[ INFO ] Duration:         5547.39 ms
[ INFO ] Latency:
[ INFO ]    Median:        501.86 ms
[ INFO ]    Average:       486.99 ms
[ INFO ]    Min:           145.57 ms
[ INFO ]    Max:           523.53 ms
[ INFO ] Throughput:   7.93 FPS

EmbeddedPaul166 commented 2 days ago

It works now, thank you :D