openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Performance]: Latency is much longer than RealTime when using benchmark_app for profiling #24517

Open cold-blue opened 3 months ago

cold-blue commented 3 months ago

OpenVINO Version

No response

Operating System

Windows System

Device used for inference

None

OpenVINO installation

PyPi

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

N/A

Model quantization

Yes

Target Platform

No response

Performance issue description

When using benchmark_app to profile a model, I find that the latency in report.csv is much longer than the total RealTime computed in detailed_counter_report.csv. In report.csv: (screenshot attached) In detailed_counter_report.csv: (screenshot attached)

I wonder whether this is a normal phenomenon and what is the reason?

Step-by-step reproduction

No response

Issue submission checklist

dmitry-gorokhov commented 3 months ago

Hi @cold-blue! Which device did you use for inference: CPU or GPU?

Jessielovecodings commented 3 months ago

.take

github-actions[bot] commented 3 months ago

Thank you for looking into this issue! Please let us know if you have any questions or require any help.

Jessielovecodings commented 3 months ago

Hi @cold-blue, when you use benchmark_app to profile model performance and the latency reported in report.csv is much longer than the total RealTime computed in detailed_counter_report.csv, this can be due to the following reasons:

  1. Initial loading and initialization time: benchmark_app has to load the model and perform initialization (reading the model file, initializing the plugin, allocating memory, etc.) before inference starts. This time is usually included in the overall reported numbers but not in the per-layer counter statistics.

  2. Data preprocessing and postprocessing time: the reported latency may include data preprocessing and postprocessing, whereas detailed_counter_report.csv primarily records the actual execution time of the inference engine's operations.

  3. Differences in measurement methods: different timing methods produce different numbers. The latency in the report is measured on the host around each inference request, while detailed_counter_report.csv only records the time spent inside the executed layers (a short sketch illustrating this difference follows this list).

  4. Batch size and synchronous/asynchronous execution: if batching or asynchronous execution is used, the per-layer statistics and the reported latency can diverge further, because asynchronous timing may include additional synchronization overhead.
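
To make the measurement difference concrete, here is a minimal sketch using the legacy openvino.inference_engine API (the same API as the example further down, deprecated in recent OpenVINO releases), assuming placeholder model files model.xml/model.bin, a CPU device, and random input data: it enables the performance counters, sums the per-layer real_time values, and compares that sum with the wall-clock latency of one synchronous inference.

import time
import numpy as np
from openvino.inference_engine import IECore  # legacy API, deprecated in recent releases

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")  # placeholder model files

# Enable per-layer performance counters (the same data benchmark_app dumps
# into detailed_counter_report.csv when -report_type detailed_counters is used)
exec_net = ie.load_network(network=net, device_name="CPU", config={"PERF_COUNT": "YES"})

input_blob = next(iter(net.input_info))
shape = net.input_info[input_blob].input_data.shape
data = np.random.rand(*shape).astype(np.float32)  # placeholder random input

# Wall-clock latency of one synchronous inference, measured on the host
t0 = time.time()
exec_net.infer(inputs={input_blob: data})
wall_clock_ms = (time.time() - t0) * 1000

# Sum of per-layer real_time values (reported in microseconds) for executed layers only
perf = exec_net.requests[0].get_perf_counts()
layer_sum_ms = sum(v["real_time"] for v in perf.values() if v["status"] == "EXECUTED") / 1000.0

print(f"Wall-clock latency:     {wall_clock_ms:.2f} ms")
print(f"Sum of layer real_time: {layer_sum_ms:.2f} ms")
print(f"Difference:             {wall_clock_ms - layer_sum_ms:.2f} ms")

If the summed real_time is noticeably smaller than the wall-clock number, the remainder was spent outside the executed layers (scheduling, synchronization, data copies, Python overhead), which is one plausible source of the gap you observed.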

How to solve:

  1. Break down the time overhead: add logging or further analysis to measure each step separately (model loading, initialization, data preprocessing, the inference call itself, data postprocessing). This helps identify which steps are the most time-consuming.

  2. Compare execution time in different modes: run benchmark_app in both synchronous and asynchronous modes and compare the results to see how the execution mode affects the reported latency (see the sketch after this list).

  3. Check and optimize the initialization process: if initialization is time-consuming, consider optimizing the model loading and initialization steps, for example by reducing the model size or optimizing the model file.

  4. Verify data processing steps: make sure preprocessing and postprocessing are efficient and do not add unnecessary overhead; optimize them if they turn out to be time-consuming.
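
As referenced in item 2 above, here is a minimal sketch of such a comparison with the same legacy API; the model files, input shape, iteration count, and number of parallel infer requests are placeholder assumptions. It times a run of back-to-back synchronous infers against the same number of infers issued asynchronously across several requests.

import time
import numpy as np
from openvino.inference_engine import IECore  # legacy API, deprecated in recent releases

NUM_ITERS = 32     # placeholder iteration count
NUM_REQUESTS = 4   # placeholder number of parallel infer requests

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")  # placeholder model files
input_blob = next(iter(net.input_info))
shape = net.input_info[input_blob].input_data.shape
data = np.random.rand(*shape).astype(np.float32)  # placeholder random input

# Synchronous mode: one request, inferences run back to back
sync_net = ie.load_network(network=net, device_name="CPU", num_requests=1)
t0 = time.time()
for _ in range(NUM_ITERS):
    sync_net.infer(inputs={input_blob: data})
sync_total = time.time() - t0

# Asynchronous mode: several requests in flight at the same time
async_net = ie.load_network(network=net, device_name="CPU", num_requests=NUM_REQUESTS)
t0 = time.time()
for i in range(NUM_ITERS):
    req = async_net.requests[i % NUM_REQUESTS]
    if i >= NUM_REQUESTS:
        req.wait(-1)  # wait until this request slot finished its previous inference
    req.async_infer(inputs={input_blob: data})
for req in async_net.requests:
    req.wait(-1)  # drain the remaining in-flight requests
async_total = time.time() - t0

print(f"Sync : {NUM_ITERS} infers in {sync_total:.3f} s "
      f"({sync_total / NUM_ITERS * 1000:.2f} ms per infer)")
print(f"Async: {NUM_ITERS} infers in {async_total:.3f} s "
      f"(throughput improves, but per-request latency can grow)")

benchmark_app performs essentially the same comparison through its -api sync / -api async option, so running it in both modes, as suggested above, is usually the quickest check.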

Here is example code that breaks down the time overhead of each step (it uses the legacy openvino.inference_engine API, which is deprecated in recent OpenVINO releases):

import time
from openvino.inference_engine import IECore  # legacy API, deprecated in recent releases

# Record start time
start_time = time.time()

# Create IE instance
ie = IECore()
ie_creation_time = time.time()

# Read model
net = ie.read_network(model="model.xml", weights="model.bin")
read_network_time = time.time()

# Load model to device
exec_net = ie.load_network(network=net, device_name="CPU")
load_network_time = time.time()

# Prepare input data
input_blob_name = next(iter(net.input_info))  # name of the model's first input
input_data = prepare_input_data()  # placeholder, replace with actual data preparation steps
prepare_input_data_time = time.time()

# Perform inference
results = exec_net.infer(inputs={input_blob_name: input_data})
infer_time = time.time()

# Record total time
end_time = time.time()

# Output time statistics
print(f"Total time: {end_time - start_time:.2f} seconds")
print(f"IE creation time: {ie_creation_time - start_time:.2f} seconds")
print(f"Read network time: {read_network_time - ie_creation_time:.2f} seconds")
print(f"Load network time: {load_network_time - read_network_time:.2f} seconds")
print(f"Prepare input data time: {prepare_input_data_time - load_network_time:.2f} seconds")
print(f"Infer time: {infer_time - prepare_input_data_time:.2f} seconds")

This approach gives you a detailed view of how long each step takes, so you can target the slowest ones and better understand where the extra latency comes from.

Jessielovecodings commented 3 months ago

@cold-blue Summary

  1. Break down the latency into specific steps to identify where the most time is being spent.
  2. Compare synchronous and asynchronous execution modes to understand their impact on total latency.
  3. Optimize the initialization, data preprocessing, and postprocessing steps to reduce overhead.
  4. Ensure measurement consistency by understanding the difference between the end-to-end latency and the per-layer execution times reported by the performance counters.