openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Good First Issue]: if operation is incorrectly reporting profiling data in GPU #24244

Open p-durandin opened 7 months ago

p-durandin commented 7 months ago

Context

Performance counter numbers are reported incorrectly for the If operation. For models containing an If layer, the reported time is almost zero, and there is a big delta between the sum of the per-operation latencies and the end-to-end latency reported by benchmark_app.

Counter statistics are critical for performance analysis and optimization.

Below, "Total time" is reported by -pc, whereas "Latency" is the end-to-end inference latency computed by benchmark_app.

What needs to be done?

Fix the performance counters for the If operation so that adequate per-layer statistics are reported.
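The expected behavior can be illustrated with a small sketch (pure Python with hypothetical data; these are not the GPU plugin's actual profiling structures): the If node's reported time should aggregate the times of all primitives executed inside its selected branch.

```python
# Hypothetical per-primitive profiling entries for an If body
# (names and times are made up for illustration).
body_entries = [
    {"name": "add_1", "real_time_ms": 3.1},
    {"name": "add_2", "real_time_ms": 2.9},
    {"name": "result", "real_time_ms": 0.4},
]

def aggregate_if_time(entries):
    """Expected report: the If node accounts for every inner-body
    primitive, not only the last one (the bug described in this issue)."""
    return round(sum(e["real_time_ms"] for e in entries), 3)

print(aggregate_if_time(body_entries))  # 6.4
```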

Example Pull Requests

No response

Resources

Contact points

sshlyapn, p-durandin

Ticket

No response

awayzjj commented 7 months ago

@p-durandin Hi, I wonder if I can work on this issue without an Intel GPU?

awayzjj commented 7 months ago

.take

github-actions[bot] commented 7 months ago

Thank you for looking into this issue! Please let us know if you have any questions or require any help.

yury-gorbachev commented 7 months ago

@awayzjj - since you've taken it - I assume you have an Intel GPU? :)

awayzjj commented 7 months ago

@yury-gorbachev Tomorrow I will try to find a computer with an Intel GPU, or I will give up this ticket. :)

yury-gorbachev commented 7 months ago

> Tomorrow I will try to find a computer with an Intel GPU, or I will give up this ticket. :)

Any Intel SoC laptop should work, it typically has iGPU on it...

awayzjj commented 6 months ago

@yury-gorbachev @p-durandin @sshlyapn I conducted benchmarking on a model with only one If operation as follows; however, I have two questions:

  1. What is meant by "the reported time is almost zero" in the issue context, given that the Total time marked with underscores in the provided picture is 566.967 milliseconds?

  2. Why do you think "performance counter numbers work incorrectly"? What should be the correct relationship between them? In the example you provided, they are 566.967 ms vs. 4398.12 ms respectively, while in my example, they are 0.002 ms vs. 2.79 ms.

$ ./bin/intel64/Release/benchmark_app -m ./tmp/if.xml -d GPU -pc
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.2.0-15181-7ca4f07a3c4
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.2.0-15181-7ca4f07a3c4
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(GPU) performance hint will be set to THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 3.36 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Network inputs:
[ INFO ]     cond (node: cond) : boolean / [...] / []
[ INFO ] Network outputs:
[ INFO ]     res (node: res) : f32 / [...] / [5]
[Step 5/11] Resizing model to match image sizes and given batch
[ WARNING ] cond: layout is not set explicitly. It is STRONGLY recommended to set layout manually to avoid further issues.
[Step 6/11] Configuring input of the model
[ WARNING ] No batch dimension was found, asssuming batch to be 1. Beware: this might affect FPS calculation.
[ INFO ] Model batch size: 1
[ INFO ] Network inputs:
[ INFO ]     cond (node: cond) : boolean / [...] / []
[ INFO ] Network outputs:
[ INFO ]     res (node: res) : f32 / [...] / [5]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 599.08 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: test_if
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERF_COUNT: YES
[ INFO ]   ENABLE_CPU_PINNING: NO
[ INFO ]   MODEL_PRIORITY: MEDIUM
[ INFO ]   GPU_HOST_TASK_PRIORITY: MEDIUM
[ INFO ]   GPU_QUEUE_PRIORITY: MEDIUM
[ INFO ]   GPU_QUEUE_THROTTLE: MEDIUM
[ INFO ]   GPU_ENABLE_LOOP_UNROLLING: YES
[ INFO ]   GPU_DISABLE_WINOGRAD_CONVOLUTION: NO
[ INFO ]   CACHE_DIR:
[ INFO ]   CACHE_MODE: optimize_speed
[ INFO ]   PERFORMANCE_HINT: THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: PERFORMANCE
[ INFO ]   COMPILATION_NUM_THREADS: 8
[ INFO ]   NUM_STREAMS: 2
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   INFERENCE_PRECISION_HINT: f16
[ INFO ]   DEVICE_ID: 0
[ INFO ]   EXECUTION_DEVICES: GPU.0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given: all inputs will be filled with random values!
[ INFO ] Test Config 0
[ INFO ] cond  ([...], boolean, [], static):    random (binary data/numpy array is expected)
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 0.35 ms

[Step 11/11] Dumping statistics report
[ INFO ] Performance counts for 0-th infer request:
Convert_2253         OPTIMIZED_OUT        layerType: Convert              execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
If_36                EXECUTED             layerType: If                   execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
cond                 EXECUTED             layerType: Parameter            execType: wait_for_events__u8  realTime (ms): 0.000      cpuTime (ms): 0.002
res                  OPTIMIZED_OUT        layerType: Convert              execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
res/sink_port_0      EXECUTED             layerType: Result               execType: reorder_data__f32    realTime (ms): 0.002      cpuTime (ms): 0.007
Total time:              0.002 milliseconds
Total CPU time:          0.009 milliseconds

Full device name: Intel(R) UHD Graphics (iGPU)

[ INFO ] Performance counts for 1-th infer request:
Convert_2253         OPTIMIZED_OUT        layerType: Convert              execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
If_36                EXECUTED             layerType: If                   execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
cond                 EXECUTED             layerType: Parameter            execType: wait_for_events__u8  realTime (ms): 0.000      cpuTime (ms): 0.002
res                  OPTIMIZED_OUT        layerType: Convert              execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
res/sink_port_0      EXECUTED             layerType: Result               execType: reorder_data__f32    realTime (ms): 0.002      cpuTime (ms): 0.007
Total time:              0.002 milliseconds
Total CPU time:          0.009 milliseconds

Full device name: Intel(R) UHD Graphics (iGPU)

[ INFO ] Performance counts for 2-th infer request:
Convert_2253         OPTIMIZED_OUT        layerType: Convert              execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
If_36                EXECUTED             layerType: If                   execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
cond                 EXECUTED             layerType: Parameter            execType: wait_for_events__u8  realTime (ms): 0.000      cpuTime (ms): 0.002
res                  OPTIMIZED_OUT        layerType: Convert              execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
res/sink_port_0      EXECUTED             layerType: Result               execType: reorder_data__f32    realTime (ms): 0.002      cpuTime (ms): 0.007
Total time:              0.002 milliseconds
Total CPU time:          0.009 milliseconds

Full device name: Intel(R) UHD Graphics (iGPU)

[ INFO ] Performance counts for 3-th infer request:
Convert_2253         OPTIMIZED_OUT        layerType: Convert              execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
If_36                EXECUTED             layerType: If                   execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
cond                 EXECUTED             layerType: Parameter            execType: wait_for_events__u8  realTime (ms): 0.000      cpuTime (ms): 0.002
res                  OPTIMIZED_OUT        layerType: Convert              execType: undef                realTime (ms): 0.000      cpuTime (ms): 0.000
res/sink_port_0      EXECUTED             layerType: Result               execType: reorder_data__f32    realTime (ms): 0.002      cpuTime (ms): 0.007
Total time:              0.002 milliseconds
Total CPU time:          0.009 milliseconds

Full device name: Intel(R) UHD Graphics (iGPU)

[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count:               686692 iterations
[ INFO ] Duration:            60000.73 ms
[ INFO ] Latency:
[ INFO ]    Median:           0.38 ms
[ INFO ]    Average:          0.34 ms
[ INFO ]    Min:              0.22 ms
[ INFO ]    Max:              2.79 ms
[ INFO ] Throughput:          11444.73 FPS

I would greatly appreciate any guidance or suggestions you can offer. Thank you!

sshlyapn commented 6 months ago

Hi @awayzjj , thank you for checking the issue!

Let me add more details. The issue is connected with an incorrect performance profiling report for the IF operation. It is expected that the IF operation aggregates the execution time of all operations contained in its inner body. However, as I can see, it reports the execution time only for the last primitive in the IF’s inner body.

Regarding your questions:

  1. I believe the original description meant not "the reported time is almost zero", but rather "the IF operator doesn’t display the real execution time of all underlying operations".
  2. For checking the correctness of performance profiling, I think it's better to use the benchmark_app with the “-hint latency” option. This will run only a single inference request, and in that case, we can expect that the reported “Total time” metric from per-layer profiling would be more or less close to the final latency and throughput results reported by benchmark_app.

So, to debug this issue, you should use a simple model with an IF operation. The model should contain at least two operations in its then/else body (the type of the operations doesn't matter here, but it's better to use bigger shapes just to make the effect more visible).
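The sanity check suggested above can be expressed as a small helper (pure Python; the function name and tolerance are illustrative choices, not part of benchmark_app): with a single inference request, the per-layer "Total time" should land reasonably close to the end-to-end latency, and a large relative gap hints at broken counters. The sample numbers come from the measurements posted later in this thread.

```python
def counters_consistent(total_layer_time_ms, median_latency_ms, tolerance=0.25):
    """Return True if the summed per-layer time is within `tolerance`
    (relative) of the end-to-end latency; a large gap suggests that
    some primitives are missing from the profiling report."""
    if median_latency_ms <= 0:
        return False
    gap = abs(median_latency_ms - total_layer_time_ms) / median_latency_ms
    return gap <= tolerance

# Healthy case: counters roughly account for the whole inference.
print(counters_consistent(14.464, 14.61))  # True
# Buggy case from this thread: the If body's time is missing.
print(counters_consistent(8.329, 18.01))   # False
```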

mlukasze commented 4 months ago

Hey @awayzjj, will you work on this? (You have many tickets assigned.)

mamanain commented 2 months ago

Hello! If no one minds, I am going to assign the issue to myself.

I have managed to reproduce the issue with two graphs - one with multiple add operations in an if layer and another with the same additions outside of the if layer.

I get the following results on GPU. With an If layer:

Total time:              8.329 milliseconds
Total CPU time:          0.023 milliseconds

Full device name: Intel(R) HD Graphics 520 (iGPU)

[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count:               3296 iterations
[ INFO ] Duration:            60005.03 ms
[ INFO ] Latency:
[ INFO ]    Median:           18.01 ms
[ INFO ]    Average:          18.20 ms
[ INFO ]    Min:              17.71 ms
[ INFO ]    Max:              26.25 ms
[ INFO ] Throughput:          54.93 FPS

Without an If layer:

Total time:              15.189 milliseconds
Total CPU time:          0.018 milliseconds

Full device name: Intel(R) HD Graphics 520 (iGPU)

[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count:               3811 iterations
[ INFO ] Duration:            60000.57 ms
[ INFO ] Latency:
[ INFO ]    Median:           15.48 ms
[ INFO ]    Average:          15.74 ms
[ INFO ]    Min:              15.25 ms
[ INFO ]    Max:              21.93 ms
[ INFO ] Throughput:          63.52 FPS

When I run the models on the CPU, 'Total time' is the same for both of them. Do I understand correctly that the difference in Total time demonstrates the issue?

mamanain commented 2 months ago

.take

github-actions[bot] commented 2 months ago

Thank you for looking into this issue! Please let us know if you have any questions or require any help.

mamanain commented 2 months ago

@sshlyapn @p-durandin

Hello! I seem to have found the problem: only the events corresponding to the output operations of the conditional operator's inner graph were returned, while the rest of the graph was ignored.
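The idea behind the fix can be sketched as follows (pure Python; `Event` and the helper names are hypothetical, since the real change lives in the GPU plugin's C++ profiling code): collect profiling events from every primitive executed inside the If body, not only from the primitives that produce the body's outputs.

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    duration_ms: float
    is_output: bool  # whether the primitive produces a body output

# Hypothetical events for one executed If body.
body_events = [
    Event("add_1", 3.0, False),
    Event("add_2", 3.0, False),
    Event("result", 0.5, True),
]

def collect_events_buggy(events):
    # Old behavior: only output primitives are reported,
    # so most of the body's execution time is lost.
    return [e for e in events if e.is_output]

def collect_events_fixed(events):
    # Fixed behavior: every primitive in the inner body contributes.
    return list(events)

print(sum(e.duration_ms for e in collect_events_buggy(body_events)))  # 0.5
print(sum(e.duration_ms for e in collect_events_fixed(body_events)))  # 6.5
```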

With this fix, I get the following results on the same networks:

if:

Total time:              14.464 milliseconds
Total CPU time:          0.045 milliseconds

Full device name: Intel(R) HD Graphics 520 (iGPU)

[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count:               3882 iterations
[ INFO ] Duration:            60015.44 ms
[ INFO ] Latency:
[ INFO ]    Median:           14.61 ms
[ INFO ]    Average:          15.42 ms
[ INFO ]    Min:              14.37 ms
[ INFO ]    Max:              26.68 ms
[ INFO ] Throughput:          64.68 FPS

no if:

Total time:              14.690 milliseconds
Total CPU time:          0.018 milliseconds

Full device name: Intel(R) HD Graphics 520 (iGPU)

[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count:               3909 iterations
[ INFO ] Duration:            60028.67 ms
[ INFO ] Latency:
[ INFO ]    Median:           15.11 ms
[ INFO ]    Average:          15.31 ms
[ INFO ]    Min:              14.83 ms
[ INFO ]    Max:              25.56 ms
[ INFO ] Throughput:          65.12 FPS

So total time is more or less the same.

p-durandin commented 2 months ago

@mamanain please provide results with the performance counters flag, -pc.

mlukasze commented 2 weeks ago

@mamanain - do you need any support, or are you no longer planning to finish this task?