p-durandin opened this issue 7 months ago
@p-durandin Hi, I wonder if I can work on this issue without an Intel GPU?
.take
Thank you for looking into this issue! Please let us know if you have any questions or require any help.
@awayzjj - since you've taken it, I assume you have an Intel GPU? :)
@yury-gorbachev Tomorrow I will try to find a computer with an Intel GPU, or I will give up this ticket.:)
Any Intel SoC laptop should work; it typically has an iGPU on it...
@yury-gorbachev @p-durandin @sshlyapn I ran a benchmark on a model with only one If operation, as shown below; however, I have two questions:

1. What is meant by "the reported time is almost zero" in the issue description, given that the Total time marked with underscores in the provided picture is 566.967 milliseconds?
2. Why do you think "performance counter numbers work incorrectly"? What should the correct relationship between them be? In the example you provided, they are 566.967 ms vs. 4398.12 ms, while in my example they are 0.002 ms vs. 2.79 ms.
$ ./bin/intel64/Release/benchmark_app -m ./tmp/if.xml -d GPU -pc
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.2.0-15181-7ca4f07a3c4
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.2.0-15181-7ca4f07a3c4
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(GPU) performance hint will be set to THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 3.36 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Network inputs:
[ INFO ] cond (node: cond) : boolean / [...] / []
[ INFO ] Network outputs:
[ INFO ] res (node: res) : f32 / [...] / [5]
[Step 5/11] Resizing model to match image sizes and given batch
[ WARNING ] cond: layout is not set explicitly. It is STRONGLY recommended to set layout manually to avoid further issues.
[Step 6/11] Configuring input of the model
[ WARNING ] No batch dimension was found, assuming batch to be 1. Beware: this might affect FPS calculation.
[ INFO ] Model batch size: 1
[ INFO ] Network inputs:
[ INFO ] cond (node: cond) : boolean / [...] / []
[ INFO ] Network outputs:
[ INFO ] res (node: res) : f32 / [...] / [5]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 599.08 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: test_if
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ] PERF_COUNT: YES
[ INFO ] ENABLE_CPU_PINNING: NO
[ INFO ] MODEL_PRIORITY: MEDIUM
[ INFO ] GPU_HOST_TASK_PRIORITY: MEDIUM
[ INFO ] GPU_QUEUE_PRIORITY: MEDIUM
[ INFO ] GPU_QUEUE_THROTTLE: MEDIUM
[ INFO ] GPU_ENABLE_LOOP_UNROLLING: YES
[ INFO ] GPU_DISABLE_WINOGRAD_CONVOLUTION: NO
[ INFO ] CACHE_DIR:
[ INFO ] CACHE_MODE: optimize_speed
[ INFO ] PERFORMANCE_HINT: THROUGHPUT
[ INFO ] EXECUTION_MODE_HINT: PERFORMANCE
[ INFO ] COMPILATION_NUM_THREADS: 8
[ INFO ] NUM_STREAMS: 2
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ] INFERENCE_PRECISION_HINT: f16
[ INFO ] DEVICE_ID: 0
[ INFO ] EXECUTION_DEVICES: GPU.0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given: all inputs will be filled with random values!
[ INFO ] Test Config 0
[ INFO ] cond ([...], boolean, [], static): random (binary data/numpy array is expected)
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 0.35 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counts for 0-th infer request:
Convert_2253 OPTIMIZED_OUT layerType: Convert execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
If_36 EXECUTED layerType: If execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
cond EXECUTED layerType: Parameter execType: wait_for_events__u8 realTime (ms): 0.000 cpuTime (ms): 0.002
res OPTIMIZED_OUT layerType: Convert execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
res/sink_port_0 EXECUTED layerType: Result execType: reorder_data__f32 realTime (ms): 0.002 cpuTime (ms): 0.007
Total time: 0.002 milliseconds
Total CPU time: 0.009 milliseconds
Full device name: Intel(R) UHD Graphics (iGPU)
[ INFO ] Performance counts for 1-th infer request:
Convert_2253 OPTIMIZED_OUT layerType: Convert execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
If_36 EXECUTED layerType: If execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
cond EXECUTED layerType: Parameter execType: wait_for_events__u8 realTime (ms): 0.000 cpuTime (ms): 0.002
res OPTIMIZED_OUT layerType: Convert execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
res/sink_port_0 EXECUTED layerType: Result execType: reorder_data__f32 realTime (ms): 0.002 cpuTime (ms): 0.007
Total time: 0.002 milliseconds
Total CPU time: 0.009 milliseconds
Full device name: Intel(R) UHD Graphics (iGPU)
[ INFO ] Performance counts for 2-th infer request:
Convert_2253 OPTIMIZED_OUT layerType: Convert execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
If_36 EXECUTED layerType: If execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
cond EXECUTED layerType: Parameter execType: wait_for_events__u8 realTime (ms): 0.000 cpuTime (ms): 0.002
res OPTIMIZED_OUT layerType: Convert execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
res/sink_port_0 EXECUTED layerType: Result execType: reorder_data__f32 realTime (ms): 0.002 cpuTime (ms): 0.007
Total time: 0.002 milliseconds
Total CPU time: 0.009 milliseconds
Full device name: Intel(R) UHD Graphics (iGPU)
[ INFO ] Performance counts for 3-th infer request:
Convert_2253 OPTIMIZED_OUT layerType: Convert execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
If_36 EXECUTED layerType: If execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
cond EXECUTED layerType: Parameter execType: wait_for_events__u8 realTime (ms): 0.000 cpuTime (ms): 0.002
res OPTIMIZED_OUT layerType: Convert execType: undef realTime (ms): 0.000 cpuTime (ms): 0.000
res/sink_port_0 EXECUTED layerType: Result execType: reorder_data__f32 realTime (ms): 0.002 cpuTime (ms): 0.007
Total time: 0.002 milliseconds
Total CPU time: 0.009 milliseconds
Full device name: Intel(R) UHD Graphics (iGPU)
[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count: 686692 iterations
[ INFO ] Duration: 60000.73 ms
[ INFO ] Latency:
[ INFO ] Median: 0.38 ms
[ INFO ] Average: 0.34 ms
[ INFO ] Min: 0.22 ms
[ INFO ] Max: 2.79 ms
[ INFO ] Throughput: 11444.73 FPS
I would greatly appreciate any guidance or suggestions you can offer. Thank you!
Hi @awayzjj , thank you for checking the issue!
Let me add more details. The issue is connected with an incorrect performance profiling report for the IF operation. It is expected that the IF operation aggregates the execution time of all operations contained in its inner body. However, as I can see, it reports the execution time only for the last primitive in the IF’s inner body.
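The accounting described above can be sketched in plain Python. This is a toy illustration only; the primitive names and timings are made up and do not come from the actual GPU plugin:

```python
# Toy per-primitive timings (ms) inside the If op's then/else body.
# Names and numbers are hypothetical, for illustration only.
inner_body_times = {"add_1": 2.0, "add_2": 2.5, "reorder_out": 0.3}

def if_time_buggy(times):
    # Observed behavior: only the last primitive of the inner body is
    # counted, so the If op looks almost free in the -pc report.
    return list(times.values())[-1]

def if_time_expected(times):
    # Expected behavior: the If op aggregates all inner-body primitives.
    return sum(times.values())

print(if_time_buggy(inner_body_times))     # 0.3 ms -> "almost zero"
print(if_time_expected(inner_body_times))  # ~4.8 ms
```

The gap between those two numbers is what shows up as a near-zero "Total time" in -pc versus a much larger end-to-end latency from benchmark_app.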
Regarding your questions: to debug this issue, you should use a simple model with an If operation. The model should contain at least two operations in its then/else body (the type of the operations doesn't matter here, but it's better to use bigger shapes just to make the effect more visible).
Hey @awayzjj, will you work on this? (You have many tickets assigned.)
Hello! If no one minds, I am going to assign the issue to myself.
I have managed to reproduce the issue with two graphs: one with multiple Add operations inside an If layer, and another with the same additions outside of the If layer.
I get the following results on GPU. With an If layer:
Total time: 8.329 milliseconds
Total CPU time: 0.023 milliseconds
Full device name: Intel(R) HD Graphics 520 (iGPU)
[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count: 3296 iterations
[ INFO ] Duration: 60005.03 ms
[ INFO ] Latency:
[ INFO ] Median: 18.01 ms
[ INFO ] Average: 18.20 ms
[ INFO ] Min: 17.71 ms
[ INFO ] Max: 26.25 ms
[ INFO ] Throughput: 54.93 FPS
Without an If layer:
Total time: 15.189 milliseconds
Total CPU time: 0.018 milliseconds
Full device name: Intel(R) HD Graphics 520 (iGPU)
[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count: 3811 iterations
[ INFO ] Duration: 60000.57 ms
[ INFO ] Latency:
[ INFO ] Median: 15.48 ms
[ INFO ] Average: 15.74 ms
[ INFO ] Min: 15.25 ms
[ INFO ] Max: 21.93 ms
[ INFO ] Throughput: 63.52 FPS
When I run the models on the CPU, 'Total time' is the same for both of them. Do I understand correctly that the difference in Total time demonstrates the issue?
.take
Thank you for looking into this issue! Please let us know if you have any questions or require any help.
@sshlyapn @p-durandin
Hello! I seem to have found the problem: we returned only the events corresponding to output operations, while ignoring the rest of the conditional operator's inner graph.
With this fix, I get the following results on the same networks:
if:
Total time: 14.464 milliseconds
Total CPU time: 0.045 milliseconds
Full device name: Intel(R) HD Graphics 520 (iGPU)
[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count: 3882 iterations
[ INFO ] Duration: 60015.44 ms
[ INFO ] Latency:
[ INFO ] Median: 14.61 ms
[ INFO ] Average: 15.42 ms
[ INFO ] Min: 14.37 ms
[ INFO ] Max: 26.68 ms
[ INFO ] Throughput: 64.68 FPS
no if:
Total time: 14.690 milliseconds
Total CPU time: 0.018 milliseconds
Full device name: Intel(R) HD Graphics 520 (iGPU)
[ INFO ] Execution Devices: [ GPU.0 ]
[ INFO ] Count: 3909 iterations
[ INFO ] Duration: 60028.67 ms
[ INFO ] Latency:
[ INFO ] Median: 15.11 ms
[ INFO ] Average: 15.31 ms
[ INFO ] Min: 14.83 ms
[ INFO ] Max: 25.56 ms
[ INFO ] Throughput: 65.12 FPS
So total time is more or less the same.
@mamanain please provide results with the performance counters flag, -pc
@mamanain - do you need any support, or are you not going to finish this task?
Context
Performance counter numbers work incorrectly with the If operation. For models with an If layer, the reported time is almost zero, and there is a big delta between the sum of the per-operation latencies and the end-to-end latency reported by benchmark_app.
Counter statistics are critical for performance analysis and optimization.
Below, "Total time" is reported by -pc, whereas "Latency" is the end-to-end inference latency computed by benchmark_app.
What needs to be done?
Fix the performance counters for the If operation so that adequate layer statistics are reported.
Example Pull Requests
No response
Resources
Contact points
sshlyapn, p-durandin
Ticket
No response