OpenCL kernel long time to start execution after enqueue on iGPU

arshadlab commented 1 month ago

Hi, I am experimenting with kernel execution on iGPU for solutions tailored for extremely low-latency workloads, where response times are critical down to the microsecond. While monitoring performance, I've noticed a significant delay from when kernels are queued to when they actually start executing. To isolate this issue, I implemented a basic OpenCL kernel to add two arrays of 256 bytes, queuing it with clEnqueueNDRangeKernel(). The API-level execution takes approximately 156 microseconds per call. However, using VTune's 'GPU Offload' analysis, I see that the kernel execution begins only after 131 microseconds and then runs for about 9 microseconds. This is with iGPU running at 300 Mhz frequency.

I observed that the GPU's operational frequency affects this latency significantly. At a low frequency of 300 MHz, the queue time is much longer compared to when the GPU is at its maximum frequency of 1300 MHz (queued duration drops to 33 microseconds, with a total time of 36 microseconds per call). Despite this improvement at higher frequencies, the initial delay before execution remains concerningly high at 33 microseconds, especially since there are no other compute tasks running simultaneously.

Given this scenario, I'm seeking insights into what might be causing these delays in compute execution startup. Additionally, what strategies could potentially reduce this latency, allowing the kernel to execute nearly immediately after being queued?

CPU Specs: 13th Gen Intel(R) Core(TM) i7-13700H (TDP 45W) on an Asus PN64-E OS: Ubuntu 22.04 ICD version: 23.52.28202.52-821~22.04

Thank you for any advice or insights you can provide!

Best regards, Arshad

Screenshot: iGPU running at Low Frequency 300 Mhz - (~150 us duration per call, actual kernel execution ~9 us) higher_latency_lower_frequency

iGPU running at Higher Frequency 1300 Mhz - (~37 us duration per call, actual kernel execution ~2 us) less_latency_high_frequency

compute_focus

BartusW commented 1 month ago

Arshad,

Default configuration (no GPU Frequency throttling) demonstrates that performance observations are correct. With your scenario and your hardware i7-13700H CPU and integrated GPU mentioned end-to-end timing is in expected range. VTune logs show buffer submission timing on ~37us. Scenario design, logs show that there is single clEnqueueNDRangeKernel() followed by clFinish() batching is excluded, it triggers workload submission to iGPU through the i915 KMD and hard host synchronization at clFinish call every iteration. VTune instrumentation adds ~5us overhead for CPU/GPU event view timestamps..

Reducing clock to non-default and low frequency value, performance score is affected by several factors. With low GPU clock command stream parsing is extended, same as main kernel execution from 2us to 9us in linear factor. Additionally, on ramp-down there could be timing coincidence with potential render-standby RC6 enter and exit triggered by next clEnque/clFinish call followed by compute context switch. With forced low GPU clock both RC6 and GPU context switch which operates in GPU domain would be more time expensive. Performance observation and impact is valid and expected from Compute Runtime UMD driver perspective.

To remove RC6 and batching / context switch effect which add abbreviations to your experiment, please redesign scenario with non-blocking clFlush() instead of clFinish() and disable GPU render standby RC6 in BIOS or on i915 module load time with via modprobe config file: options i915 enable_rc6=0

arshadlab commented 4 weeks ago

Hi, I am not able to set enable_rc6 to 0. Tried modifying the code but still it's enabled.

code: ./drivers/gpu/drm/i915/gt/intel_rc6.c bool enable_rc6 = true -> false;

$ cat /sys/class/drm/card0/power/rc6_enable 1

Seems like enable_rc6 option is removed since kernel 4.16 (see https://patchwork.freedesktop.org/patch/191386/).

Regards, Arshad

BartusW commented 3 weeks ago

Please use BIOS settings to disable RC6 GPU standby capabilities.

eero-t commented 2 weeks ago

What kernel i915 driver version is used? According to this, only out-of-tree driver (from Intel package repository) supports low latency submission: https://dgpu-docs.intel.com/driver/kernel-driver-types.html

intel / compute-runtime

OpenCL kernel long time to start execution after enqueue on iGPU #735