intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

[E2E] Basic/event_profiling_info.cpp seems flaky #13591

Open uditagarwal97 opened 5 months ago

uditagarwal97 commented 5 months ago

Describe the bug

Failed run: https://github.com/intel/llvm/actions/runs/8886566095/job/24401423571?pr=13588
Successful run: https://github.com/intel/llvm/actions/runs/8886566095/job/24406513670

I observed this behavior on an L0 GPU on Windows, but I'm not sure whether this flaky behavior also reproduces on Linux or on other devices.

FAIL: SYCL :: Basic/event_profiling_info.cpp (220 of 2017)
******************** TEST 'SYCL :: Basic/event_profiling_info.cpp' FAILED ********************
Exit Code: 3221226505

Command Output (stdout):
--
# RUN: at line 2
D:/github/actions-runner/_work/llvm/llvm/install/bin/clang++.exe   -fsycl -fsycl-targets=spir64 D:\github\actions-runner\_work\llvm\llvm\llvm\sycl\test-e2e\Basic\event_profiling_info.cpp -o D:\github\actions-runner\_work\llvm\llvm\build-e2e\Basic\Output\event_profiling_info.cpp.tmp.out
# executed command: D:/github/actions-runner/_work/llvm/llvm/install/bin/clang++.exe -fsycl -fsycl-targets=spir64 'D:\github\actions-runner\_work\llvm\llvm\llvm\sycl\test-e2e\Basic\event_profiling_info.cpp' -o 'D:\github\actions-runner\_work\llvm\llvm\build-e2e\Basic\Output\event_profiling_info.cpp.tmp.out'
# RUN: at line 4
env ONEAPI_DEVICE_SELECTOR=level_zero:gpu  D:\github\actions-runner\_work\llvm\llvm\build-e2e\Basic\Output\event_profiling_info.cpp.tmp.out
# executed command: env ONEAPI_DEVICE_SELECTOR=level_zero:gpu 'D:\github\actions-runner\_work\llvm\llvm\build-e2e\Basic\Output\event_profiling_info.cpp.tmp.out'
# .---command stderr------------
# | Assertion failed: Submit <= Start, file D:/github/actions-runner/_work/llvm/llvm/llvm/sycl/test-e2e/Basic/event_profiling_info.cpp, line 30
# `-----------------------------
# error: command failed with exit status: 0xc0000409

To reproduce

DPC++ commit: c2cc3a1327f668795881a7b157388ad516bdd472

Environment

OS: Windows
Device: L0 Gen12

sycl-ls --verbose

Platform [#2]:
    Version  : 1.3
    Name     : Intel(R) Level-Zero
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 1.3
        Name       : Intel(R) Iris(R) Xe Graphics
        Vendor     : Intel(R) Corporation
        Driver     : 1.3.28044
        Aspects    : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address ext_intel_gpu_eu_count ext_intel_gpu_eu_simd_width ext_intel_gpu_slices ext_intel_gpu_subslices_per_slice ext_intel_gpu_eu_count_per_subslice atomic64 ext_intel_device_info_uuid ext_intel_gpu_hw_threads_per_eu ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_limited_graph ext_oneapi_private_alloca
        info::device::sub_group_sizes: 8 16 32

Additional context

No response

steffenlarsen commented 5 months ago

Tag @againull for awareness. Could this be due to the known timing approximation issues?

aarongreig commented 4 months ago

I'm observing a similar problem with Basic/submit_time.cpp on linux/CL. I've found you need a bit of system load and a lot of runs to reproduce but it's consistently do-able within 20 or so iterations. An interesting data point would be whether this reproduces on cuda/hip.

On l0 and cl this could be explained by discrepancies between the timers used for the common DeviceAndHostTimer implementation they both share, which is used to cache the event's submit time here, and the separate mechanisms both adapters have for retrospectively querying out an event's start time (l0, cl).

uditagarwal97 commented 4 months ago

> I'm observing a similar problem with Basic/submit_time.cpp on linux/CL. I've found you need a bit of system load and a lot of runs to reproduce but it's consistently do-able within 20 or so iterations. An interesting data point would be whether this reproduces on cuda/hip.
>
> On l0 and cl this could be explained by discrepancies between the timers used for the common DeviceAndHostTimer implementation they both share, which is used to cache the event's submit time here, and the separate mechanisms both adapters have for retrospectively querying out an event's start time (l0, cl).

Yes, I observed a similar flaky failure in Basic/submit_time.cpp: https://github.com/intel/llvm/actions/runs/9406901188/job/25911860208?pr=14002