Describe the bug
This code runs a host task, then enqueues a USM memcpy once that host task has finished. The memcpy blocks the main thread until it is completed (by waiting on the event).
However, the OpenCL event for the memcpy is only created (and stored in `a`) when the memcpy is enqueued after the host event has completed. This happens on the thread running host tasks and, from the point of view of the main thread, can essentially happen at random.
This can be seen by adding instrumentation to print internal values of the SYCL event:
auto a = q.submit([&](sycl::handler &cgh) {
cgh.depends_on(host);
cgh.memcpy(h_data, d_data, 4);
});
a.inspect();
std::this_thread::sleep_for(std::chrono::seconds(1));
a.inspect();
a.wait();
Note that `MEvent`'s value is updated during the second for which the main thread is paused.
`a.wait()` appears to branch on the value of `MEvent`:
if (MEvent)
// presence of MEvent means the command has been enqueued, so no need to
// go via the slow path event waiting in the scheduler
waitInternal(Success);
else if (MCommand)
detail::Scheduler::getInstance().waitForEvent(Self, Success);
This check is not guarded by any kind of mutex, meaning that `MEvent` can change at any moment within this function, causing all kinds of shenanigans.
I've been unable to exploit this to cause any issues with a vanilla `sycl` branch build; however, it appears to be causing a failure in an e2e test in the PI->UR change, likely due to being (un)lucky with threading timings.
To reproduce
I think the following diff proves that `MEvent` is being updated while the `sycl::event` is being waited on:
diff --git a/sycl/source/detail/event_impl.cpp b/sycl/source/detail/event_impl.cpp
index 097cef03b4d6..822d07fed8b5 100644
--- a/sycl/source/detail/event_impl.cpp
+++ b/sycl/source/detail/event_impl.cpp
@@ -25,6 +25,19 @@
#include <sstream>
#endif
+
+
+std::atomic<bool> checker_mutex;
+void set_checker() {
+ checker_mutex.store(true);
+}
+void clear_checker() {
+ checker_mutex.store(false);
+}
+void check_checker() {
+ assert(!checker_mutex.load());
+}
+
namespace sycl {
inline namespace _V1 {
namespace detail {
@@ -223,6 +236,7 @@ void event_impl::instrumentationEpilog(void *TelemetryEvent,
void event_impl::wait(std::shared_ptr<sycl::detail::event_impl> Self,
bool *Success) {
+ set_checker();
if (MState == HES_Discarded)
throw sycl::exception(make_error_code(errc::invalid),
"wait method cannot be used for a discarded event.");
@@ -251,6 +265,7 @@ void event_impl::wait(std::shared_ptr<sycl::detail::event_impl> Self,
#ifdef XPTI_ENABLE_INSTRUMENTATION
instrumentationEpilog(TelemetryEvent, Name, StreamID, IId);
#endif
+ clear_checker();
}
void event_impl::wait_and_throw(
diff --git a/sycl/source/detail/memory_manager.cpp b/sycl/source/detail/memory_manager.cpp
index d5934472fbef..bbba5cb5618c 100644
--- a/sycl/source/detail/memory_manager.cpp
+++ b/sycl/source/detail/memory_manager.cpp
@@ -30,6 +30,10 @@
#include <xpti/xpti_trace_framework.hpp>
#endif
+void set_checker();
+void clear_checker();
+void check_checker();
+
namespace sycl {
inline namespace _V1 {
namespace detail {
@@ -929,6 +933,8 @@ void MemoryManager::copy_usm(const void *SrcMem, QueueImplPtr SrcQueue,
sycl::detail::pi::PiEvent *OutEvent,
const detail::EventImplPtr &OutEventImpl) {
assert(SrcQueue && "USM copy must be called with a valid device queue");
+ check_checker();
+
if (!Len) { // no-op, but ensure DepEvents will still be waited on
if (!DepEvents.empty()) {
if (OutEventImpl != nullptr)
This triggers an assert.
Environment
OS: Linux Mint 21.3 Virginia (Based on Ubuntu 22.04)
Target device and vendor: Intel GPU (Although I expect it's an issue in general)
DPC++ version: Checked on 0b60f4381a80b55021bb703d39a1036cdc6ca22c and ece363d60f2138f759c46d6aeb1457d195e372ae (sycl and an old nightly).
Dependencies version:
INFO: Output filtered by ONEAPI_DEVICE_SELECTOR environment variable, which is set to opencl:*.
To see device ids, use the --ignore-device-selectors CLI option.
[opencl:gpu] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO [24.05.28454.6]
Platforms: 1
Platform [#1]:
    Version : OpenCL 3.0
    Name    : Intel(R) OpenCL Graphics
    Vendor  : Intel(R) Corporation
    Devices : 1
        Type    : gpu
        Version : OpenCL 3.0 NEO
        Name    : Intel(R) UHD Graphics 770
        Vendor  : Intel(R) Corporation
        Driver  : 24.05.28454.6
        UUID    : 134128128701200002000000
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_srgb ext_intel_device_id ext_intel_legacy_image ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_private_alloca
        info::device::sub_group_sizes: 8 16 32
        Architecture: intel_gpu_adl_s
default_selector()     : gpu, Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO [24.05.28454.6]
accelerator_selector() : No device of requested type available.
cpu_selector()         : No device of requested type available.
gpu_selector()         : gpu, Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO [24.05.28454.6]
custom_selector(gpu)   : gpu, Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO [24.05.28454.6]
custom_selector(cpu)   : No device of requested type available.
custom_selector(acc)   : No device of requested type available.