intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
Other
1.23k stars 734 forks source link

Possible race condition when waiting for USM events dependant on host tasks #14623

Closed RossBrunton closed 1 month ago

RossBrunton commented 3 months ago

Describe the bug

Consider this source code:

  sycl::queue q{};

  float *d_data = (float *)sycl::malloc_host(64, q.get_context());
  float *h_data = (float *)sycl::malloc_host(64, q.get_context());

  for(size_t i = 0; i < 1000; i++) {
    std::this_thread::sleep_for(std::chrono::seconds(1));

    auto host = q.submit([&](sycl::handler &cgh) {
      cgh.host_task([&] {});
    });

    auto a = q.submit([&](sycl::handler &cgh) {
      cgh.depends_on(host);
      cgh.memcpy(h_data, d_data, 4);
    });

    a.wait();
  }

  sycl::free(h_data, q);
  sycl::free(d_data, q);

  return 0;

This code runs a host task, then enques a USM memcpy once that host task is finished. This memcpy blocks the main thread until it is completed (by waiting on the event).

However, the openCL event for the memcpy is only created (and stored in a) when the memcpy is enqueued after the host event is completed. This happens on the thread running host tasks and can essentially happen at random from the point of view of the main thread.

This can be seen by adding instrumentation to print internal values of the sycl event:

    auto a = q.submit([&](sycl::handler &cgh) {
      cgh.depends_on(host);
      cgh.memcpy(h_data, d_data, 4);
    });
    a.inspect();
    std::this_thread::sleep_for(std::chrono::seconds(1));
    a.inspect();

    a.wait();
Inspecting event @0x1f46830
> MEvent: 0
> MIsHostEvent: 0
> MCommand: 0x1f465c0
> MState: 0
> MPreparedDepsEvents:
> MPreparedHostDepsEvents:
>> 0x1f45f30
Inspecting event @0x1f46830
> MEvent: 0x7f1cec001040
> MIsHostEvent: 0
> MCommand: 0
> MState: 0
> MPreparedDepsEvents:
> MPreparedHostDepsEvents:
>> 0x1f45f30

Note that the MEvent's value is updated in the second that the main thread is paused.

a.wait() appears to branch on the value of MEvent:

  if (MEvent)
    // presence of MEvent means the command has been enqueued, so no need to
    // go via the slow path event waiting in the scheduler
    waitInternal(Success);
  else if (MCommand)
    detail::Scheduler::getInstance().waitForEvent(Self, Success);

This check is not guarded by any kind of mutex, meaning that MEvent could change at random in this function, causing all kinds of shenanigans.

I've been unable to exploit this to cause any issues with a vanilla sycl branch build, however it appears to be causing a failure in an e2e test in the PI->UR change. Likely due to being (un)lucky with threading timings.

To reproduce

The following diff I think proves that MEvent is being updated whilst the sycl::event is being waited on:

diff --git a/sycl/source/detail/event_impl.cpp b/sycl/source/detail/event_impl.cpp
index 097cef03b4d6..822d07fed8b5 100644
--- a/sycl/source/detail/event_impl.cpp
+++ b/sycl/source/detail/event_impl.cpp
@@ -25,6 +25,19 @@
 #include <sstream>
 #endif

+
+
+std::atomic<bool> checker_mutex;
+void set_checker() {
+  checker_mutex.store(true);
+}
+void clear_checker() {
+  checker_mutex.store(false);
+}
+void check_checker() {
+  assert(!checker_mutex.load());
+}
+
 namespace sycl {
 inline namespace _V1 {
 namespace detail {
@@ -223,6 +236,7 @@ void event_impl::instrumentationEpilog(void *TelemetryEvent,

 void event_impl::wait(std::shared_ptr<sycl::detail::event_impl> Self,
                       bool *Success) {
+  set_checker();
   if (MState == HES_Discarded)
     throw sycl::exception(make_error_code(errc::invalid),
                           "wait method cannot be used for a discarded event.");
@@ -251,6 +265,7 @@ void event_impl::wait(std::shared_ptr<sycl::detail::event_impl> Self,
 #ifdef XPTI_ENABLE_INSTRUMENTATION
   instrumentationEpilog(TelemetryEvent, Name, StreamID, IId);
 #endif
+  clear_checker();
 }

 void event_impl::wait_and_throw(
diff --git a/sycl/source/detail/memory_manager.cpp b/sycl/source/detail/memory_manager.cpp
index d5934472fbef..bbba5cb5618c 100644
--- a/sycl/source/detail/memory_manager.cpp
+++ b/sycl/source/detail/memory_manager.cpp
@@ -30,6 +30,10 @@
 #include <xpti/xpti_trace_framework.hpp>
 #endif

+void set_checker();
+void clear_checker();
+void check_checker();
+
 namespace sycl {
 inline namespace _V1 {
 namespace detail {
@@ -929,6 +933,8 @@ void MemoryManager::copy_usm(const void *SrcMem, QueueImplPtr SrcQueue,
                              sycl::detail::pi::PiEvent *OutEvent,
                              const detail::EventImplPtr &OutEventImpl) {
   assert(SrcQueue && "USM copy must be called with a valid device queue");
+  check_checker();
+
   if (!Len) { // no-op, but ensure DepEvents will still be waited on
     if (!DepEvents.empty()) {
       if (OutEventImpl != nullptr)

This triggers an assert.

Environment

[opencl:gpu] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO [24.05.28454.6]

Platforms: 1 Platform [#1]: Version : OpenCL 3.0 Name : Intel(R) OpenCL Graphics Vendor : Intel(R) Corporation Devices : 1 Type : gpu Version : OpenCL 3.0 NEO Name : Intel(R) UHD Graphics 770 Vendor : Intel(R) Corporation Driver : 24.05.28454.6 UUID : 134128128701200002000000 Num SubDevices : 0 Num SubSubDevices : 0 Aspects : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_srgb ext_intel_device_id ext_intel_legacy_image ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_private_alloca info::device::sub_group_sizes: 8 16 32 Architecture: intel_gpu_adl_s default_selector() : gpu, Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO [24.05.28454.6] accelerator_selector() : No device of requested type available. cpu_selector() : No device of requested type available. gpu_selector() : gpu, Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO [24.05.28454.6] custom_selector(gpu) : gpu, Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO [24.05.28454.6] custom_selector(cpu) : No device of requested type available. custom_selector(acc) : No device of requested type available.



### Additional context

_No response_
0x12CC commented 3 months ago

I'm able to reproduce this locally on CPU using 16e39df495a0b69b17a6610760577f3858370264.

AlexeySachkov commented 3 months ago

I wonder if #14613 is exactly about the same issue

againull commented 1 month ago

Fixed in https://github.com/intel/llvm/pull/15179