Appending a dependency event which is to be signalled from the host causes hangs

pvelesko commented 1 year ago

I create an event which (in the full application) will be signalled from a different thread. I enqueue a kernel which depends on this event, and when using immediate command lists, the enqueue command itself hangs.

Expected behavior: kernel enqueue returns but the kernel doesn't run until we HostSignal the blocking event. callbacks.zip

Using regular command lists, I don't see the host hang on the kernel enqueue

pvelesko commented 1 year ago

Driver version 23.17.26241.22, tested on Intel HD 770 and Intel(R) Arc(TM) A380 Graphics

Sarbojit2019 commented 1 year ago

@pvelesko,

I looked into the sample (callbacks.zip) attached here. It looks like there is a programming error that causes the code to deadlock, hence the hang. I have corrected the code in main.cpp, and the test now passes with an immediate command list as well.

An immediate command list tries to complete the dispatch immediately. In the attached sample the dispatch has a wait event, so the dispatch call waits for that event to complete. The call that signals the event comes after the dispatch, so the dispatch never completes, which shows up as a hang. When I moved the launch to a separate thread, the test passed. See the code change below.

  L1 // Launch kernel on the GPU
  L2 std::cout << "Launching kernel\n";

  L3 // Launching kernel using worker thread
  L4 // In case of immediate command list it is required to be on other thread else it will deadlock  
  L5 std::thread kernel_launch(kernel_func, cmdList, kernel, &dispatch, Event, 1, &HostSignalEvent);

  L6 //ZE_CHECK(zeCommandListAppendLaunchKernel(cmdList, kernel, &dispatch,
  L7 //                                             Event, 0, nullptr)); //1, &HostSignalEvent));
  L8 std::cout << "Host Signal Blocking Event\n";
  L9 ZE_CHECK(zeEventHostSignal(HostSignalEvent));   

  L10 auto begin = std::chrono::steady_clock::now();
  L11 kernel_launch.join();

At L5 I launch the kernel on another thread, where it waits for the event to be signalled. L9 signals the event, which allows kernel_launch to complete.
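The ordering problem and the worker-thread fix can be reproduced without a GPU. The following is only a standard-library sketch, not Level Zero code: `blocking_append` and `launch_from_worker_thread` are hypothetical names, a `std::promise` stands in for `HostSignalEvent`, and the append call is assumed to block until its wait event is signalled, as observed in this issue.

```cpp
#include <functional>
#include <future>
#include <thread>

// Stand-in for an append call that blocks until its wait event is signalled.
// Hypothetical helper for illustration; it does not call the driver.
void blocking_append(std::shared_future<void> wait_event, bool& kernel_ran) {
    wait_event.wait();   // blocks the calling thread on the dependency
    kernel_ran = true;   // the "kernel" body
}

bool launch_from_worker_thread() {
    std::promise<void> host_signal_event;  // plays the role of HostSignalEvent
    std::shared_future<void> wait_event = host_signal_event.get_future().share();
    bool kernel_ran = false;

    // Calling blocking_append directly here would never return, because the
    // set_value() below (the signal) could not be reached from this thread.
    std::thread kernel_launch(blocking_append, wait_event, std::ref(kernel_ran));

    host_signal_event.set_value();  // analogue of zeEventHostSignal(HostSignalEvent)
    kernel_launch.join();           // the blocked append can now finish
    return kernel_ran;              // safe to read: join() synchronizes with the worker
}
```

Calling `launch_from_worker_thread()` returns `true` once the worker has been released. Swapping the thread creation for a direct call to `blocking_append` reproduces the hang in miniature.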

Let me know if you have any further questions.

pvelesko commented 1 year ago

  cmdQueueDesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;

enumerator ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS
Device execution is scheduled and will complete in future; explicit synchronization object must be used to determine completeness

submitting things to an async command queue should not be blocking operations.

pjaaskel commented 1 year ago

https://spec.oneapi.io/level-zero/latest/core/PROG.html#low-latency-immediate-command-lists: "Commands appended into an immediate command list may execute synchronously, by blocking until the command is complete."

pjaaskel commented 1 year ago

But yep, I'd hope the "blocking possibility" doesn't include also waiting for events the command depends on. It'd cause the cumbersome need to have another thread just to construct the queue in case of the kernel command has event dependencies of any kind? Looks like a spec corner case which we can workaround (if needed) with the separate submit thread as suggested by @Sarbojit2019.

pvelesko commented 1 year ago

I assume it's blocking when a sync queue is requested

On Mon, Jul 10, 2023 at 13:48 Pekka Jääskeläinen @.***> wrote:

But yep, I'd hope the "blocking possibility" doesn't include also waiting for events the command depends on. It'd cause the cumbersome need to have another thread just to construct the queue in case of the kernel command has event dependencies of any kind? Looks like a spec corner case which we can workaround (if needed) with the separate submit thread as suggested by @Sarbojit2019.


pjaaskel commented 1 year ago

I assume that it might execute the command before returning (to achieve lowest possible latency), but not deadlock on an input event dep (based on the manual link I posted).

pvelesko commented 1 year ago

Looks like a spec corner case

The spec is ambiguous in this regard. It would make sense if

"Commands appended into an immediate command list may execute synchronously, by blocking until the command is complete."

is true when ZE_COMMAND_QUEUE_MODE_SYNCHRONOUS is set:

"Device execution always completes immediately on execute; Host thread is blocked using wait on implicit synchronization object"

@pjaaskel

which we can workaround (if needed) with the separate submit thread as suggested by

We can just use regular command lists for callbacks to get around this but I'm seeing hangs in the full application and can't reproduce in a simple reproducer. Furthermore, I'm seeing deadlocks on non-blocking operations such as zeEventQuery but I wanted to get clarity on this issue first.

pjaaskel commented 1 year ago

With ZE_COMMAND_QUEUE_MODE_SYNCHRONOUS it says the call always blocks, whereas with immediate lists it says it may block, which is a major semantic difference. But anyhow, I also wouldn't expect blocking in the case of event deps here.

pvelesko commented 1 year ago

and conversely ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS should not be blocking.

"enumerator ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS Device execution is scheduled and will complete in future; explicit synchronization object must be used to determine completeness"

pjaaskel commented 1 year ago

Yep, I'd expect "async usage" to work in this case even though it might execute the commands in the same thread (sometimes).

jandres742 commented 1 year ago

hi

I checked the test and it is valid. The test submits a kernel for immediate submission in an asynchronous way. This means that upon returning from appendLaunchKernel, the kernel has been submitted and will then asynchronously wait for the event, but the host thread calling appendLaunchKernel shouldn't block.

Backtrace on GPU MAX 1550 shows we are stuck on the submission path:

#0  0x00007ffff7b3fcab in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007ffff6a5a4cc in __gthread_yield () at /usr/include/x86_64-linux-gnu/c++/10/bits/gthr-default.h:693
#2  std::this_thread::yield () at /usr/include/c++/10/thread:379
#3  NEO::WaitUtils::waitFunctionWithPredicate<unsigned long>(unsigned long const volatile*, unsigned long, std::function<bool (unsigned long, unsigned long)>) (predicate=..., expectedValue=1, pollAddress=0xfffffff7fc5000)
    at ../shared/source/utilities/wait_util.h:33
#4  NEO::WaitUtils::waitFunction (expectedValue=1, pollAddress=0xfffffff7fc5000) at ../shared/source/utilities/wait_util.h:38
#5  NEO::CommandStreamReceiver::baseWaitFunction (this=0x55555624d6d0, pollAddress=0xfffffff7fc5000, params=..., taskCountToWait=1) at ../shared/source/command_stream/command_stream_receiver.cpp:445
#6  0x00007ffff6762def in L0::CommandListCoreFamilyImmediate<(GFXCORE_FAMILY)3080>::executeCommandListImmediateWithFlushTaskImpl (this=this@entry=0x55555627b5f0, performMigration=<optimized out>, performMigration@entry=true,
    hasStallingCmds=hasStallingCmds@entry=true, hasRelaxedOrderingDependencies=hasRelaxedOrderingDependencies@entry=false, cmdQ=<optimized out>) at ../level_zero/core/source/cmdlist/cmdlist_hw_immediate.inl:350
#7  0x00007ffff6762ffa in L0::CommandListCoreFamilyImmediate<(GFXCORE_FAMILY)3080>::executeCommandListImmediateWithFlushTask (this=this@entry=0x55555627b5f0, performMigration=performMigration@entry=true, hasStallingCmds=hasStallingCmds@entry=true,
    hasRelaxedOrderingDependencies=hasRelaxedOrderingDependencies@entry=false) at ../level_zero/core/source/cmdlist/cmdlist_hw_immediate.inl:283
#8  0x00007ffff6763077 in L0::CommandListCoreFamilyImmediate<(GFXCORE_FAMILY)3080>::flushImmediate (this=0x55555627b5f0, inputRet=<optimized out>, performMigration=<optimized out>, hasStallingCmds=<optimized out>,
    hasRelaxedOrderingDependencies=<optimized out>, hSignalEvent=0x555556284410) at ../level_zero/core/source/cmdlist/cmdlist_hw_immediate.inl:844
#9  0x00007ffff66a61ce in L0::zeCommandListAppendLaunchKernel (hCommandList=<optimized out>, kernelHandle=<optimized out>, launchKernelArgs=<optimized out>, hSignalEvent=<optimized out>, numWaitEvents=<optimized out>, phWaitEvents=<optimized out>)
    at ../level_zero/core/source/cmdlist/cmdlist.h:188
#10 0x00007ffff7eea188 in zeCommandListAppendLaunchKernel () from /usr/local/lib/libze_loader.so.1
#11 0x000055555555c3b8 in main (argc=1, argv=0x7fffffffe288) at /home/gta/bin/callbacks/main.cpp:224

The call to waitFunction seems suspicious. If the list is asynchronous, L0 driver shouldn't be waiting.

Test passes with EnableFlushTaskSubmission=0, so it seems we have a bug on that path.

@JablonskiMateusz : Please let me know if an internal tracker has been created for this issue or I can do it.

pvelesko commented 12 months ago

@JablonskiMateusz @jandres742 EnableFlushTaskSubmission=0 does not seem to make any difference. Is this supposed to work on public oneapi releases?

jandres742 commented 12 months ago

@pvelesko : that's a debug key, which needs to be set alongside NEOReadDebugKeys:

export NEOReadDebugKeys=1
export EnableFlushTaskSubmission=0

pvelesko commented 12 months ago

Is there any danger in always having these set? @jandres742

zzdanowicz commented 12 months ago

I think the reason is that the kernel has printf(). Can you remove it and try it again?

zzdanowicz commented 12 months ago

Is there any danger in always having these set? @jandres742

@pvelesko - Yes. Applications should not set debug keys, as using them is only for development and debug purposes.

jandres742 commented 12 months ago

thanks @zzdanowicz . You are correct. Once the printf is removed from KernelGPU.cl, the test completes:

$ ./driver
Using immediate command list
Device   : Intel(R) Data Center GPU Max 1550
Type     : GPU
Vendor ID: 8086
#Queue Groups: 3
Group X: 1024
Group Y: 1
Launching kernel
Host Signal Blocking Event
Kernel Event Query: 1298518
GPU Kernel = 104273027 [ns]
SEQ Kernel = 5406017312 [ns]
Speedup = 51x

Matrix Multiply validation PASSED

@pvelesko could you confirm on your side? Basically, we added a workaround (WA) where kernels with printfs are executed synchronously, to make sure that printfs are printed in order when multiple kernels are submitted to an immediate command list. This is just a temporary WA while we work on a more permanent solution, but as long as there are no printfs, the code won't hang.
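To make the effect of that workaround concrete, here is a toy model in standard C++. It is an assumption-laden sketch and not the driver's implementation: `ToyKernel`, `toy_append`, and the helper functions are hypothetical names, and a `std::shared_future` stands in for the wait event. A kernel flagged as using printf is completed before the append returns (so the caller also waits on the dependency), while any other kernel is handed to a background thread.

```cpp
#include <chrono>
#include <future>

struct ToyKernel { bool uses_printf; };  // hypothetical, for illustration only

// Toy append: printf kernels finish before returning, others run in the background.
std::future<void> toy_append(ToyKernel kernel, std::shared_future<void> wait_event) {
    auto work = [wait_event] { wait_event.wait(); /* "kernel" would run here */ };
    if (kernel.uses_printf) {
        work();                      // synchronous path: the caller waits on the event
        std::promise<void> finished;
        finished.set_value();
        return finished.get_future();
    }
    return std::async(std::launch::async, work);  // asynchronous path
}

bool is_ready(const std::future<void>& f) {
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}

// Without printf: append returns while the event is still unsignalled.
bool non_printf_returns_before_signal() {
    std::promise<void> host_signal_event;
    auto done = toy_append({false}, host_signal_event.get_future().share());
    bool pending_after_append = !is_ready(done);
    host_signal_event.set_value();   // analogue of zeEventHostSignal
    done.wait();
    return pending_after_append && is_ready(done);
}

// With printf: the event must already be signalled, otherwise the append never returns.
bool printf_completes_before_return() {
    std::promise<void> host_signal_event;
    host_signal_event.set_value();   // signal first to avoid the hang
    auto done = toy_append({true}, host_signal_event.get_future().share());
    return is_ready(done);
}
```

In this model, appending a printf kernel before signalling the event from the same thread blocks forever, which matches the hang reported at the top of the thread.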

pvelesko commented 12 months ago

@pvelesko could you confirm on your side?

@jandres742 removing the printf no longer blocks the enqueue.

This is just a temporary WA while we work on a more permanent solution, but as long as there are no printfs, the code won't hang

Any ETA on full solution?

In the meantime,

Yes. Applications should not set debug keys, as using them is only for development and debug purposes.

what side effects can I expect from using these?

jandres742 commented 12 months ago

Thanks @pvelesko for confirming. We have an internal PR where we are removing that WA. The ETA for that PR to be merged is 1 week, I'd say, and then it should be available in an upcoming release, provided we don't find any regressions.

The debug keys are only for experimentation, so the suggestion above about using them was just to confirm you were seeing the same behavior as us. However, these debug keys are not fully validated and not recommended for use in productization scripts or releases.

In that sense, current disclaimer regarding this would be something like:

“Known issue: Kernels with printf submitted to an immediate command list are executed synchronously to ensure proper ordering of the strings printed. This is to be fixed with a long-term solution in the L0 driver.”

zzdanowicz commented 12 months ago

what side effects can I expect from using these?

You cannot have any quality expectations. Applications may fail, get data errors, crashes or hangs.

intel / compute-runtime

Appending a dependency event which is to be signalled from the host causes hangs #658