Closed: qnixsynapse closed this issue 3 months ago
I'm experiencing a very similar issue with Blender. The entire KDE session locks up, and dmesg
reports:
Fence expiration time out i915-0000:44:00.0:blender[11880]:2!
If I run SYCL_PI_TRACE=-1 blender --debug-cycles
only Blender hangs instead of the entire UI. The last call reported by the PI trace is:
---> piQueueFinish(
<unknown> : 0x72611e81de00
Distro: archlinux
GPU: Intel Arc770 16GB
Installed from packages at versions:
level-zero-loader 1.16.15-1
intel-compute-runtime 24.13.29138.7-1
kernel 6.8.9-arch1-1
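For anyone collecting these reports, the dmesg line quoted above can be split into its parts with a small script. This is only a sketch based on the single line shown in this comment: the regular expression, the function name `parse_fence_timeout`, and the field names are my own assumptions, and the meaning of the trailing number is not stated anywhere in this thread, so it is kept as an opaque `tail` value.

```python
import re
from typing import Optional

# Pattern inferred from the one dmesg line quoted in this thread (assumption:
# other kernels may format the message differently).
FENCE_TIMEOUT = re.compile(
    r"Fence expiration time out i915-"
    r"(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
    r":(?P<process>.+)\[(?P<pid>\d+)\]"
    r":(?P<tail>\d+)!"
)


def parse_fence_timeout(line: str) -> Optional[dict]:
    """Return the PCI address, process name, PID and trailing number, or None."""
    match = FENCE_TIMEOUT.search(line)
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "process": match.group("process"),
        "pid": int(match.group("pid")),
        "tail": int(match.group("tail")),  # meaning not documented in this thread
    }


if __name__ == "__main__":
    sample = "Fence expiration time out i915-0000:44:00.0:blender[11880]:2!"
    print(parse_fence_timeout(sample))
```

Feeding it the output of `dmesg` line by line would show which process and which GPU (by PCI address) hit the timeout.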
Mine hangs right here:
---> piextUSMEnqueueMemset(
<unknown> : 0x4525e80
<unknown> : 0xffffd5579d000000
<unknown> : 0
<unknown> : 67108864
<unknown> : 0
pi_event * : 0[ nullptr ]
pi_event * : 0x454a208[ 0 ... ]
UR ---> TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, UseCopyEngine)
UR <--- TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, UseCopyEngine)(UR_RESULT_SUCCESS)
UR ---> Queue->Context->getAvailableCommandList(Queue, CommandList, UseCopyEngine, OkToBatch)
UR ---> Queue->insertStartBarrierIfDiscardEventsMode(CommandList)
UR <--- Queue->insertStartBarrierIfDiscardEventsMode(CommandList)(UR_RESULT_SUCCESS)
UR <--- Queue->Context->getAvailableCommandList(Queue, CommandList, UseCopyEngine, OkToBatch)(UR_RESULT_SUCCESS)
UR ---> createEventAndAssociateQueue(Queue, Event, CommandType, CommandList, IsInternal)
UR ---> EventCreate(Queue->Context, Queue, HostVisible.value(), Event)
UR <--- EventCreate(Queue->Context, Queue, HostVisible.value(), Event)(UR_RESULT_SUCCESS)
UR ---> urEventRetain(*Event)
UR <--- urEventRetain(*Event)(UR_RESULT_SUCCESS)
UR <--- createEventAndAssociateQueue(Queue, Event, CommandType, CommandList, IsInternal)(UR_RESULT_SUCCESS)
UR ---> Queue->executeCommandList(CommandList, false, OkToBatch)
UR <--- Queue->executeCommandList(CommandList, false, OkToBatch)(UR_RESULT_SUCCESS)
) ---> pi_result : PI_SUCCESS
[out]void * : 0xffffd5579d000000
[out]pi_event * : 0[ nullptr ]
[out]pi_event * : 0x454a208[ 0x4548590 ... ]
---> piEventRelease(
pi_event : 0x4548a00
UR ---> urEventReleaseInternal(Event)
UR ---> urQueueReleaseInternal(Queue)
UR <--- urQueueReleaseInternal(Queue)(UR_RESULT_SUCCESS)
UR <--- urEventReleaseInternal(Event)(UR_RESULT_SUCCESS)
) ---> pi_result : PI_SUCCESS
---> piEventsWait(
<unknown> : 1
pi_event * : 0x454a208[ 0x4548590 ... ]
UR ---> UrQueue->executeAllOpenCommandLists()
UR <--- UrQueue->executeAllOpenCommandLists()(UR_RESULT_SUCCESS)
Not sure what that means, since I am not an expert in low-level GPU programming.
Edit: Okay, I ended up asking an LLM (Llama 3). According to it, the piEventsWait call at the end of the trace never completes, and together with the fence expiration timeout this suggests that GPU-CPU synchronization is not working as expected.
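A quick way to find the hang point without reading the whole trace by hand: look for the last PI call that was opened but never got its `) ---> pi_result` line. The sketch below assumes the trace format seen in this thread (`---> piName(` to open a call, `) ---> pi_result : ...` to close it); the function name `find_unfinished_call` is hypothetical, and other runtime versions may print the trace differently.

```python
import re
from typing import Iterable, Optional

# Assumed trace format, taken from the SYCL_PI_TRACE output pasted in this thread.
CALL_OPEN = re.compile(r"^---> (pi\w+)\($")
CALL_CLOSE = re.compile(r"^\) ---> pi_result : \w+")


def find_unfinished_call(trace_lines: Iterable[str]) -> Optional[str]:
    """Return the name of the PI call that was opened last and never returned."""
    pending = None
    for raw in trace_lines:
        line = raw.strip()
        opened = CALL_OPEN.match(line)
        if opened:
            pending = opened.group(1)  # a new PI call has started
        elif CALL_CLOSE.match(line):
            pending = None             # the current PI call returned
    return pending


if __name__ == "__main__":
    tail_of_trace = [
        "---> piEventRelease(",
        "pi_event : 0x4548a00",
        ") ---> pi_result : PI_SUCCESS",
        "---> piEventsWait(",
        "<unknown> : 1",
        "UR ---> UrQueue->executeAllOpenCommandLists()",
        "UR <--- UrQueue->executeAllOpenCommandLists()(UR_RESULT_SUCCESS)",
    ]
    print(find_unfinished_call(tail_of_trace))
```

On the trace above this reports piEventsWait, and on the earlier Blender trace it would report piQueueFinish, since both are calls that block until the GPU signals completion.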
I found the corresponding issue on blender side: https://projects.blender.org/blender/blender/issues/120800
And found this as well: https://github.com/intel/compute-runtime/issues/726
Okay, it is a kernel module bug introduced in 6.6.26 and above.
These issues should be closed. The fix will be available within the next week.
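To check whether a machine is at or past the version named above, something like the following could be used. It is a rough sketch: it only tests the lower bound mentioned in this thread (6.6.26), it does not know which release carries the fix, and a plain tuple comparison ignores how stable branches receive backports, so a result of True means "possibly affected", nothing more. The names `FIRST_AFFECTED`, `parse_kernel_release`, and `at_or_after_first_affected` are made up for this example.

```python
import platform
import re

# Lower bound as stated in this thread; the fixed version is not known here.
FIRST_AFFECTED = (6, 6, 26)


def parse_kernel_release(release: str) -> tuple:
    """Turn a release string such as '6.8.9-arch1-1' into (6, 8, 9)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def at_or_after_first_affected(release: str) -> bool:
    """True if the release is 6.6.26 or newer (possibly affected, not proof)."""
    return parse_kernel_release(release) >= FIRST_AFFECTED


if __name__ == "__main__":
    running = platform.release()
    print(running, at_or_after_first_affected(running))
```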
Compute hangs with no error in the front-facing application using Level Zero. In the system logs, however, I can see this error:
I suspect that the problem might be related to this upstream commit.
I can also see that the number of physical "unknown" engines in intel_gpu_top has been reduced from 4 to 1. Not sure where to open the issue, so I am opening it here first.