oneapi-src/level-zero

oneAPI Level Zero Specification Headers and Loader
https://spec.oneapi.com/versions/latest/elements/l0/source/index.html

Intel Level Zero compute is broken for my Intel Arc A750 LE with Linux kernel version 6.6.27 and above #147

Closed: qnixsynapse closed this issue 3 months ago

qnixsynapse commented 3 months ago

Compute hangs with no error shown in the front-facing application using Level Zero. In the system logs, however, I can see this error:

kernel: Fence expiration time out i915-0000:03:00.0:server[5816]:14c!

I suspect the problem might be related to this upstream commit.

I can also see that the number of physical "unknown" engines in intel_gpu_top has dropped from 4 to 1. I am not sure where to open this issue, so I am opening it here first.
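
For anyone who wants to cross-check from the API side, here is a minimal sketch (my own, not from this issue) that asks Level Zero for the device's command queue groups; the numQueues values it prints should mirror the engine-count change visible in intel_gpu_top. Build against this repo's headers and link with -lze_loader.

#include <level_zero/ze_api.h>
#include <cstdio>
#include <vector>

int main() {
    // Initialize the loader for GPU drivers only.
    if (zeInit(ZE_INIT_FLAG_GPU_ONLY) != ZE_RESULT_SUCCESS)
        return 1;

    // Grab the first driver and its first device (enough for a single-GPU box).
    uint32_t count = 1;
    ze_driver_handle_t driver = nullptr;
    zeDriverGet(&count, &driver);

    count = 1;
    ze_device_handle_t device = nullptr;
    zeDeviceGet(driver, &count, &device);

    // First call queries how many queue groups exist, second fills them in.
    uint32_t groupCount = 0;
    zeDeviceGetCommandQueueGroupProperties(device, &groupCount, nullptr);
    std::vector<ze_command_queue_group_properties_t> groups(
        groupCount, {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_GROUP_PROPERTIES});
    zeDeviceGetCommandQueueGroupProperties(device, &groupCount, groups.data());

    for (uint32_t i = 0; i < groupCount; ++i)
        std::printf("group %u: flags=0x%x numQueues=%u\n",
                    i, groups[i].flags, groups[i].numQueues);
    return 0;
}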

portaloffreedom commented 3 months ago

I'm experiencing a very similar issue with Blender. The entire KDE session locks up, and dmesg reports

Fence expiration time out i915-0000:44:00.0:blender[11880]:2!

If I run SYCL_PI_TRACE=-1 blender --debug-cycles, only Blender hangs instead of the entire UI; the last command reported by PI_TRACE is

---> piQueueFinish(
        <unknown> : 0x72611e81de00

Distro: Arch Linux
GPU: Intel Arc A770 16GB
Installed from packages at versions: level-zero-loader 1.16.15-1, intel-compute-runtime 24.13.29138.7-1, kernel 6.8.9-arch1-1
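
For reference, here is a minimal sketch (my reconstruction, not Blender's actual code) of the path the trace ends on: sycl::queue::wait() is what the PI trace reports as piQueueFinish.

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::gpu_selector_v};

    // Submit a trivial kernel so there is pending work to wait on.
    q.single_task([] {});

    // Blocks the host until all submitted work completes; this maps to the
    // piQueueFinish call in the trace and is where the hang shows up.
    q.wait();
    return 0;
}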

qnixsynapse commented 3 months ago

Mine hangs right here:

---> piextUSMEnqueueMemset(
    <unknown> : 0x4525e80
    <unknown> : 0xffffd5579d000000
    <unknown> : 0
    <unknown> : 67108864
    <unknown> : 0
    pi_event * : 0[ nullptr ]
    pi_event * : 0x454a208[ 0 ... ]
UR ---> TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, UseCopyEngine)
UR <--- TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, UseCopyEngine)(UR_RESULT_SUCCESS)
UR ---> Queue->Context->getAvailableCommandList(Queue, CommandList, UseCopyEngine, OkToBatch)
UR ---> Queue->insertStartBarrierIfDiscardEventsMode(CommandList)
UR <--- Queue->insertStartBarrierIfDiscardEventsMode(CommandList)(UR_RESULT_SUCCESS)
UR <--- Queue->Context->getAvailableCommandList(Queue, CommandList, UseCopyEngine, OkToBatch)(UR_RESULT_SUCCESS)
UR ---> createEventAndAssociateQueue(Queue, Event, CommandType, CommandList, IsInternal)
UR ---> EventCreate(Queue->Context, Queue, HostVisible.value(), Event)
UR <--- EventCreate(Queue->Context, Queue, HostVisible.value(), Event)(UR_RESULT_SUCCESS)
UR ---> urEventRetain(*Event)
UR <--- urEventRetain(*Event)(UR_RESULT_SUCCESS)
UR <--- createEventAndAssociateQueue(Queue, Event, CommandType, CommandList, IsInternal)(UR_RESULT_SUCCESS)
UR ---> Queue->executeCommandList(CommandList, false, OkToBatch)
UR <--- Queue->executeCommandList(CommandList, false, OkToBatch)(UR_RESULT_SUCCESS)
) --->  pi_result : PI_SUCCESS
    [out]void * : 0xffffd5579d000000
    [out]pi_event * : 0[ nullptr ]
    [out]pi_event * : 0x454a208[ 0x4548590 ... ]

---> piEventRelease(
    pi_event : 0x4548a00
UR ---> urEventReleaseInternal(Event)
UR ---> urQueueReleaseInternal(Queue)
UR <--- urQueueReleaseInternal(Queue)(UR_RESULT_SUCCESS)
UR <--- urEventReleaseInternal(Event)(UR_RESULT_SUCCESS)
) --->  pi_result : PI_SUCCESS

---> piEventsWait(
    <unknown> : 1
    pi_event * : 0x454a208[ 0x4548590 ... ]
UR ---> UrQueue->executeAllOpenCommandLists()
UR <--- UrQueue->executeAllOpenCommandLists()(UR_RESULT_SUCCESS)

Not sure what that means, since I am not an expert in low-level GPU programming.

Edit: Okay, I ended up asking an LLM. At the end of the trace, piEventsWait never completes, and the fence expiration timeout suggests that GPU-CPU synchronization is not working as expected (according to Llama 3).
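
For context, here is a rough SYCL equivalent of the traced sequence (my reconstruction, not the application's actual code): a 64 MiB device-USM memset followed by a wait on its event. q.memset lowers to piextUSMEnqueueMemset with value 0 and size 67108864, matching the trace, and waiting on the returned event is what piEventsWait blocks on.

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::gpu_selector_v};

    // 67108864 bytes = 64 MiB, the size shown in the trace.
    constexpr size_t size = 67108864;
    void *ptr = sycl::malloc_device(size, q);

    // Enqueue the fill; this is the piextUSMEnqueueMemset call in the trace.
    sycl::event e = q.memset(ptr, 0, size);

    // piEventsWait: on an affected kernel this never returns, and dmesg
    // later reports the "Fence expiration time out" message instead.
    e.wait();

    sycl::free(ptr, q);
    return 0;
}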

portaloffreedom commented 3 months ago

I found the corresponding issue on the Blender side: https://projects.blender.org/blender/blender/issues/120800

portaloffreedom commented 3 months ago

And I found this as well: https://github.com/intel/compute-runtime/issues/726

qnixsynapse commented 3 months ago

Okay, it is a kernel module bug affecting 6.6.26 and above.

These issues should be closed. The fix will be available within the next week.