Flex 140 hangs and reset fails

flumm commented 6 months ago

Hi,

I have a weird stability issue and wanted to ask whether it looks like a software or a hardware issue, and how/if we can fix it.

I started a Windows 11 VM under QEMU/KVM with a VF of a Flex 140. That alone worked fine: the drivers installed OK in the guest, and Device Manager and Task Manager reported everything as OK.

I could start the Heaven Unigine benchmark, which showed ~140 FPS on low settings (1280x720). After some time, though, it dropped to 1 FPS, while Task Manager still showed 100% utilization.

I played around with rebooting and with disabling/enabling the device in Device Manager, but I got the following in the host's dmesg:

May 02 12:09:43 server kernel: i915 0000:33:00.0: [drm] *ERROR* GT0: GUC: Engine reset request failed on 0:0 (rcs0) because 0x0, GDRST = 0x00000000
May 02 12:09:48 server kernel: i915 0000:33:00.0: GPU HANG: ecode 12:0:00000000
May 02 12:09:48 server kernel: i915 0000:33:00.0: [drm] Resetting chip for GuC failed to reset rcs0 (reason=0x00000000)
May 02 12:09:49 server kernel: i915 0000:33:00.0: [drm] *ERROR* GT0 [ENGINE OTHER] rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
May 02 12:09:49 server kernel: i915 0000:33:00.0: [drm] *ERROR* GT0 [ENGINE OTHER] rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
May 02 12:09:49 server kernel: i915 0000:33:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.19.2.bin version 70.19.2
May 02 12:09:50 server kernel: i915 0000:33:00.0: [drm] GT0: GUC: excessive init time: 899ms! [freq = 100MHz, before = 100MHz, status = 0x8002F034, count = 0, ret = 0]
May 02 12:09:50 server kernel: i915 0000:33:00.0: [drm] GT0: GUC: submission enabled
May 02 12:09:50 server kernel: i915 0000:33:00.0: [drm] GT0: GUC: SLPC enabled

When trying to remove the virtual functions via sysfs and unload the driver, I got:

May 02 12:14:37 server kernel: [drm:wait_for_ct_request_update [i915]] *ERROR* CT: fence 168 err -62
May 02 12:14:37 server kernel: i915 0000:30:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: No response for request 0x5503 (fence 168)
May 02 12:14:37 server kernel: i915 0000:30:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: Sending action 0x5503 failed (-ETIME)
May 02 12:14:37 server kernel: i915 0000:30:00.0: [drm] IOV0: Failed to push configurations (-ESTALE)
May 02 12:14:38 server kernel: i915 0000:30:00.0: [drm] tlb invalidation response timed out for seqno 3
May 02 12:14:39 server kernel: i915 0000:30:00.0: [drm] tlb invalidation response timed out for seqno 4
May 02 12:14:40 server kernel: i915 0000:30:00.0: [drm] tlb invalidation response timed out for seqno 5
May 02 12:14:41 server kernel: i915 0000:30:00.0: [drm] tlb invalidation response timed out for seqno 6
May 02 12:14:42 server kernel: i915 0000:30:00.0: [drm] tlb invalidation response timed out for seqno 7
May 02 12:14:43 server kernel: i915 0000:30:00.0: [drm] tlb invalidation response timed out for seqno 8
May 02 12:14:44 server kernel: i915 0000:30:00.0: [drm] tlb invalidation response timed out for seqno 9
May 02 12:14:44 server kernel: i915 0000:30:00.0: Disabled 2 VFs
May 02 12:14:47 server kernel: [drm:wait_for_ct_request_update [i915]] *ERROR* CT: fence 169 err -62
May 02 12:14:47 server kernel: i915 0000:30:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: No response for request 0x5506 (fence 169)
May 02 12:14:47 server kernel: i915 0000:30:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: Sending action 0x5506 failed (-ETIME)
May 02 12:14:47 server kernel: i915 0000:30:00.0: [drm] IOV0: Failed to start FLR for VF2 (-ETIME)
May 02 12:14:58 server kernel: i915 0000:33:00.0: [drm] IOV0: VF1 FLR didn't complete within 500 ms
May 02 12:14:58 server kernel: i915 0000:33:00.0: [drm] IOV0: VF2 FLR didn't complete within 250 ms
May 02 12:15:02 server kernel: i915 0000:30:00.0: GPU HANG: ecode 12:0:00000000
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] Resetting chip for stopped heartbeat on bcs0
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.19.2.bin version 70.19.2
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] GT0: GUC: load failed: status = 0x40000056, time = 0ms, freq = 1900MHz, ret = 0
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] GT0: GUC: load failed: status: Reset = 0, BootROM = 0x2B, UKernel = 0x00, MIA = 0x00, Auth = 0x01
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] GT0: GUC: firmware production part check failure
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] GT0: GUC: load failed: status = 0x40000056, time = 0ms, freq = 1900MHz, ret = 0
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] GT0: GUC: load failed: status: Reset = 0, BootROM = 0x2B, UKernel = 0x00, MIA = 0x00, Auth = 0x01
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] GT0: GUC: firmware production part check failure
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] *ERROR* GT0: GuC initialization failed -ENOEXEC
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] *ERROR* GT0: Enabling uc failed (-5)
May 02 12:15:02 server kernel: i915 0000:30:00.0: [drm] *ERROR* GT0 [GT OTHER] Failed to initialise HW following reset (-5)
May 02 12:15:07 server kernel: [drm:wait_for_ct_request_update [i915]] *ERROR* CT: fence 235 err -62
May 02 12:15:07 server kernel: i915 0000:33:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: No response for request 0x5506 (fence 235)
May 02 12:15:07 server kernel: i915 0000:33:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: Sending action 0x5506 failed (-ETIME)
May 02 12:15:07 server kernel: i915 0000:33:00.0: [drm] IOV0: Failed to start FLR for VF1 (-ETIME)
May 02 12:15:18 server kernel: [drm:wait_for_ct_request_update [i915]] *ERROR* CT: fence 236 err -62
May 02 12:15:18 server kernel: i915 0000:33:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: No response for request 0x5503 (fence 236)
May 02 12:15:18 server kernel: i915 0000:33:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: Sending action 0x5503 failed (-ETIME)
May 02 12:15:18 server kernel: i915 0000:33:00.0: [drm] IOV0: Failed to push configurations (-ESTALE)
May 02 12:15:19 server kernel: i915 0000:33:00.0: [drm] tlb invalidation response timed out for seqno 3
May 02 12:15:20 server kernel: i915 0000:33:00.0: [drm] tlb invalidation response timed out for seqno 4
May 02 12:15:21 server kernel: i915 0000:33:00.0: [drm] tlb invalidation response timed out for seqno 5
May 02 12:15:22 server kernel: i915 0000:33:00.0: [drm] tlb invalidation response timed out for seqno 6
May 02 12:15:23 server kernel: i915 0000:33:00.0: [drm] tlb invalidation response timed out for seqno 7
May 02 12:15:24 server kernel: i915 0000:33:00.0: [drm] tlb invalidation response timed out for seqno 8
May 02 12:15:25 server kernel: i915 0000:33:00.0: [drm] tlb invalidation response timed out for seqno 9
May 02 12:15:25 server kernel: i915 0000:33:00.0: Disabled 2 VFs
May 02 12:15:28 server kernel: [drm:wait_for_ct_request_update [i915]] *ERROR* CT: fence 237 err -62
May 02 12:15:28 server kernel: i915 0000:33:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: No response for request 0x5506 (fence 237)
May 02 12:15:28 server kernel: i915 0000:33:00.0: [drm] *ERROR* GT0 [GUC COMMUNICATION] CT: Sending action 0x5506 failed (-ETIME)
May 02 12:15:28 server kernel: i915 0000:33:00.0: [drm] IOV0: Failed to start FLR for VF2 (-ETIME)
May 02 12:16:19 server kernel: Deleting MTD partitions on "i915.spi.12288":
May 02 12:16:19 server kernel: Deleting i915.spi.12288.DESCRIPTOR MTD partition
May 02 12:16:19 server kernel: Deleting i915.spi.12288.GSC MTD partition
May 02 12:16:19 server kernel: Deleting i915.spi.12288.OptionROM MTD partition
May 02 12:16:19 server kernel: Deleting i915.spi.12288.DAM MTD partition
May 02 12:16:24 server kernel: Deleting MTD partitions on "i915.spi.13056":
May 02 12:16:24 server kernel: Deleting i915.spi.13056.DESCRIPTOR MTD partition
May 02 12:16:24 server kernel: Deleting i915.spi.13056.GSC MTD partition
May 02 12:16:24 server kernel: Deleting i915.spi.13056.OptionROM MTD partition
May 02 12:16:24 server kernel: Deleting i915.spi.13056.DAM MTD partition
May 02 12:16:28 server kernel: COMPAT BACKPORTED EXIT

Any idea what could cause that?
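
As a triage aid for logs like the ones above, here is a minimal Python sketch that counts a few recurring i915 messages per PCI address, so it is easier to see which of the two addresses in the log (0000:30:00.0 and 0000:33:00.0) reports which failure. The category labels and the `summarize` helper are hypothetical names chosen for illustration; the matched substrings are copied from the messages above.

```python
import re
from collections import Counter, defaultdict

# Matches the "i915 0000:33:00.0:" prefix seen in the log lines above.
PCI_RE = re.compile(r"i915 ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):")

# Hypothetical labels mapped to substrings copied from the posted log.
PATTERNS = {
    "gpu_hang": "GPU HANG",
    "ct_send_failed": "CT: Sending action",
    "tlb_timeout": "tlb invalidation response timed out",
    "flr_start_failed": "Failed to start FLR",
    "guc_load_failed": "GUC: load failed",
}


def summarize(lines):
    """Return {pci_address: Counter({label: count})} for the given log lines."""
    summary = defaultdict(Counter)
    for line in lines:
        match = PCI_RE.search(line)
        if match is None:
            continue  # line without an i915 PCI prefix, e.g. MTD messages
        for label, needle in PATTERNS.items():
            if needle in line:
                summary[match.group(1)][label] += 1
    return summary
```

Feeding it the lines of a saved kernel log (for example `summarize(open("dmesg.txt"))`, where the file name is an assumption) gives one counter per device.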

lleo19 commented 6 months ago

Hi Flumm, I am testing a Flex 140 as well: PVE 8.2 with kernel 6.5.3-13-pve, and a Windows 10 VM with the latest driver. The Heaven Unigine benchmark runs OK. I had stability issues until I updated the firmware on the Flex 140. If you are like me and never used the GPU in a stand-alone setup, the firmware may be the original from the factory. IIRC Intel GPUs are updated only when the user drivers are installed.

flumm commented 6 months ago

How did you update the firmware exactly? Does that only work with Windows drivers? Is there any way to see which firmware is on there?

Any help appreciated, since the docs I could find are rather sparse ;)

lleo19 commented 6 months ago

Yes, they are... Get the IGSC utility from https://github.com/intel/igsc (I built it under PVE). Then get the latest Windows driver and unpack it under Windows (i.e. start the install...). In your temp directory, find the firmware and opcode folders and the firmware files in them. Transfer those to your Flex system and update with the igsc utility there. There is an additional file for the opcode-data, but that is not required. This can only be sourced from Intel or your OEM vendor.

lleo19 commented 6 months ago

forgot to mention that the igsc utility does allow you to check what version you have and to downgrade, but does not allow you to back up your current firmware...

lleo19 commented 6 months ago

Of course, you can 'simply' install Windows bare-metal on the system that has the Flex; installing the driver will then take care of everything for you. I believe the same is included in the supported Linux binary drivers from the Intel repos, but not in the backport. I have not done either of the above... Note that you have to update both GPUs separately on the Flex 140, as there are two... My Flex 140 has the following:

root@epyc:/usr/src/igsc/src# ./igsc fw version --device /dev/mei1
Device: FW Version: DG02_2.2353
root@epyc:/usr/src/igsc/src# ./igsc fw-data version --device /dev/mei1
Device: Fw Data Version: Major Version: 101, OEM Manufacturing Data Version: 291, Major VCN: 1
root@epyc:/usr/src/igsc/src# ./igsc oprom-code version --device /dev/mei1
OPROM CODE Version: 14 00 2C 04 00 00 00 00
root@epyc:/usr/src/igsc/src# ./igsc oprom-data version --device /dev/mei1
OPROM DATA Version: 14 00 24 04 00 00 00 00

Maybe someone from Intel could comment if this is the latest...

flumm commented 6 months ago

Just to update in the meantime: it seems that what I had was not a firmware issue, but a thermal one.

I tried passing the card through to a Windows VM (I thought maybe the driver could upgrade the firmware this way, but no) and saw that the cards were in the 90 degree Celsius range while idling, so I increased the fan speed, and since then it has run stable.
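
A minimal sketch of how such temperatures could be watched from the Linux host, assuming the driver in use exposes a temperature sensor through the standard hwmon sysfs interface (`temp*_input` files holding millidegrees Celsius). Whether the i915 backport does so for the Flex 140 is not confirmed in this thread, and the function names and the 85 degree limit are hypothetical.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return a list of (hwmon_dir, chip_name, sensor_file, degrees_celsius)."""
    readings = []
    for hwmon in sorted(Path(root).glob("hwmon*")):
        name_file = hwmon / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        for sensor in sorted(hwmon.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now, skip it
            readings.append((hwmon.name, chip, sensor.name, millideg / 1000.0))
    return readings


def over_limit(readings, limit_c=85.0):
    """Keep only readings above limit_c (the default is an arbitrary example)."""
    return [r for r in readings if r[3] > limit_c]
```

The hwmon directory name is kept in each reading because a card with two GPUs may show up as two chips with the same name.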

I'll eventually get around to updating the firmware, but it seems it's not necessary for me at the moment.

for the record, my firmware version currently is: Device: FW Version: DG02_2.2273
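
To compare two such version strings, for example DG02_2.2273 here against the DG02_2.2353 reported earlier in the thread, a small sketch like the following could be used. It assumes the string has the form project_hotfix.build and that larger numbers mean newer firmware; that ordering is an assumption, not something stated by Intel in this thread, and the helper names are hypothetical.

```python
import re

# Matches e.g. "Device: FW Version: DG02_2.2353" as printed by `igsc fw version`.
FW_RE = re.compile(r"FW Version:\s*([A-Za-z0-9]+)_(\d+)\.(\d+)")


def parse_fw_version(line):
    """Return (project, hotfix, build) parsed from an igsc version line."""
    match = FW_RE.search(line)
    if match is None:
        raise ValueError(f"no firmware version found in: {line!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def is_older(line_a, line_b):
    """True if line_a reports older firmware than line_b (assumed numeric order)."""
    proj_a, *ver_a = parse_fw_version(line_a)
    proj_b, *ver_b = parse_fw_version(line_b)
    if proj_a != proj_b:
        raise ValueError("versions belong to different projects")
    return ver_a < ver_b
```

Under those assumptions, `is_older("Device: FW Version: DG02_2.2273", "Device: FW Version: DG02_2.2353")` evaluates to `True`.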

smuqthya commented 5 months ago

Any action from us?

flumm commented 5 months ago

> Any action from us?

While my issues seemed to disappear with proper cooling, could you check the logs I posted and tell me whether that is intended and normal behaviour in that case? Normally I'd expect hardware to either run slowly or crash outright when not cooled properly, but the weird hangs and failed resets seemed off.

if that is the intended/normal behavior, you can close the issue ofc

Also, it would be nice if there were another official way to obtain firmware upgrades besides installing the Windows driver (and to check whether there is newer firmware at all), but this is only tangentially related to this issue (is there a better place to request/report that?).

thanks

smuqthya commented 5 months ago

@flumm The igsc tool is cross-platform, so the same tool can be used on Linux. Please check the documentation in the repo:

https://github.com/intel/igsc

flumm commented 5 months ago

thanks for responding.

Yes, the tool to flash the firmware is clear, and that seems to work. My question was how to get the updated firmware. I did not find any Intel site that mentions that, so is the only way currently to start the Windows install and extract the files from the temporary dir there? (Or am I missing something here?)

smuqthya commented 3 months ago

Ubuntu Package Installation

The kernel and xpu-smi packages can be installed on a bare metal system. Installation on the host is sufficient for hardware management and support of the runtimes in containers and bare metal.

sudo apt install -y \
    linux-headers-$(uname -r) \
    linux-modules-extra-$(uname -r) \
    flex bison \
    intel-fw-gpu intel-i915-dkms xpu-smi
sudo reboot

Compute and Full instructions: https://dgpu-docs.intel.com/driver/installation.html

flumm commented 3 months ago

Thanks for the answer, but I'm not sure how that relates to my question. I wanted to know where I can get the firmware files, besides extracting them from the Windows driver. Or is it enough to load the latest one from https://github.com/intel-gpu/intel-gpu-firmware ?

alexzuointel commented 3 months ago

Take Ubuntu as an example: the intel-fw-gpu package contains the latest FWs.

1.4.3.2. Ubuntu Package Installation

The kernel and xpu-smi packages can be installed on a bare metal system. Installation on the host is sufficient for hardware management and support of the runtimes in containers and bare metal.

sudo apt install -y \
    linux-headers-$(uname -r) \
    linux-modules-extra-$(uname -r) \
    flex bison \
    intel-fw-gpu intel-i915-dkms xpu-smi
sudo reboot

Compute and Full instructions: https://dgpu-docs.intel.com/driver/installation.html

intel-gpu / intel-gpu-i915-backports

Flex 140 hangs and reset fails #172