intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Upstream i915 GUC load failed on Ubuntu 24.04 Kernel 6.8.0-31 with Arc A770 #12122

Open huiwangnick opened 2 months ago

huiwangnick commented 2 months ago

OS Ubuntu 24.04 Kernel 6.8.0-31-generic

Error message

[   18.393201] i915 0000:0d:00.0: [drm] BAR2 resized to 16384M
[   18.393290] i915 0000:0d:00.0: [drm] Local memory IO size: 0x00000003fa000000
[   18.393296] i915 0000:0d:00.0: [drm] Local memory available: 0x00000003fa000000
[   18.409701] i915 0000:0d:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[   18.422225] i915 0000:0d:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[   18.430827] i915 0000:0d:00.0: [drm] GT0: GUC: ADS capture alloc size changed from 32768 to 36864
[   18.431927] i915 0000:0d:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
[   18.431930] i915 0000:0d:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.15
[   18.432046] i915 0000:0d:00.0: [drm] GT0: GUC: ADS capture alloc size changed from 32768 to 36864
[   18.432614] i915 0000:0d:00.0: [drm] GT0: GUC: load failed: status = 0x40000056, time = 0ms, freq = 2400MHz, ret = 0
[   18.432617] i915 0000:0d:00.0: [drm] GT0: GUC: load failed: status: Reset = 0, BootROM = 0x2B, UKernel = 0x00, MIA = 0x00, Auth = 0x01
[   18.432619] i915 0000:0d:00.0: [drm] GT0: GUC: firmware production part check failure
[   18.432684] i915 0000:0d:00.0: [drm] *ERROR* GT0: GuC initialization failed -ENOEXEC
[   18.432688] i915 0000:0d:00.0: [drm] *ERROR* GT0: Enabling uc failed (-5)
[   18.432690] i915 0000:0d:00.0: [drm] *ERROR* GT0: Failed to initialize GPU, declaring it wedged!
[   18.434676] i915 0000:0d:00.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_set_wedged_on_init+0x34/0x50 [i915]
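
For readers comparing logs, the raw `status = 0x40000056` value can be unpacked into the same fields the driver prints on the following line. The sketch below is illustrative only: the bit positions are an assumption based on the i915 `GUC_STATUS` register layout, and `decode_guc_status` is a hypothetical helper, not part of any driver tooling. The assumed layout does reproduce the Reset, BootROM, UKernel, MIA and Auth values shown in the log above.

```python
# Assumed GUC_STATUS bit layout (not taken from this thread):
#   bit 0       MIA in reset
#   bits 1-7    BootROM status
#   bits 8-15   uKernel status
#   bits 16-18  MIA core status
#   bits 30-31  authentication status


def decode_guc_status(status: int) -> dict:
    """Split a raw GuC status word into the fields printed by the driver."""
    return {
        "Reset": status & 0x1,
        "BootROM": (status >> 1) & 0x7F,
        "UKernel": (status >> 8) & 0xFF,
        "MIA": (status >> 16) & 0x07,
        "Auth": (status >> 30) & 0x03,
    }


if __name__ == "__main__":
    # Value copied from the "load failed" line in the log above.
    for name, value in decode_guc_status(0x40000056).items():
        print(f"{name} = 0x{value:02X}")
```

With this layout, BootROM decodes to 0x2B, the value the driver reports alongside "firmware production part check failure".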

intel-gpu/intel-gpu-i915-backports#194

qiyuangong commented 1 month ago

Hi @huiwangnick

From the log in https://github.com/intel-gpu/intel-gpu-i915-backports/issues/194, you are using an out-of-tree driver (version 1.24.4.12.240603.18.6.8.0.40+i1-1, package name intel-dmabuf-drm-i915-dkms) rather than the upstream driver.

Also, that driver version differs from our recommendation (i.e., intel-i915-dkms in https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#for-linux-kernel-65).

Can you share your driver installation steps and commands?

huiwangnick commented 1 month ago

Hi @qiyuangong

I apologize for not being clearer earlier. In the issue discussed in intel-gpu/intel-gpu-i915-backports#194, I am using the out-of-tree driver; however, in this case, I am utilizing the upstream driver from the Linux kernel. I have also attempted your recommendation of using intel-i915-dkms, but all the drivers exhibit the same GUC load failure issue that I mentioned earlier.

qiyuangong commented 1 month ago

That's fine. :)

First of all, please check the GPU installation (PCIe and power), the BIOS config, and the firmware of these GPUs. It seems they cannot be initialized correctly. We previously encountered GuC-related errors when a GPU was not installed correctly.

[   18.422225] i915 0000:0d:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[   18.430827] i915 0000:0d:00.0: [drm] GT0: GUC: ADS capture alloc size changed from 32768 to 36864
[   18.431927] i915 0000:0d:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
[   18.431930] i915 0000:0d:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.15
[   18.432046] i915 0000:0d:00.0: [drm] GT0: GUC: ADS capture alloc size changed from 32768 to 36864
[   18.432614] i915 0000:0d:00.0: [drm] GT0: GUC: load failed: status = 0x40000056, time = 0ms, freq = 2400MHz, ret = 0
[   18.432617] i915 0000:0d:00.0: [drm] GT0: GUC: load failed: status: Reset = 0, BootROM = 0x2B, UKernel = 0x00, MIA = 0x00, Auth = 0x01
[   18.432619] i915 0000:0d:00.0: [drm] GT0: GUC: firmware production part check failure
[   18.432684] i915 0000:0d:00.0: [drm] *ERROR* GT0: GuC initialization failed -ENOEXEC
[   18.432688] i915 0000:0d:00.0: [drm] *ERROR* GT0: Enabling uc failed (-5)
[   18.432690] i915 0000:0d:00.0: [drm] *ERROR* GT0: Failed to initialize GPU, declaring it wedged!

Second, the 6.8.0 kernel (with its upstream driver version) is not recommended by ipex-llm. It requires newer Level Zero and oneAPI versions, so using 6.8 will hit a Level Zero mismatch with our recommended packages. Please switch to the recommended kernel version, i.e., 6.5.0.

huiwangnick commented 1 month ago

Thank you. I will try 6.5.0 kernel to see if I can get it to work.

In the meantime, I’d like to add that when the BAR is not resized to 16,384 MB and remains at 256 MB, the driver loads successfully. This suggests that the issue is not related to GPU installation or GPU firmware. It seems possible that the Resizable BAR feature could be causing this problem.

[   18.110784] i915 0000:0d:00.0: [drm] Failed to resize BAR2 to 16384M (-ENOSPC)
[   18.110792] i915 0000:0d:00.0: BAR 2 [mem 0x13ffe0000000-0x13ffefffffff 64bit pref]: assigned
[   18.110938] i915 0000:0d:00.0: [drm] Local memory IO size: 0x0000000010000000
[   18.110943] i915 0000:0d:00.0: [drm] Local memory available: 0x00000003fa000000
[   18.110946] i915 0000:0d:00.0: [drm] Using a reduced BAR size of 256MiB. Consider enabling 'Resizable BAR' or similar, if available in the BIOS.
[   18.132145] i915 0000:0d:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[   18.141268] i915 0000:0d:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
[   18.141273] i915 0000:0d:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[   18.149102] i915 0000:0d:00.0: [drm] GT0: GUC: submission enabled
[   18.149104] i915 0000:0d:00.0: [drm] GT0: GUC: SLPC enabled
[   18.149347] i915 0000:0d:00.0: [drm] GT0: GUC: RC enabled
[   18.177938] [drm] Initialized i915 1.6.0 20230929 for 0000:0d:00.0 on minor 1

qiyuangong commented 1 month ago

OK. If this is a Resizable BAR (Base Address Register) related issue, please check the BIOS config. Enabling this feature is recommended for Arc.

https://www.intel.com/content/www/us/en/support/articles/000090831/graphics.html
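
To compare the two boots reported in this thread, the "Local memory IO size" and "Local memory available" values can be pulled out of the kernel log and converted to MiB. This is a minimal sketch, assuming the log text was captured beforehand (for example from `dmesg`); `lmem_sizes_mib` and `bar_covers_all_memory` are hypothetical helper names, and the sample lines are copied from the logs posted above.

```python
import re

# Matches the two i915 "Local memory ..." lines and captures the hex size.
LMEM_LINE = re.compile(r"Local memory (IO size|available): (0x[0-9a-fA-F]+)")

MIB = 1024 * 1024


def lmem_sizes_mib(log_text: str) -> dict:
    """Return {'IO size': MiB, 'available': MiB} parsed from i915 log text."""
    return {label: int(value, 16) // MIB for label, value in LMEM_LINE.findall(log_text)}


def bar_covers_all_memory(sizes: dict) -> bool:
    """True when the CPU-visible window (IO size) spans all local memory."""
    return sizes["IO size"] >= sizes["available"]


# Boot where BAR2 was resized to 16384M (GuC load failed in this thread).
RESIZED_BOOT = """
[   18.393290] i915 0000:0d:00.0: [drm] Local memory IO size: 0x00000003fa000000
[   18.393296] i915 0000:0d:00.0: [drm] Local memory available: 0x00000003fa000000
"""

# Boot where the resize failed with -ENOSPC (driver loaded successfully).
REDUCED_BOOT = """
[   18.110938] i915 0000:0d:00.0: [drm] Local memory IO size: 0x0000000010000000
[   18.110943] i915 0000:0d:00.0: [drm] Local memory available: 0x00000003fa000000
"""

if __name__ == "__main__":
    for name, text in (("resized", RESIZED_BOOT), ("reduced", REDUCED_BOOT)):
        sizes = lmem_sizes_mib(text)
        print(name, sizes, "full BAR:", bar_covers_all_memory(sizes))
```

The check only reports which BAR configuration a given boot used. It does not explain why the GuC load fails in the resized case.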