intel-gpu / intel-gpu-i915-backports

I915-23.10.54 GPU hangs on Ubuntu 22.04/Kernel 6.5 with Multi-ARC770 #193

Open qiyuangong opened 2 months ago

qiyuangong commented 2 months ago

OS: Ubuntu 22.04, Kernel: 6.5.0-35-generic

Installed version

[    4.226457] Loading modules backported from I915-23.10.54
[    4.226463] Backport generated by backports.git I915_23.10.54_PSB_231129.55

Error message

Sep  8 21:20:10 ws-arc-002 systemd[1]: Started libcontainer container 7c6b96e3f0d651ca2147dffae00d2b6ad336456cf0a47794c8d8fc8c8a389509.
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945436] BUG: kernel NULL pointer dereference, address: 00000000000000c8
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945460] #PF: supervisor read access in kernel mode
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945468] #PF: error_code(0x0000) - not-present page
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945474] PGD 339269067 P4D 33926a067 PUD 33926b067 PMD 0
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945483] Oops: 0000 [#1] PREEMPT SMP NOPTI
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945491] CPU: 28 PID: 38003 Comm: python Tainted: G           OE      6.5.0-35-generic #35~22.04.1-Ubuntu
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945502] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 2.1b 05/28/2024
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945509] RIP: 0010:lru_gen_eviction+0x10f/0x1d0
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945524] Code: d2 48 09 c2 48 83 fe 04 0f 87 a5 00 00 00 45 0f b6 e4 4b 8d 84 a0 95 00 00 00 f0 4d 01 bc c5 88 00 00 00 85 db 0f 95 c0 66 90 <0f> b7 89 c8 00 00 00 48 c1 e2 10 0f b6 c0 48 be 00 00 ff ff ff ff
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945540] RSP: 0000:ff5610a90fe7f5d0 EFLAGS: 00010046
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945547] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945554] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945561] RBP: ff5610a90fe7f610 R08: 0000000000000000 R09: 0000000000000000
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945568] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945575] R13: ff263a68c0152000 R14: ff263aa83ffd4000 R15: 0000000000000001
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945582] FS:  000078fa497f8640(0000) GS:ff263aa740100000(0000) knlGS:0000000000000000
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945590] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945596] CR2: 00000000000000c8 CR3: 0000002abee86002 CR4: 0000000000771ee0
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945603] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep  8 21:46:02 ws-arc-002 kernel: [ 2744.945610] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
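
For triage, the handful of fields that matter first in a trace like the one above (fault type, fault address, faulting symbol, and the process involved) can be pulled out with a short script. This is only a minimal sketch, assuming syslog-formatted lines like those pasted here; the function name `summarize_oops` and the regular expressions are illustrative, not part of any kernel or driver tooling.

```python
import re

# Illustrative patterns for the first lines of an x86 kernel oops.
FAULT_RE = re.compile(r"BUG: (?P<kind>.+?), address: (?P<addr>[0-9a-f]+)")
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)
CPU_RE = re.compile(r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+)")


def summarize_oops(lines):
    """Return a dict with the first fault, RIP and process info found in `lines`."""
    summary = {}
    for line in lines:
        m = FAULT_RE.search(line)
        if m and "fault" not in summary:
            summary["fault"] = m.group("kind")
            summary["address"] = int(m.group("addr"), 16)
            continue
        m = RIP_RE.search(line)
        if m and "symbol" not in summary:
            summary["symbol"] = m.group("symbol")
            summary["offset"] = int(m.group("offset"), 16)
            continue
        m = CPU_RE.search(line)
        if m and "comm" not in summary:
            summary["cpu"] = int(m.group("cpu"))
            summary["pid"] = int(m.group("pid"))
            summary["comm"] = m.group("comm")
    return summary


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize_oops.py < dmesg.log
    for key, value in summarize_oops(sys.stdin).items():
        print(f"{key}: {value}")
```

Run against the log above, this should report a NULL pointer dereference at address 0xc8 in `lru_gen_eviction`, triggered from the `python` process. The fault address of 0xc8 together with RCX = 0 in the register dump is consistent with a field being read at offset 0xc8 through a NULL pointer.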

smuqthya commented 2 months ago

@qiyuangong Please share the output of "dmesg -r" and "dkms status".
This issue has been observed when running on an iGPU but not on a dGPU. Can you please confirm?

qiyuangong commented 2 months ago

dkms status

AUXILIARY_BUS is enabled for 6.5.0-35-generic.
intel-i915-dkms/1.23.10.54.231129.55, 6.5.0-35-generic, x86_64: installed
AUXILIARY_BUS is enabled for 6.5.0-35-generic.
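
When comparing several machines, it can help to turn "dkms status" output into structured records. The following is a rough sketch, assuming the "module/version, kernel, arch: status" line format shown above (older dkms releases print a different layout); `parse_dkms_status` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   intel-i915-dkms/1.23.10.54.231129.55, 6.5.0-35-generic, x86_64: installed
# The status is limited to lowercase letters so that text fused onto the end
# of the line (as can happen when output is pasted) is not captured.
DKMS_RE = re.compile(
    r"^(?P<module>[^/,]+)/(?P<version>[^,]+), "
    r"(?P<kernel>[^,]+), (?P<arch>[^:]+): (?P<status>[a-z]+)"
)


def parse_dkms_status(text):
    """Return one dict per recognised module line; other lines are ignored."""
    entries = []
    for line in text.splitlines():
        m = DKMS_RE.match(line.strip())
        if m:
            entries.append(m.groupdict())
    return entries


if __name__ == "__main__":
    import sys

    # Usage: dkms status | python parse_dkms_status.py
    for entry in parse_dkms_status(sys.stdin.read()):
        print(entry)
```

Informational lines such as "AUXILIARY_BUS is enabled for ..." do not match the pattern and are skipped.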

dmesg -r dmesg.log

smuqthya commented 2 months ago

@qiyuangong Can I ask what use case you are targeting?

Note: https://github.com/intel-gpu/intel-gpu-i915-backports?tab=readme-ov-file#intel-graphics-driver-backports-for-linux-os-intel-gpu-i915-backports

For Alchemist discrete graphics cards, support is provided without display. This repo can be used for features such as GPU debug functionality. For normal use, please use the upstream kernel, version 6.2 or later.

qiyuangong commented 2 months ago

> @qiyuangong Can I ask what use case you are targeting?
>
> Note: https://github.com/intel-gpu/intel-gpu-i915-backports?tab=readme-ov-file#intel-graphics-driver-backports-for-linux-os-intel-gpu-i915-backports
>
> For Alchemist discrete graphics cards, support is provided without display. This repo can be used for features such as GPU debug functionality. For normal use, please use the upstream kernel, version 6.2 or later.

We use this driver for LLM-related debugging and benchmarking. The out-of-tree (OOT) driver performs better than the upstream driver in kernels 6.2 through 6.5.