intel-gpu / intel-gpu-i915-backports

kernel crash on Proxmox 8.1 when starting a VM with a VF attached #165

Closed scyto closed 6 months ago

scyto commented 7 months ago

Hi, I have used the unofficial i915-dkms repo, and that works fine on Proxmox 8.1 (6.5.x kernel) with a Windows VM.

I thought I would switch over to the 'official' backports driver. The driver loads just fine, and I see 7 VFs created.

However, when I pass the vGPU through to a QEMU/KVM VM and start the VM, I get the following kernel BUG.

Is this a known issue? Any ideas where I should look next?

[  732.971624] BUG: kernel NULL pointer dereference, address: 0000000000000188
[  732.971651] #PF: supervisor write access in kernel mode
[  732.971661] #PF: error_code(0x0002) - not-present page
[  732.971669] PGD 0 P4D 0 
[  732.971680] Oops: 0002 [#2] PREEMPT SMP NOPTI
[  732.971692] CPU: 12 PID: 10 Comm: kworker/u32:0 Tainted: P     UD    O       6.5.13-3-pve #1
[  732.971707] Hardware name: Intel(R) Client Systems NUC13ANHi7/NUC13ANBi7, BIOS ANRPL357.0031.2024.0207.1420 02/07/2024
[  732.971719] Workqueue: i915 __i915_vm_release [i915]
[  732.972156] RIP: 0010:_raw_spin_lock+0x13/0x60
[  732.972175] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 3c 7a bb 7a 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
[  732.972192] RSP: 0018:ffffad06400c3cb0 EFLAGS: 00010246
[  732.972202] RAX: 0000000000000000 RBX: ffff9c254d158000 RCX: 0000000000000000
[  732.972211] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000188
[  732.972219] RBP: ffffad06400c3cd0 R08: 0000000000000000 R09: 0000000000000000
[  732.972228] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  732.972235] R13: 0000000000000188 R14: ffff9c254fc55000 R15: ffff9c254fc55000
[  732.972244] FS:  0000000000000000(0000) GS:ffff9c3497900000(0000) knlGS:0000000000000000
[  732.972255] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  732.972263] CR2: 0000000000000188 CR3: 0000000900234000 CR4: 0000000000752ee0
[  732.972272] PKRU: 55555554
[  732.972277] Call Trace:
[  732.972284]  <TASK>
[  732.972293]  ? show_regs+0x6d/0x80
[  732.972308]  ? __die+0x24/0x80
[  732.972319]  ? page_fault_oops+0x176/0x500
[  732.972334]  ? do_user_addr_fault+0x31d/0x6a0
[  732.972347]  ? exc_page_fault+0x83/0x1b0
[  732.972357]  ? asm_exc_page_fault+0x27/0x30
[  732.972374]  ? _raw_spin_lock+0x13/0x60
[  732.972387]  ? px_release+0x28/0xe0 [i915]
[  732.972789]  free_px+0x7c/0xb0 [i915]
[  732.973188]  __gen8_ppgtt_cleanup+0x3db/0x410 [i915]
[  732.973523]  ? psi_task_switch+0xd3/0x240
[  732.973535]  ? raw_spin_rq_unlock+0x10/0x40
[  732.973545]  ? finish_task_switch.isra.0+0x85/0x2c0
[  732.973554]  ? __schedule+0x404/0x1440
[  732.973567]  gen8_ppgtt_cleanup+0x3a/0x60 [i915]
[  732.973886]  __i915_vm_release+0x1a/0x40 [i915]
[  732.974233]  process_one_work+0x23b/0x450
[  732.974245]  worker_thread+0x50/0x3f0
[  732.974255]  ? __pfx_worker_thread+0x10/0x10
[  732.974265]  kthread+0xef/0x120
[  732.974272]  ? __pfx_kthread+0x10/0x10
[  732.974279]  ret_from_fork+0x44/0x70
[  732.974289]  ? __pfx_kthread+0x10/0x10
[  732.974296]  ret_from_fork_asm+0x1b/0x30
[  732.974307]  </TASK>
[  732.974311] Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd veth ceph libceph fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics bonding tls qrtr softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common intel_uncore_frequency snd_hda_codec_hdmi intel_uncore_frequency_common snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_tgl snd_sof_intel_hda_common kvm_intel soundwire_intel snd_sof_intel_hda_mlink kvm soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp irqbypass snd_sof crct10dif_pclmul polyval_clmulni snd_sof_utils polyval_generic snd_soc_hdac_hda ghash_clmulni_intel snd_hda_ext_core sha256_ssse3 snd_soc_acpi_intel_match sha1_ssse3 snd_soc_acpi aesni_intel iwlmvm soundwire_generic_allocation soundwire_bus crypto_simd cryptd i915(O) mac80211 snd_soc_core snd_compress
[  732.974446]  libarc4 ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec btusb i915_compat(O) rapl btrtl snd_hda_core drm_display_helper btbcm snd_hwdep pmt_telemetry(O) btintel mei_hdcp(O) mei_pxp(O) cec iwlwifi asus_nb_wmi snd_pcm pmt_class(O) cmdlinepart btmtk ov13858 asus_wmi snd_timer mei_me(O) rc_core bluetooth v4l2_fwnode spi_nor ledtrig_audio snd ucsi_acpi sparse_keymap cfg80211 drm_kms_helper v4l2_async ecdh_generic intel_cstate soundcore mtd typec_ucsi ee1004 platform_profile pcspkr mei(O) wmi_bmof ecc videodev i2c_algo_bit typec intel_vsec(O) joydev input_leds mc acpi_tad acpi_pad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap thunderbolt_net drm msr efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb hid_generic usbmouse usbkbd usbhid hid uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci nvme intel_lpss_pci xhci_pci_renesas spi_intel_pci ahci i2c_i801 intel_lpss nvme_core video
[  732.974656]  crc32_pclmul xhci_hcd thunderbolt igc spi_intel libahci i2c_smbus idma64 wmi nvme_common pinctrl_tigerlake
[  732.974722] CR2: 0000000000000188
[  732.974729] ---[ end trace 0000000000000000 ]---
[  733.209498] RIP: 0010:_raw_spin_lock+0x13/0x60
[  733.209513] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 3c 7a bb 7a 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
[  733.209520] RSP: 0018:ffffad0640ee7cb0 EFLAGS: 00010246
[  733.209533] RAX: 0000000000000000 RBX: ffff9c254d324c80 RCX: 0000000000000000
[  733.209537] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000188
[  733.209540] RBP: ffffad0640ee7cd0 R08: 0000000000000000 R09: 0000000000000000
[  733.209543] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  733.209545] R13: 0000000000000188 R14: ffff9c25496f5000 R15: ffff9c25496f5000
[  733.209549] FS:  0000000000000000(0000) GS:ffff9c3497900000(0000) knlGS:0000000000000000
[  733.209553] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  733.209556] CR2: 0000000000000188 CR3: 000000010ad96000 CR4: 0000000000752ee0
[  733.209559] PKRU: 55555554
[  733.209562] note: kworker/u32:0[10] exited with irqs disabled
[  733.209595] note: kworker/u32:0[10] exited with preempt_count 1
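
For readers of the oops above, here is a minimal sketch of how the `#PF: error_code(0x0002)` line decodes, assuming the standard x86 page-fault error-code bit layout. The function names are hypothetical and chosen only for illustration. The second helper is a rough heuristic: a fault address inside the first page (here `0x188`, which also appears in RDI and CR2) would be consistent with a NULL structure pointer plus a small field offset, though the log alone does not say which structure that is.

```python
# x86 page-fault error-code bits, as printed in the "#PF: error_code(...)" line.
PF_PROT = 1 << 0   # 0: not-present page, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: supervisor (kernel) mode, 1: user mode
PF_RSVD = 1 << 3   # 1: reserved bit set in a paging-structure entry
PF_INSTR = 1 << 4  # 1: fault was caused by an instruction fetch


def decode_pf_error_code(code: int) -> dict:
    """Translate a page-fault error code into human-readable fields."""
    return {
        "page": "protection violation" if code & PF_PROT else "not-present",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "supervisor",
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }


def looks_like_null_member_access(fault_addr: int, page_size: int = 4096) -> bool:
    """Heuristic only: an address in the first page suggests NULL + field offset."""
    return 0 <= fault_addr < page_size


if __name__ == "__main__":
    # Values taken from the oops above: error_code(0x0002), CR2 = 0x188.
    print(decode_pf_error_code(0x0002))
    print(looks_like_null_member_access(0x188))
```

Applied to this trace, the decode matches the kernel's own summary lines (supervisor write access, not-present page), and the small fault address points at a lock being taken through a NULL pointer somewhere under `px_release` during `__i915_vm_release`.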

This is the i915 startup log, which, at least to me, indicates the driver started just fine:

root@pve1:~# dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.13-3-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7
[    0.079561] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.5.13-3-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7
[    6.769348] i915 0000:00:02.0: Running in SR-IOV PF mode
[    6.769388] i915 0000:00:02.0: [drm] GT count: 1, enabled: 1
[    6.769659] i915 0000:00:02.0: [drm] VT-d active for gfx access
[    6.769765] i915 0000:00:02.0: vgaarb: deactivate vga console
[    6.769781] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[    6.770918] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[    6.774837] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/adlp_dmc_ver2_16.bin (v2.16)
[    6.781599] i915 0000:00:02.0: [drm] GT0: GuC firmware i915/adlp_guc_70.19.2.bin version 70.19.2
[    6.781603] i915 0000:00:02.0: [drm] GT0: HuC firmware i915/tgl_huc_7.9.3.bin version 7.9.3
[    6.795283] i915 0000:00:02.0: [drm] GT0: HuC: authenticated!
[    6.795609] i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
[    6.795611] i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
[    6.796061] i915 0000:00:02.0: [drm] GT0: GUC: RC enabled
[    6.802959] mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
[    6.803078] i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
[    6.825326] i915 0000:00:02.0: 7 VFs could be associated with this PF
[    6.826699] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[    6.873545] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[    7.516210] i915 0000:00:02.1: enabling device (0000 -> 0002)
[    7.516232] i915 0000:00:02.1: Running in SR-IOV VF mode
[    7.517067] i915 0000:00:02.1: GuC interface version 0.1.8.2
[    7.518047] i915 0000:00:02.1: [drm] GT count: 1, enabled: 1
[    7.518179] i915 0000:00:02.1: [drm] VT-d active for gfx access
[    7.518194] i915 0000:00:02.1: [drm] Using Transparent Hugepages
[    7.519254] i915 0000:00:02.1: GuC interface version 0.1.8.2
[    7.520779] i915 0000:00:02.1: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[    7.520791] i915 0000:00:02.1: HuC firmware PRELOADED
[    7.530657] i915 0000:00:02.1: [drm] Protected Xe Path (PXP) protected content support initialized
[    7.530928] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.1 on minor 1
[    7.531806] i915 0000:00:02.2: enabling device (0000 -> 0002)
[    7.531827] i915 0000:00:02.2: Running in SR-IOV VF mode
[    7.532529] i915 0000:00:02.2: GuC interface version 0.1.8.2
[    7.534220] i915 0000:00:02.2: [drm] GT count: 1, enabled: 1
[    7.534249] i915 0000:00:02.2: [drm] VT-d active for gfx access
[    7.534265] i915 0000:00:02.2: [drm] Using Transparent Hugepages
[    7.535333] i915 0000:00:02.2: GuC interface version 0.1.8.2
[    7.537066] i915 0000:00:02.2: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[    7.537089] i915 0000:00:02.2: HuC firmware PRELOADED
[    7.547188] i915 0000:00:02.2: [drm] Protected Xe Path (PXP) protected content support initialized
[    7.547463] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.2 on minor 2
[    7.548524] i915 0000:00:02.3: enabling device (0000 -> 0002)
[    7.548544] i915 0000:00:02.3: Running in SR-IOV VF mode
[    7.549214] i915 0000:00:02.3: GuC interface version 0.1.8.2
[    7.550674] i915 0000:00:02.3: [drm] GT count: 1, enabled: 1
[    7.550697] i915 0000:00:02.3: [drm] VT-d active for gfx access
[    7.550713] i915 0000:00:02.3: [drm] Using Transparent Hugepages
[    7.551783] i915 0000:00:02.3: GuC interface version 0.1.8.2
[    7.553461] i915 0000:00:02.3: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[    7.553464] i915 0000:00:02.3: HuC firmware PRELOADED
[    7.563579] i915 0000:00:02.3: [drm] Protected Xe Path (PXP) protected content support initialized
[    7.563764] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.3 on minor 3
[    7.565559] i915 0000:00:02.4: enabling device (0000 -> 0002)
[    7.565581] i915 0000:00:02.4: Running in SR-IOV VF mode
[    7.566470] i915 0000:00:02.4: GuC interface version 0.1.8.2
[    7.567327] i915 0000:00:02.4: [drm] GT count: 1, enabled: 1
[    7.567344] i915 0000:00:02.4: [drm] VT-d active for gfx access
[    7.567360] i915 0000:00:02.4: [drm] Using Transparent Hugepages
[    7.568763] i915 0000:00:02.4: GuC interface version 0.1.8.2
[    7.570522] i915 0000:00:02.4: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[    7.570531] i915 0000:00:02.4: HuC firmware PRELOADED
[    7.581520] i915 0000:00:02.4: [drm] Protected Xe Path (PXP) protected content support initialized
[    7.581634] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.4 on minor 4
[    7.582653] i915 0000:00:02.5: enabling device (0000 -> 0002)
[    7.582668] i915 0000:00:02.5: Running in SR-IOV VF mode
[    7.583170] i915 0000:00:02.5: GuC interface version 0.1.8.2
[    7.584860] i915 0000:00:02.5: [drm] GT count: 1, enabled: 1
[    7.584882] i915 0000:00:02.5: [drm] VT-d active for gfx access
[    7.584898] i915 0000:00:02.5: [drm] Using Transparent Hugepages
[    7.585797] i915 0000:00:02.5: GuC interface version 0.1.8.2
[    7.587316] i915 0000:00:02.5: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[    7.587330] i915 0000:00:02.5: HuC firmware PRELOADED
[    7.597786] i915 0000:00:02.5: [drm] Protected Xe Path (PXP) protected content support initialized
[    7.598049] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.5 on minor 5
[    7.598776] i915 0000:00:02.6: enabling device (0000 -> 0002)
[    7.598790] i915 0000:00:02.6: Running in SR-IOV VF mode
[    7.599622] i915 0000:00:02.6: GuC interface version 0.1.8.2
[    7.601334] i915 0000:00:02.6: [drm] GT count: 1, enabled: 1
[    7.601349] i915 0000:00:02.6: [drm] VT-d active for gfx access
[    7.601361] i915 0000:00:02.6: [drm] Using Transparent Hugepages
[    7.602244] i915 0000:00:02.6: GuC interface version 0.1.8.2
[    7.603761] i915 0000:00:02.6: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[    7.603774] i915 0000:00:02.6: HuC firmware PRELOADED
[    7.614276] i915 0000:00:02.6: [drm] Protected Xe Path (PXP) protected content support initialized
[    7.614435] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.6 on minor 6
[    7.615236] i915 0000:00:02.7: enabling device (0000 -> 0002)
[    7.615249] i915 0000:00:02.7: Running in SR-IOV VF mode
[    7.616576] i915 0000:00:02.7: GuC interface version 0.1.8.2
[    7.617958] i915 0000:00:02.7: [drm] GT count: 1, enabled: 1
[    7.617975] i915 0000:00:02.7: [drm] VT-d active for gfx access
[    7.617989] i915 0000:00:02.7: [drm] Using Transparent Hugepages
[    7.619123] i915 0000:00:02.7: GuC interface version 0.1.8.2
[    7.620780] i915 0000:00:02.7: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[    7.620794] i915 0000:00:02.7: HuC firmware PRELOADED
[    7.631371] i915 0000:00:02.7: [drm] Protected Xe Path (PXP) protected content support initialized
[    7.631535] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.7 on minor 7
[    7.632212] i915 0000:00:02.0: Enabled 7 VFs
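
A quick way to cross-check a log like the one above is to count the devices that report VF mode and compare that with the `Enabled N VFs` line. This is only a sketch: the function name is hypothetical, and the regular expressions assume the exact message wording seen in this log, which may differ between driver versions.

```python
import re

# Patterns assume the message wording shown in the dmesg output above.
VF_MODE = re.compile(r"i915 (\S+): Running in SR-IOV VF mode")
ENABLED = re.compile(r"i915 \S+: Enabled (\d+) VFs")


def summarize_vfs(dmesg_text: str):
    """Return (list of VF PCI addresses, VF count reported by the PF or None)."""
    vf_devices = VF_MODE.findall(dmesg_text)
    match = ENABLED.search(dmesg_text)
    enabled = int(match.group(1)) if match else None
    return vf_devices, enabled


if __name__ == "__main__":
    # A shortened excerpt of the log above, used as sample input.
    sample = """\
[    6.769348] i915 0000:00:02.0: Running in SR-IOV PF mode
[    7.516232] i915 0000:00:02.1: Running in SR-IOV VF mode
[    7.531827] i915 0000:00:02.2: Running in SR-IOV VF mode
[    7.632212] i915 0000:00:02.0: Enabled 7 VFs
"""
    devices, enabled = summarize_vfs(sample)
    print(devices, enabled)
```

On a live system the input could be the saved output of `dmesg | grep i915`. A matching count only shows that the VFs were created at load time; it says nothing about whether passthrough to a VM will work.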

smuqthya commented 7 months ago

@scyto what is the unofficial repo you are referring to?

The KMD backport is intended for discrete platforms only. We do not test iGPUs with DKMS, nor do we claim to support them.