Open makoONE opened 4 months ago
Is anyone else experiencing this problem?
Is it a reason not to move to 6.8 yet?
John T Davis @.***
On Jul 20, 2024, at 7:44 AM, makoONE @.***> wrote:
I am delighted that kernel 6.8 is now supported. Unfortunately, however, freezes occur after a short time under Proxmox with kernel 6.8.8-3-pve. The error entry with dmesg is:
[ 800.513559] BUG: kernel NULL pointer dereference, address: 000000000000057d
[ 800.513567] #PF: supervisor read access in kernel mode
[ 800.513569] #PF: error_code(0x0000) - not-present page
It would be great if the error could be fixed, thanks.
6.8.8-2-pve has been stable for me for over 24 hours (i9-13900H - Minisforum MS-01). I haven't tried 6.8.8-3-pve yet.
I'm running the module with the 6.8.8-2-pve kernel without problems so far on Raptorlake Refresh hardware (ASUS W680 + i9-14900K) and also on some Beelink Mini EQ 12 (Alderlake N100). In the meantime, even all seven VFs were in use at the same time on the large machine for testing multiple Windows RDP sessions with 4K video decoding and 3D, Debian 12 and Ubuntu 22.04 with VA API and 3D acceleration in parallel. But maybe I was just lucky. 🫤
This repo is currently based on lts-v6.1.26-linux-230504T201607Z from linux-intel-lts, which was merged by @zhtengw on May 8, 2023. Afterwards there seem to be mostly build fixes. Looking at the diff (git diff --stat lts-v6.1.26-linux-230504T201607Z lts-v6.1.95-linux-240708T112901Z drivers/gpu/drm/i915) between lts-v6.1.26-linux-230504T201607Z and the latest lts-v6.1.95-linux-240708T112901Z, there is a whopping 440 files changed, 32882 insertions(+), 20631 deletions(-). The commit history also contains a noticeable number of "fixes". If someone had the time, and were both able and willing to merge those changes into this repo, chances would likely be high that this addresses your issue, @makoONE. I don't know how much effort it would require to do this.
I only recently decided to give Intel SR-IOV a try (it was on the todo list for at least a year 😄) and attempted to fix the build for 6.8 kernels only because I feared that this repo might not get updated anymore (@strongtz seems to have stopped using it). As mentioned in PR #178, I'm also not a kernel/drm developer and lack any experience in debugging kernel modules efficiently, sorry. But even a senior drm developer would have to reproduce and understand such issues in more detail (more logs, stack trace, ...).
Just FYI (see below), there is already another issue on my hardware with an out-of-bounds (OOB) error on an Ubuntu 24.04 guest VM with kernel 6.8 if the module is initialized as minor 0 instead of minor 1. It is also worth mentioning that some memory leaks have been reported in GH-175.
Given that Intel is bringing SR-IOV support with its new xe driver, maybe in kernel 6.12 (or later), this repo might need to survive only for a few more months or ~1 year, I guess (Intel has delayed the release for quite some time now, but Arrow Lake will be released in Q4 2024). On the other hand, it'll take time until whatever xe-sriov kernel lands in new Linux distros. 😁 At least Proxmox seems to be quite fast with new kernel releases.
So is it really worth merging lts-v6.1.95-linux-240708T112901Z, and is there anyone who'd be able to do it?
I also get the freezes with the kernel NULL pointer dereference on the Proxmox kernels 6.8.8-2-pve and 6.8.4-3-pve when running a Win11 VM that uses the vGPU. The same VM otherwise runs with the Proxmox 6.5 kernel without any problems.
I sadly won't be able to help you with debugging/updating the module. @JTR-Tech and I are using the same Raptor Lake-S UHD Graphics with device ID A780. The issue might be limited to your Alder Lake-P UHD Graphics with device ID 46A3, though my little N100 Alder Lake-N processor with UHD Graphics ID 46D1 does not seem to freeze. But it could also be other effects like memory limits or the VM driver. I'm using the latest driver 32.0.101.5762 for the Windows VMs with hardware settings as follows:
My VM configuration is largely the same and I am also using the latest Intel Graphics Driver 32.0.101.5762 in the Win11 VM.
I just found out that the current i915-sriov-dkms state 2024.07.19 does not work with a 6.5 kernel (6.5.13-5-pve) anymore, same error kernel null pointer dereference.
Probably there is no other choice but to go back to i915-sriov-dkms state before 2024.07.17, i.e. without kernel 6.8 support.
Before 2024.07.17, kernel 6.5.13-5-pve was not working due to build issues. It was impossible to build and only 6.5.13-3-pve compiled. The build issues with 6.5.13-5-pve have been fixed with #178 / #179. I have tested and was running all kernels 6.5.13-3-pve, 6.5.13-5-pve and 6.8.4-3-pve and 6.8.8-2-pve without issues afterwards.
Also please note that there was no functional change for Proxmox with #178 and #179 except for the firmware version. The only thing that changed was header includes for kernel 6.8.* and a fix for 6.5.13-5-pve that would have prevented building the module anyway.
It is technically impossible that the module change from 6.1 to 2024.07.17 or 2024.07.19 has changed anything with the 6.5.13-3-pve kernel or the 6.5.13-5-pve kernel (which was broken anyway) unless it is related to the firmware version.
What is your output of dmesg | grep i915 after boot?
Output of dmesg | grep i915 after boot with current state of i915-sriov-dkms and kernel 6.8.8-3-pve:
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.8-3-pve root=/dev/mapper/nab6--vg-root ro quiet mitigations=off intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 cpufreq.default_governor=powersave
[ 0.101347] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.8-3-pve root=/dev/mapper/nab6--vg-root ro quiet mitigations=off intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 cpufreq.default_governor=powersave
use xe.force_probe='46a3' and i915.force_probe='!46a3'
[ 1.681479] i915: loading out-of-tree module taints kernel.
[ 1.681510] i915: module verification failed: signature and/or required key missing - tainting kernel
[ 1.857258] i915 0000:00:02.0: Running in SR-IOV PF mode
[ 1.857828] i915 0000:00:02.0: [drm] VT-d active for gfx access
[ 1.875918] i915 0000:00:02.0: vgaarb: deactivate vga console
[ 1.875967] i915 0000:00:02.0: [drm] Using Transparent Hugepages
[ 1.876334] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[ 1.877907] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/adlp_dmc.bin (v2.20)
[ 1.886258] i915 0000:00:02.0: [drm] GT0: GuC firmware i915/adlp_guc_70.bin version 70.20.0
[ 1.886261] i915 0000:00:02.0: [drm] GT0: HuC firmware i915/tgl_huc.bin version 7.9.3
[ 1.900528] i915 0000:00:02.0: [drm] GT0: HuC: authenticated for all workloads!
[ 1.901378] i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
[ 1.901379] i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
[ 1.901804] i915 0000:00:02.0: [drm] GuC RC: enabled
[ 1.902284] i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
[ 1.932033] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 1
[ 1.933160] i915 0000:00:02.0: 7 VFs could be associated with this PF
[ 1.966144] fbcon: i915drmfb (fb0) is primary device
[ 1.966148] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[ 3.999089] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[ 6.016961] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
use xe.force_probe='46a3' and i915.force_probe='!46a3'
[ 6.017020] i915 0000:00:02.1: enabling device (0000 -> 0002)
[ 6.017033] i915 0000:00:02.1: Running in SR-IOV VF mode
[ 6.017463] i915 0000:00:02.1: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.018848] i915 0000:00:02.1: [drm] VT-d active for gfx access
[ 6.018875] i915 0000:00:02.1: [drm] Using Transparent Hugepages
[ 6.019215] i915 0000:00:02.1: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.019992] i915 0000:00:02.1: GuC firmware PRELOADED version 1.9 submission:SR-IOV VF
[ 6.019993] i915 0000:00:02.1: HuC firmware PRELOADED
[ 6.022863] i915 0000:00:02.1: [drm] Protected Xe Path (PXP) protected content support initialized
[ 6.022866] i915 0000:00:02.1: [drm] PMU not supported for this GPU.
[ 6.023043] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.1 on minor 0
[ 6.023230] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
[ 6.023233] i915 0000:00:02.1: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
use xe.force_probe='46a3' and i915.force_probe='!46a3'
[ 6.023280] i915 0000:00:02.2: enabling device (0000 -> 0002)
[ 6.023288] i915 0000:00:02.2: Running in SR-IOV VF mode
[ 6.023761] i915 0000:00:02.2: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.024884] i915 0000:00:02.2: [drm] VT-d active for gfx access
[ 6.024895] i915 0000:00:02.2: [drm] Using Transparent Hugepages
[ 6.025180] i915 0000:00:02.2: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.025752] i915 0000:00:02.2: GuC firmware PRELOADED version 1.9 submission:SR-IOV VF
[ 6.025754] i915 0000:00:02.2: HuC firmware PRELOADED
[ 6.027423] i915 0000:00:02.2: [drm] Protected Xe Path (PXP) protected content support initialized
[ 6.027426] i915 0000:00:02.2: [drm] PMU not supported for this GPU.
[ 6.027574] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.2 on minor 2
[ 6.027801] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
[ 6.027804] i915 0000:00:02.1: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.027806] i915 0000:00:02.2: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
use xe.force_probe='46a3' and i915.force_probe='!46a3'
[ 6.027852] i915 0000:00:02.3: enabling device (0000 -> 0002)
[ 6.027864] i915 0000:00:02.3: Running in SR-IOV VF mode
[ 6.027976] i915 0000:00:02.3: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.028215] i915 0000:00:02.3: [drm] VT-d active for gfx access
[ 6.028226] i915 0000:00:02.3: [drm] Using Transparent Hugepages
[ 6.028476] i915 0000:00:02.3: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.028844] i915 0000:00:02.3: GuC firmware PRELOADED version 1.9 submission:SR-IOV VF
[ 6.028845] i915 0000:00:02.3: HuC firmware PRELOADED
[ 6.030471] i915 0000:00:02.3: [drm] Protected Xe Path (PXP) protected content support initialized
[ 6.030475] i915 0000:00:02.3: [drm] PMU not supported for this GPU.
[ 6.030630] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.3 on minor 3
[ 6.030827] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
[ 6.030830] i915 0000:00:02.1: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.030832] i915 0000:00:02.2: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.030835] i915 0000:00:02.3: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
use xe.force_probe='46a3' and i915.force_probe='!46a3'
[ 6.030887] i915 0000:00:02.4: enabling device (0000 -> 0002)
[ 6.030898] i915 0000:00:02.4: Running in SR-IOV VF mode
[ 6.031023] i915 0000:00:02.4: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.031258] i915 0000:00:02.4: [drm] VT-d active for gfx access
[ 6.031268] i915 0000:00:02.4: [drm] Using Transparent Hugepages
[ 6.031522] i915 0000:00:02.4: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.031847] i915 0000:00:02.4: GuC firmware PRELOADED version 1.9 submission:SR-IOV VF
[ 6.031848] i915 0000:00:02.4: HuC firmware PRELOADED
[ 6.033169] i915 0000:00:02.4: [drm] Protected Xe Path (PXP) protected content support initialized
[ 6.033172] i915 0000:00:02.4: [drm] PMU not supported for this GPU.
[ 6.033319] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.4 on minor 4
[ 6.033520] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
[ 6.033523] i915 0000:00:02.1: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.033525] i915 0000:00:02.2: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.033528] i915 0000:00:02.3: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.033530] i915 0000:00:02.4: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
use xe.force_probe='46a3' and i915.force_probe='!46a3'
[ 6.033578] i915 0000:00:02.5: enabling device (0000 -> 0002)
[ 6.033587] i915 0000:00:02.5: Running in SR-IOV VF mode
[ 6.033762] i915 0000:00:02.5: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.033976] i915 0000:00:02.5: [drm] VT-d active for gfx access
[ 6.033986] i915 0000:00:02.5: [drm] Using Transparent Hugepages
[ 6.034233] i915 0000:00:02.5: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.034575] i915 0000:00:02.5: GuC firmware PRELOADED version 1.9 submission:SR-IOV VF
[ 6.034577] i915 0000:00:02.5: HuC firmware PRELOADED
[ 6.035756] i915 0000:00:02.5: [drm] Protected Xe Path (PXP) protected content support initialized
[ 6.035759] i915 0000:00:02.5: [drm] PMU not supported for this GPU.
[ 6.035803] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.5 on minor 5
[ 6.035988] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
[ 6.035990] i915 0000:00:02.1: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.035993] i915 0000:00:02.2: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.035995] i915 0000:00:02.3: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.035998] i915 0000:00:02.4: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.036000] i915 0000:00:02.5: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
use xe.force_probe='46a3' and i915.force_probe='!46a3'
[ 6.036040] i915 0000:00:02.6: enabling device (0000 -> 0002)
[ 6.036052] i915 0000:00:02.6: Running in SR-IOV VF mode
[ 6.036226] i915 0000:00:02.6: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.036457] i915 0000:00:02.6: [drm] VT-d active for gfx access
[ 6.036468] i915 0000:00:02.6: [drm] Using Transparent Hugepages
[ 6.036698] i915 0000:00:02.6: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.036978] i915 0000:00:02.6: GuC firmware PRELOADED version 1.9 submission:SR-IOV VF
[ 6.036979] i915 0000:00:02.6: HuC firmware PRELOADED
[ 6.038153] i915 0000:00:02.6: [drm] Protected Xe Path (PXP) protected content support initialized
[ 6.038156] i915 0000:00:02.6: [drm] PMU not supported for this GPU.
[ 6.038197] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.6 on minor 6
[ 6.038370] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
[ 6.038372] i915 0000:00:02.1: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.038375] i915 0000:00:02.2: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.038377] i915 0000:00:02.3: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.038379] i915 0000:00:02.4: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.038382] i915 0000:00:02.5: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 6.038384] i915 0000:00:02.6: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
use xe.force_probe='46a3' and i915.force_probe='!46a3'
[ 6.038426] i915 0000:00:02.7: enabling device (0000 -> 0002)
[ 6.038436] i915 0000:00:02.7: Running in SR-IOV VF mode
[ 6.038639] i915 0000:00:02.7: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.038902] i915 0000:00:02.7: [drm] VT-d active for gfx access
[ 6.038924] i915 0000:00:02.7: [drm] Using Transparent Hugepages
[ 6.039221] i915 0000:00:02.7: [drm] GT0: GUC: interface version 0.1.9.0
[ 6.039536] i915 0000:00:02.7: GuC firmware PRELOADED version 1.9 submission:SR-IOV VF
[ 6.039538] i915 0000:00:02.7: HuC firmware PRELOADED
[ 6.040794] i915 0000:00:02.7: [drm] Protected Xe Path (PXP) protected content support initialized
[ 6.040798] i915 0000:00:02.7: [drm] PMU not supported for this GPU.
[ 6.040853] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.7 on minor 7
[ 6.041004] i915 0000:00:02.0: Enabled 7 VFs
If you want to test the previously allowed GuC firmware minor version (0 or 4) instead of the corrected minor version 9, you could just run the following in the root of your i915-sriov-dkms repo:
sed -i 's/GUCFIRMWARE_MINOR:-9/GUCFIRMWARE_MINOR:-0/' Makefile
rm -rf /usr/src/i915-sriov-dkms-* /var/lib/dkms/i915-sriov-dkms
dkms add .
dkms install -m i915-sriov-dkms -v $(cat VERSION) -k 6.8.8-2-pve --force
This will change GUCFIRMWARE_MINOR=9 to GUCFIRMWARE_MINOR=0 in the Makefile.
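A quick way to confirm which default is currently set might look like the sketch below. It assumes the Makefile contains the pattern GUCFIRMWARE_MINOR:-<number> that the sed expression above matches; the function name guc_minor_default is made up for illustration and is not part of the repo.

```shell
# Print the default GuC firmware minor version found in a Makefile.
# Assumption: the Makefile contains "GUCFIRMWARE_MINOR:-<n>", which is the
# pattern the sed command in this thread rewrites.
guc_minor_default() {
    makefile="${1:-Makefile}"
    # Extract the digits after ":-" and keep only the first match.
    sed -n 's/.*GUCFIRMWARE_MINOR:-\([0-9][0-9]*\).*/\1/p' "$makefile" | head -n 1
}
```

Running guc_minor_default in the repo root before and after the sed command should show whether the substitution was applied.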
After successful compilation and reboot, the output of dmesg | grep i915 should then contain error messages like *ERROR* GT0: IOV: Unable to confirm version 1.9 (0000000000000000) and *ERROR* GT0: IOV: Found interface version 0.1.9.0 (see below), but interestingly the VFs still seem to work in Windows. It seems those errors could just be ignored, and that was the reason why it worked before (confirmed by using vainfo and intel_gpu_top -d sriov on the host). 🤔 Could you kindly check if that fixes your freezes?
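To see at a glance which of the messages discussed in this thread show up in a log, something like the following sketch could help. The function name summarize_i915_dmesg is hypothetical, and the match strings are simply taken from the messages quoted in this issue, so they may need adjusting for other kernel or module versions.

```shell
# Count the i915 SR-IOV related messages mentioned in this thread.
# Reads dmesg-style text from stdin and prints one summary line.
summarize_i915_dmesg() {
    awk '
        /BUG: kernel NULL pointer dereference/ { null_deref++ }
        /IOV: Unable to confirm version/       { iov_unconfirmed++ }
        /IOV: Found interface version/         { iov_found++ }
        /VF[0-9]+ FLR/                         { flr++ }
        END {
            # Unset awk counters are printed as 0 by %d.
            printf "null_deref=%d iov_unconfirmed=%d iov_found=%d flr=%d\n",
                   null_deref, iov_unconfirmed, iov_found, flr
        }
    '
}
```

Typical use would be dmesg | summarize_i915_dmesg on the host after the freeze has been reproduced or after a longer uptime.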
I followed your advice with the modified minor version, dmesg says:
[ 5.863414] i915 0000:00:02.3: [drm] ERROR GT0: IOV: Unable to confirm version 1.9 (0000000000000000)
[ 5.863572] i915 0000:00:02.3: [drm] ERROR GT0: IOV: Found interface version 0.1.9.0
Unfortunately the machine continues to freeze.
[ 86.393495] BUG: kernel NULL pointer dereference, address: 000000000000057d
[ 86.393502] #PF: supervisor read access in kernel mode
[ 86.393503] #PF: error_code(0x0000) - not-present page
On my mini workstation (Minisforum MS-01, i9-13900H) running Proxmox VE 8.2.4 x86_64 (6.8.8-3-pve) with version 2024.07.19, I watched an 8K YouTube video for 10 minutes and the GPU never froze. I checked dmesg and found no errors; it works perfectly fine.
I'm running out of ideas if even 6.5.13-3-pve is not working with sed -i 's/GUCFIRMWARE_MINOR:-9/GUCFIRMWARE_MINOR:-0/' Makefile. Below is a comparison of i915-sriov-dkms 6.1 vs. 2024.07.19 for Proxmox VE:
| Module version | Availability | GuC firmware minor version |
| --- | --- | --- |
| 6.1 | Available via git reset --hard 42b49ff | 0 or 4, incorrect, hardcoded |
| 2024.07.19 | Available via git reset --hard 21a2f4a1 (or git reset --hard master atm.) | 9, correct, can be changed via GUCFIRMWARE_MINOR, e.g., sed -i 's/GUCFIRMWARE_MINOR:-9/GUCFIRMWARE_MINOR:-0/' Makefile |
So either your freezes are related to the 6.8 kernel or to the corrected firmware version in combination with your UHD Device ID 46A3 (i7-12650H). The UHD Device ID A780 (i9-13900H, i9-14900K) is reportedly working fine with all kernel variants.
But the i915-sriov-dkms module version 2024.07.19 with sed -i 's/GUCFIRMWARE_MINOR:-9/GUCFIRMWARE_MINOR:-0/' Makefile will effectively compile the same binary for 6.5.13-3-pve as version 6.1 did, and 6.5.13-3-pve was the only working pve kernel in the list.
If you still have freezes with this combination and it was working before with 6.1, it must be related to something else.
6.8.8-2-pve has been stable for me for over 24 hours (i9-13900H - Minisforum MS-01). I haven't tried 6.8.8-3-pve yet.
Update for the sake of completeness:
Proxmox crashed around the 30-hour mark. It was running headless, so I'm not sure what the error was (nothing showing in the journal). This is roughly in line with the behavior I've seen on 6.5 PVE kernels.
7 VFs were applied, with only 1 assigned to a Windows 11 Pro guest using Intel Xe driver version 31.0.101.5590. The VM was idle with Windows Sleep mode disabled.
I have successfully finished a 24-hour 3D + 4K video test running in 2 Windows 11 VMs in parallel without issues on my Proxmox VE 8.2 host with kernel 6.8.8-2-pve (I don't want to see the next electricity bill). And the host with an ASUS W680 + i9-14900K has been running absolutely fine for 6 days now. Really no problems at all. It needs some new BIOS thanks to Intel (https://www.youtube.com/watch?v=wkrOYfmXhIc) and must therefore be rebooted now.
My dmesg got filled with a lot of messages like i915 0000:00:02.0: VF{i} FLR in the meantime. That seems to be some Function Level Reset related to SR-IOV. Not sure if this is OK, but at least I could not see any effect.
The small N100 is also running with the loaded module in kernel 6.8.8-2 for several days now but it was just idling without any VF in use.
Not sure why this is different for my two machines. I'm running both machines with the kernel command line options split_lock_detect=off i915.enable_fbc=1 i915.enable_guc=3 i915.max_vfs=7 ignore_msrs=1 report_ignored_msrs=0; not sure if some of these options are related. At least the i9-13900H should be very similar to mine. Don't know, sorry.
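For anyone who wants to compare their own boot options against the list above, here is a minimal sketch. The helper name missing_cmdline_opts is hypothetical; it only does a whole-word comparison against a command line string and says nothing about whether a given option actually has an effect on the freezes.

```shell
# Print each wanted option that is absent from the given kernel command line.
# $1 is the command line string, the remaining arguments are the options to look for.
missing_cmdline_opts() {
    cmdline="$1"
    shift
    for opt in "$@"; do
        case " $cmdline " in
            *" $opt "*) ;;                 # present as a whole word
            *) printf '%s\n' "$opt" ;;     # not found, report it
        esac
    done
}

# Example, commented out so that sourcing this snippet has no side effects:
# missing_cmdline_opts "$(cat /proc/cmdline)" \
#     split_lock_detect=off ignore_msrs=1 report_ignored_msrs=0
```

An empty output means all listed options are already on the running kernel's command line.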
I will post an update whenever I encounter any kind of issue with the 6.8 kernel.
Thanks for taking the time to test this and post an update. I’m really confident to try making the switch to 6.8 this weekend. (I need to make sure the latest version of kernel 6.8.x is actually installed correctly on Proxmox now. Since I’ve had 6.5.13-5 pinned, I get scary warnings every time it does a system update about a dpkg-configure failure.)
“My dmesg got filled with a lot of messages like i915 0000:00:02.0: VF{i} FLR in the meantime. That seems to be some Function Level Reset related to SR-IOV. Not sure if this is OK https://gist.github.com/scyto/e4e3de35ee23fdb4ae5d5a3b85c16ed3?permalink_comment_id=4714186#gistcomment-4714186 but at least I could not see any effect.”
I see these all the time on a working Proxmox 8.2.x (6.5.13-5)-based install on an HP Elite Mini 600 G9. It isn’t associated with any sort of performance glitches or issues on my system. I’ve just been ignoring it.
I carried out further tests with my previous setup and came to the following conclusion: With kernel 6.2.16-20-pve it runs stable without exception. With kernel 6.5.13-5-pve and various 6.8.x-pve kernels, freezes occur sooner or later. Through pasbec's last post (thanks for that) I became aware of the kernel command line options "split_lock_detect=off ignore_msrs=1 report_ignored_msrs=0", which I had not used before. I included them, and after that it has been running stable with the current kernel 6.8.8-3-pve for hours. However, I now see lots of dmesg entries like this:
[ 527.228951] ------------[ cut here ]------------
[ 527.228953] i915 0000:00:02.0: drm_WARN_ON(plane_state->ggtt_vma)
[ 527.228989] WARNING: CPU: 2 PID: 130 at /var/lib/dkms/i915-sriov-dkms/2024.07.19/build/drivers/gpu/drm/i915/display/intel_atomic_plane.c:135 intel_plane_destroy_state+0x93/0xe0 [i915]
[ 527.229097] Modules linked in: veth vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd dm_snapshot joydev input_leds hid_apple hid_generic usbkbd usbmouse usbhid hid apple_mfi_fastcharge ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables qrtr softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common x86_pkg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_hda_codec_hdmi soundwire_intel snd_sof_intel_hda_mlink kvm_intel soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp kvm snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core irqbypass mt7921e snd_soc_acpi_intel_match crct10dif_pclmul mt7921_common snd_soc_acpi polyval_clmulni soundwire_generic_allocation polyval_generic ghash_clmulni_intel mt792x_lib soundwire_bus sha256_ssse3 mt76_connac_lib sha1_ssse3 snd_soc_core aesni_intel snd_hda_codec_realtek mt76
[ 527.229127] snd_compress ac97_bus crypto_simd snd_hda_codec_generic cryptd snd_pcm_dmaengine snd_hda_intel mac80211 snd_intel_dspcfg snd_intel_sdw_acpi btusb snd_hda_codec btrtl btintel snd_hda_core btbcm snd_hwdep btmtk cmdlinepart snd_pcm cfg80211 spi_nor snd_timer rapl bluetooth pcspkr intel_cstate wmi_bmof libarc4 mei_me snd ecdh_generic mtd ecc ee1004 soundcore mei mei_vsc_hw intel_pmc_core intel_vsec pmt_telemetry pmt_class acpi_tad acpi_pad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap parport_pc ppdev lp parport efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq i915(OE) dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xe drm_gpuvm drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_ttm_helper ttm drm_display_helper xhci_pci nvme xhci_pci_renesas crc32_pclmul nvme_core i2c_i801 spi_intel_pci xhci_hcd ahci cec igc spi_intel i2c_smbus libahci nvme_auth rc_core video wmi
[ 527.229175] CPU: 2 PID: 130 Comm: kworker/2:1 Tainted: P U W OE 6.8.8-3-pve #1
[ 527.229177] Hardware name: Micro Computer(HK) Tech Limited Venus series/AHBNB, BIOS 1.0S 05/11/2024
[ 527.229178] Workqueue: events intel_atomic_helper_free_state_worker [i915]
[ 527.229264] RIP: 0010:intel_plane_destroy_state+0x93/0xe0 [i915]
[ 527.229363] Code: 4c 8b 6f 50 4d 85 ed 75 03 4c 8b 2f e8 96 2d 63 c6 48 c7 c1 d8 9b d0 c0 4c 89 ea 48 c7 c7 ac 94 d3 c0 48 89 c6 e8 8d 54 c4 c5 <0f> 0b 48 83 bb e0 00 00 00 00 74 89 49 8b 04 24 48 8b 78 08 4c 8b
[ 527.229365] RSP: 0018:ffffbd948058fd88 EFLAGS: 00010246
[ 527.229366] RAX: 0000000000000000 RBX: ffff9f722a751400 RCX: 0000000000000000
[ 527.229367] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 527.229367] RBP: ffffbd948058fda0 R08: 0000000000000000 R09: 0000000000000000
[ 527.229368] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9f722911f000
[ 527.229369] R13: ffff9f72022124f0 R14: 0000000000000005 R15: 0000000000000004
[ 527.229369] FS: 0000000000000000(0000) GS:ffff9f816f500000(0000) knlGS:0000000000000000
[ 527.229370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 527.229371] CR2: 00007e7956b96028 CR3: 0000000116438000 CR4: 0000000000f52ef0
[ 527.229372] PKRU: 55555554
[ 527.229372] Call Trace:
[ 527.229373]
Is it possible to find out the cause of this or what is the best way to fix the errors?
Oh what a pity, a few hours later another freeze with kernel 6.8.8-3-pve. So keep searching and hoping...
Sorry to hear that. You mentioned initially that kernel 6.5 had also been working before. Maybe it is worth checking whether kernel 6.5.13-3-pve is stable with the latest versions 2024.07.19 or 2024.07.24 (haven't tried the latter) of the dkms module. If not, you should test kernel 6.5.13-3-pve with the old version 6.1 of the dkms module (git reset --hard 42b49ff) again (make sure to delete /usr/src/i915-sriov-dkms before re-adding). If kernel 6.5.13-3-pve is freezing and it has not done that before, it must be something else.
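The rollback could be sketched as below. This only prints the commands instead of running them; the helper name print_rollback_steps is made up, the dkms commands are the ones quoted earlier in this thread, and passing 6.1 as the dkms version for the old commit is an assumption that should be checked against that commit's dkms.conf.

```shell
# Print (do not execute) the steps to rebuild the module from an older commit.
# $1 = git commit, $2 = dkms module version (assumed), $3 = target kernel.
print_rollback_steps() {
    commit="$1"
    version="$2"
    kernel="$3"
    cat <<EOF
git reset --hard $commit
rm -rf /usr/src/i915-sriov-dkms-* /var/lib/dkms/i915-sriov-dkms
dkms add .
dkms install -m i915-sriov-dkms -v $version -k $kernel --force
EOF
}
```

For example, print_rollback_steps 42b49ff 6.1 6.5.13-3-pve lists the steps for the old module on kernel 6.5.13-3-pve so they can be reviewed before being run as root in the repo directory.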
I too am unable to get an i7-12700K to work with a 6.8.12 kernel. 6.5.11 used to work fine, and going back to it, even with the most current version of the module, still works. One interesting thing I noticed is that even if I uninstall this module (which seems to work, because I get a message about 'max_vfs' being an unknown parameter), I still get the NULL pointer dereference with the 6.8 kernel.
```
[ 4.314982] Setting dangerous option enable_guc - tainting kernel
[ 4.314984] i915: unknown parameter 'max_vfs' ignored
[ 4.316075] i915 0000:00:02.0: [drm] VT-d active for gfx access
[ 4.316079] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 4.316082] fbcon: Taking over console
[ 4.316084] #PF: supervisor read access in kernel mode
[ 4.316086] #PF: error_code(0x0000) - not-present page
[ 4.316087] PGD 0 P4D 0
[ 4.316088] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 4.316090] CPU: 12 PID: 896 Comm: (udev-worker) Tainted: P U O 6.8.12-1-pve #1
[ 4.316092] Hardware name: ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI, BIOS 3802 08/08/2024
[ 4.316094] RIP: 0010:kernfs_find_and_get_ns+0x16/0x80
[ 4.316097] Code: ac fd ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 56 49 89 d6 41 55 49 89 f5 41 54 53 <48> 8b 47 08 48 89 fb 48 85 c0 48 0f 44 c7 4c 8b 60 50 49 83 c4 60
[ 4.316099] RSP: 0018:ffffb0a00090b488 EFLAGS: 00010246
[ 4.316101] RAX: 0000000000000000 RBX: ffffffff830f1d20 RCX: 0000000000000000
[ 4.316102] RDX: 0000000000000000 RSI: ffffffff830f1e68 RDI: 0000000000000000
[ 4.316103] RBP: ffffb0a00090b4a8 R08: 0000000000000000 R09: 0000000000000000
[ 4.316104] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d2dc32a2090
[ 4.316105] R13: ffffffff830f1e68 R14: 0000000000000000 R15: ffffffff84062520
[ 4.316106] FS: 00007664d0f6a8c0(0000) GS:ffff9d3cff600000(0000) knlGS:0000000000000000
[ 4.316108] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.316109] CR2: 0000000000000008 CR3: 0000000105a84002 CR4: 0000000000f70ef0
[ 4.316110] PKRU: 55555554
[ 4.316111] Call Trace:
[ 4.316113]
```
For what it's worth, I'm getting the same on 6.6.52 immediately after I run my window manager
kernel: BUG: kernel NULL pointer dereference, address: 000000000000057d
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
edit - tested 6.11.0, same thing.
FWIW, I'm using an i5-1240P, so 12th gen Xe Graphics.
This patch worked for me
diff --git a/drivers/gpu/drm/i915/display/intel_atomic_plane.c b/drivers/gpu/drm/i915/display/intel_atomic_plane.c
index 230e00c..d36a41b 100644
--- a/drivers/gpu/drm/i915/display/intel_atomic_plane.c
+++ b/drivers/gpu/drm/i915/display/intel_atomic_plane.c
@@ -1152,7 +1152,7 @@ intel_cleanup_plane_fb(struct drm_plane *plane,
 	if (!obj)
 		return;
The same problem occurs with pve kernel 6.5.13-5 and i915-sriov-dkms 2024.07.17:
i915-sriov-dkms/2024.07.17, 6.5.13-5-pve, x86_64: installed
I installed xfce4 and connected it to the monitor.
I am delighted that kernel 6.8 is now supported. Unfortunately, however, freezes occur after a short time under Proxmox with kernel 6.8.8-3-pve. The error entry with dmesg is:
[ 800.513559] BUG: kernel NULL pointer dereference, address: 000000000000057d
[ 800.513567] #PF: supervisor read access in kernel mode
[ 800.513569] #PF: error_code(0x0000) - not-present page
My Proxmox host is a Minisforum NAB6 with an i7-12650H processor and its integrated Intel UHD Graphics for 12th Gen Intel Processors. I had no problems of this kind with kernel versions 6.2 and 6.5.
It would be great if the error could be fixed, thanks.