intel / gvt-linux

Windows 10 VM running BlueIris has igfx driver crash every few days. #228

Open bheikes1 opened 1 year ago

bheikes1 commented 1 year ago

Greetings all,

Looking for some hints as to what might be the issue with my setup. I have a Windows 10 VM running BlueIris that started exhibiting igfx driver crashes approximately a month ago. Previously, this system was stable, with uptimes of several months and no issues.

Host system:
Proxmox 7.4-3
Kernels recently used: 6.2, 6.1, 5.19, 5.15, 5.13
Intel E-2186G, 128 GB RAM, Nvidia T1000, LSI HBA

VMs:
Ubuntu 22.04 running PiHole, no issues noted
TrueNAS Core, has LSI HBA passed through, no issues noted
Ubuntu 22.04 running Portainer, has Nvidia T1000 passed through, no issues noted
Windows 10 22H2, has Intel iGPU P630 passed through (GVT-d), igfx driver crashes every few days

This setup has been in place for approximately a year with virtually no issues until approximately a month ago (March 8th, from my notes). In the last week or so, I've worked my way through Linux kernels 5.19, 6.1, and 6.2, and also tried GVT-g to see if I could stop the igfx driver crashes. With GVT-g, when the crash happens the VM stops responding completely and causes issues on the host as well, necessitating a host reboot. With GVT-d, only the VM needs to be rebooted.

Under the 6.1 and 6.2 (and perhaps 5.19) kernels using GVT-g, I get syslog entries (on the host) like this when a crash happens:

Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail to flush post shadow
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail to dispatch workload, skip
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c000
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c008
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c010
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c018
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c020

and

Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c948
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 17 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 15 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6ca80
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
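
For anyone comparing their own logs against the gvt entries above, here is a rough sketch of how the flood could be condensed into counts. The helper name `summarize_gvt` and the regular expressions are hypothetical and derived only from the lines quoted in this issue, so other kernels may word these messages differently. It also reports the gap between successive `gpa` values; in the first excerpt above they advance by 8 bytes each time, which may mean consecutive 64-bit entries are being written, but that is only a guess.

```python
import re
from collections import Counter

# Hypothetical patterns, based only on the syslog lines quoted in this issue.
ENTRY_START = re.compile(r"(?=[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} )")
MISSED = re.compile(r"systemd-journald\[\d+\]: Missed (\d+) kernel messages")
GPA = re.compile(r"gvt: guest page write error, gpa ([0-9a-f]+)")
HEX_VALUE = re.compile(r"\b(?:0x)?[0-9a-f]{8,}\b")


def summarize_gvt(text):
    """Tally gvt messages in syslog text, one entry per line or run together."""
    kinds = Counter()
    missed = 0
    gpas = []
    # Split in front of every syslog timestamp (zero-width split, Python 3.7+).
    for raw in ENTRY_START.split(text):
        entry = " ".join(raw.split())  # undo any wrapping inside an entry
        if not entry:
            continue
        m = MISSED.search(entry)
        if m:
            missed += int(m.group(1))
            continue
        m = GPA.search(entry)
        if m:
            gpas.append(int(m.group(1), 16))
            kinds["guest page write error"] += 1
            continue
        if "gvt: " in entry:
            message = entry.split("gvt: ", 1)[1].rstrip(".")
            # Replace long hex values so identical message shapes group together.
            kinds[HEX_VALUE.sub("<addr>", message)] += 1
    # Byte distance between successive faulting guest physical addresses.
    strides = [b - a for a, b in zip(gpas, gpas[1:])]
    return kinds, missed, strides
```

Calling `summarize_gvt(open("/var/log/syslog").read())` (the path is an assumption about the host) returns the per-message counts, the total number of kernel messages journald reported as missed, and the list of `gpa` strides.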

Under 6.2 and 6.1 using GVT-d, I get messages like this when a crash happens:

Mar 26 07:20:45 pve kernel: DMAR: DRHD: handling fault status reg 3
Mar 26 07:20:45 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffffb8024c046000 [fault reason 0x07] Next page table ptr is invalid
Mar 29 12:08:47 pve kernel: DMAR: DRHD: handling fault status reg 3
Mar 29 12:08:47 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff8004014b4000 [fault reason 0x07] Next page table ptr is invalid
Mar 31 05:48:36 pve kernel: DMAR: DRHD: handling fault status reg 3
Mar 31 05:48:36 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800417686000 [fault reason 0x07] Next page table ptr is invalid
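
To track how often the DMAR faults above recur, something like the following sketch could pull the timestamp, device, fault address, and fault reason out of each entry. The function name `parse_faults` and the regular expression are hypothetical and match only the message format quoted in this issue. Syslog timestamps carry no year, so the caller has to supply one, and the month names are assumed to be English abbreviations.

```python
import re
from datetime import datetime

# Hypothetical pattern, matching only the DMAR fault format quoted in this issue.
FAULT = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: DMAR: "
    r"\[DMA (?P<op>Read|Write) NO_PASID\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>0x[0-9a-f]+) \[fault reason (?P<reason>0x[0-9a-f]+)\]"
)


def parse_faults(text, year):
    """Return one dict per DMAR fault entry found in the syslog text."""
    text = " ".join(text.split())  # tolerate wrapped or run-together entries
    faults = []
    for m in FAULT.finditer(text):
        faults.append({
            # The year is supplied by the caller; syslog does not record it.
            "when": datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S"),
            "op": m["op"],
            "device": m["dev"],
            "addr": int(m["addr"], 16),
            "reason": int(m["reason"], 16),
        })
    return faults


def intervals(faults):
    """Time between consecutive faults, as timedelta objects."""
    return [b["when"] - a["when"] for a, b in zip(faults, faults[1:])]
```

Feeding the host syslog through `parse_faults` and then `intervals` gives the gaps between crashes, which makes it easier to compare uptimes across kernel versions.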

I'm trying out older kernels now (currently 5.13) to see if there is any appreciable difference. I do realize that I am running quite a complicated system, and might be bumping up against an edge case.

Any thoughts?

bheikes1 commented 1 year ago

I did some testing on Linux kernel 5.13 over the last month, and the behavior noted above was completely resolved.

Moving up to kernel 5.15 now, since it's actually being maintained.

bheikes1 commented 1 year ago

Running on 5.15, I was able to get about 3 weeks out of the system before I noticed this in the syslog, and a crashed video driver on the Win10 guest.

May 16 05:26:30 pve kernel: DMAR: DRHD: handling fault status reg 3
May 16 05:26:30 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffffb8024cff5000 [fault reason 0x07] Next page table ptr is invalid

bheikes1 commented 1 year ago

Same setup and versions as last time; it looks like the same error.

May 26 03:05:48 pve kernel: DMAR: DRHD: handling fault status reg 3
May 26 03:05:48 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffffb8021a516000 [fault reason 0x07] Next page table ptr is invalid

bheikes1 commented 1 year ago

Same setup as before.

Jun 12 02:50:44 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800401e73000 [fault reason 0x07] Next page table ptr is invalid
Jun 12 02:50:44 pve kernel: DMAR: DRHD: handling fault status reg 2
Jun 12 02:50:44 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800500e73000 [fault reason 0x07] Next page table ptr is invalid
Jun 12 02:50:44 pve kernel: DMAR: DRHD: handling fault status reg 2
Jun 12 02:50:44 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800401e70000 [fault reason 0x07] Next page table ptr is invalid

tpressure commented 3 months ago

Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c010
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c018

For these kinds of errors, you can try the workaround I've posted here: https://github.com/intel/gvt-linux/issues/153#issuecomment-1047603809

It's not a 100% solution though. Check the comments in https://github.com/intel/gvt-linux/issues/153