intel / gvt-linux

Win10 vm keeps crashing using kvmgt #125

Open velde666 opened 4 years ago

velde666 commented 4 years ago

Hi there,

I have been using kvmgt on an Intel NUC (Intel HD Graphics 620) running CentOS 7 for several months with several kernels (CentOS 7 standard 3.x, self-compiled 5.x and now 5.4.0-rc7-01779-g74c926f-dirty from this project), and I always have the issue that the Windows 10 VM keeps crashing more or less often. Sometimes there is more than a week between crashes, sometimes just hours or minutes.

Win10 is v1903
Win GPU driver is 26.20.100.7000 (everything newer does not work)
CentOS is v7.7.1908
kvm-qemu-ev is 2.12.0-33.1
firmware files for i915 are fresh from yesterday (https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915)

When the VM crashes I get gazillions of messages like these:

Nov 18 15:54:51 floor13 kernel: gvt: vgpu 1: fail: shadow page 0000000049e48f88 guest entry 0x6735b2906735b29 type 9
Nov 18 15:54:51 floor13 kernel: gvt: vgpu 1: fail: spt 00000000c918a2ce guest entry 0x6735b2906735b29 type 9
Nov 18 15:54:51 floor13 kernel: gvt: vgpu 1: fail: shadow page 00000000c918a2ce guest entry 0x6735b2906735b29 type 9.
Nov 18 15:54:51 floor13 kernel: gvt: guest page write error, gpa 1a2295000
Nov 18 15:54:51 floor13 kernel: gvt: vgpu 1: fail: shadow page 0000000049e48f88 guest entry 0x6735b2906735b29 type 9
Nov 18 15:54:51 floor13 kernel: gvt: vgpu 1: fail: spt 00000000c918a2ce guest entry 0x6735b2906735b29 type 9
Nov 18 15:54:51 floor13 kernel: gvt: vgpu 1: fail: shadow page 00000000c918a2ce guest entry 0x6735b2906735b29 type 9.
Nov 18 15:54:51 floor13 kernel: gvt: guest page write error, gpa 1a2295008
Nov 18 15:54:51 floor13 kernel: gvt: vgpu 1: fail: shadow page 0000000049e48f88 guest entry 0x6735b2906735b29 type 9
Nov 18 15:54:51 floor13 kernel: gvt: vgpu 1: fail: spt 00000000c918a2ce guest entry 0x6735b2906735b29 type 9
Nov 18 15:54:51 floor13 kernel: gvt: vgpu 1: fail: shadow page 00000000c918a2ce guest entry 0x6735b2906735b29 type 9.
Nov 18 15:54:51 floor13 kernel: gvt: guest page write error, gpa 1a2295010

ending with

Nov 18 15:54:52 floor13 kernel: gvt: vgpu 1: fail to flush post shadow
Nov 18 15:54:52 floor13 kernel: gvt: vgpu 1: fail to dispatch workload, skip
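
Because these messages repeat thousands of times, it can help to condense a saved log before attaching or comparing it. The following is a minimal sketch, assuming the message wording quoted in this issue; the function name `summarize_gvt_errors` and the regular expressions are illustrative and not part of any GVT-g tooling, and other kernel versions may word the messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in this issue (assumption:
# other kernel versions may format them differently).
SHADOW_RE = re.compile(
    r"gvt: vgpu (\d+): fail: (?:shadow page|spt) ([0-9a-f]+) "
    r"guest entry (0x[0-9a-f]+) type (\d+)"
)
GPA_RE = re.compile(r"guest page write error, gpa ([0-9a-f]+)")


def summarize_gvt_errors(text):
    """Condense repeated GVT-g failure messages into a small summary.

    Returns a dict with:
      guest_entries: {(vgpu id, guest entry, type): number of occurrences}
      gpas: sorted list of distinct guest physical addresses (as ints)
    """
    entries = Counter()
    # finditer works on the whole text, so it does not matter whether the
    # log has one message per line or several messages run together.
    for match in SHADOW_RE.finditer(text):
        vgpu, _shadow_page, guest_entry, entry_type = match.groups()
        entries[(int(vgpu), guest_entry, int(entry_type))] += 1
    gpas = sorted({int(m.group(1), 16) for m in GPA_RE.finditer(text)})
    return {"guest_entries": dict(entries), "gpas": gpas}
```

Passing the contents of a saved log (for example the attached messages.txt) to `summarize_gvt_errors` gives one count per distinct guest entry and the list of affected guest physical addresses, which is easier to compare across crashes than the raw output.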

After that I see kernel traces starting with

Nov 18 15:57:23 floor13 kernel: INFO: task gvt_service_thr:289 blocked for more than 122 seconds.

There is no dump generated in /sys/class/drm/card0/error:

[root@floor13 ~]# cat /sys/class/drm/card0/error
No error state collected
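
To catch an error state as soon as one does appear, the same check can be scripted. This is a rough sketch, assuming the sysfs path and the "No error state collected" text shown above; the helper name `read_gpu_error_state` and its parameters are made up for illustration.

```python
from pathlib import Path

# Text reported above when the driver has captured nothing.
NO_ERROR_MARKER = "No error state collected"


def read_gpu_error_state(card="card0", sysfs_root="/sys/class/drm"):
    """Return the captured error state text, or None if there is none.

    None is also returned when the file is missing or unreadable.
    """
    path = Path(sysfs_root) / card / "error"
    try:
        text = path.read_text(errors="replace")
    except OSError:
        return None
    stripped = text.strip()
    if not stripped or stripped.startswith(NO_ERROR_MARKER):
        return None
    return text
```

Run periodically (as root, since the file may not be world-readable), a non-None result could be written to disk right away so that the state is preserved for a bug report.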

I will attach all messages to this issue and appreciate any help on this :)

Best regards

messages.txt

melyux commented 4 years ago

Ever get this solved? I'm having this problem. Can't keep a VM up for more than a couple days at best.

velde666 commented 4 years ago

Hi @melyux

I am still struggling with this even though I stabilized my Win VM to > 90% I would say.

This is what I have done:

I also tried newer kernels directly from kernel.org (5.5.13 and 5.6.3), but 5.5.13 f**ked up my wifi and 5.6.3 crashed the Windows VM as before, although the above-mentioned parameters were not changed. I am wondering whether the changes in gvt-linux are being integrated into the standard kernel.

Additionally I updated CentOS 7 to 8 in-place (waaaaahhhh) and Win 10 to 1909. But I don't think this stabilized the Win VM in any way.

Best regards

reedog117 commented 4 years ago

Similar issue - Proxmox VE 6.2 host with 5.4.41-1 kernel. Coffee Lake ER Xeon processor with Intel UHD 630 Graphics. Windows GPU driver is the newest available. It seems my problems recur faster the more VMs I allocate. I am using 128MB vGPUs for each guest with a total aperture of 512MB and no more than 3 guest VMs in use at a time.

Boot flags include:

kvm.ignore_msrs=1
i915.enable_execlists=0
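
When comparing setups across reports like these, it is worth confirming that the flags actually reached the running kernel. The snippet below is a small sketch that parses a kernel command line string; `parse_cmdline` and `check_flags` are hypothetical helper names, and values containing quoted spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}.

    Bare flags such as "quiet" map to None. Quoted values containing
    spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def check_flags(cmdline, expected):
    """Return {name: True/False} for each expected name=value pair."""
    params = parse_cmdline(cmdline)
    return {name: params.get(name) == value for name, value in expected.items()}


# The two flags listed above.
EXPECTED = {"kvm.ignore_msrs": "1", "i915.enable_execlists": "0"}
```

On a running host the input would be the contents of /proc/cmdline, e.g. `check_flags(open("/proc/cmdline").read(), EXPECTED)`. Options set through modprobe configuration files instead of the boot loader do not show up there.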

I get the above page fault errors plus the following:

May 20 21:55:01 virt-slc-11 kernel: [29241.213006] Call Trace:
May 20 21:55:01 virt-slc-11 kernel: [29241.213007]  __schedule+0x2e6/0x6f0
May 20 21:55:01 virt-slc-11 kernel: [29241.213009]  schedule+0x33/0xa0
May 20 21:55:01 virt-slc-11 kernel: [29241.213010]  schedule_preempt_disabled+0xe/0x10
May 20 21:55:01 virt-slc-11 kernel: [29241.213011]  __mutex_lock.isra.10+0x2c9/0x4c0
May 20 21:55:01 virt-slc-11 kernel: [29241.213026]  ? kvm_arch_vcpu_put+0xe2/0x170 [kvm]
May 20 21:55:01 virt-slc-11 kernel: [29241.213028]  __mutex_lock_slowpath+0x13/0x20
May 20 21:55:01 virt-slc-11 kernel: [29241.213029]  mutex_lock+0x2c/0x30
May 20 21:55:01 virt-slc-11 kernel: [29241.213049]  intel_vgpu_emulate_mmio_write+0x68/0x220 [i915]
May 20 21:55:01 virt-slc-11 kernel: [29241.213050]  intel_vgpu_rw+0xb3/0x1f0 [kvmgt]
May 20 21:55:01 virt-slc-11 kernel: [29241.213052]  intel_vgpu_write+0x16e/0x200 [kvmgt]
May 20 21:55:01 virt-slc-11 kernel: [29241.213053]  vfio_mdev_write+0x22/0x30 [vfio_mdev]
May 20 21:55:01 virt-slc-11 kernel: [29241.213054]  vfio_device_fops_write+0x26/0x30 [vfio]
May 20 21:55:01 virt-slc-11 kernel: [29241.213055]  __vfs_write+0x1b/0x40
May 20 21:55:01 virt-slc-11 kernel: [29241.213056]  vfs_write+0xab/0x1b0
May 20 21:55:01 virt-slc-11 kernel: [29241.213057]  ksys_pwrite64+0x66/0xa0
May 20 21:55:01 virt-slc-11 kernel: [29241.213058]  __x64_sys_pwrite64+0x1e/0x20
May 20 21:55:01 virt-slc-11 kernel: [29241.213059]  do_syscall_64+0x57/0x190
May 20 21:55:01 virt-slc-11 kernel: [29241.213060]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 20 21:55:01 virt-slc-11 kernel: [29241.213061] RIP: 0033:0x7f7cb6674edf
May 20 21:55:01 virt-slc-11 kernel: [29241.213062] Code: Bad RIP value.
May 20 21:55:01 virt-slc-11 kernel: [29241.213062] RSP: 002b:00007f7aa7ff97a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
May 20 21:55:01 virt-slc-11 kernel: [29241.213063] RAX: ffffffffffffffda RBX: 0000000000000023 RCX: 00007f7cb6674edf
May 20 21:55:01 virt-slc-11 kernel: [29241.213064] RDX: 0000000000000004 RSI: 00007f7aa7ff97f8 RDI: 0000000000000023
May 20 21:55:01 virt-slc-11 kernel: [29241.213064] RBP: 00007f7aa7ff97f8 R08: 0000000000000000 R09: 00000000ffffffff
May 20 21:55:01 virt-slc-11 kernel: [29241.213065] R10: 000000000000a278 R11: 0000000000000293 R12: 0000000000000004
May 20 21:55:01 virt-slc-11 kernel: [29241.213065] R13: 000000000000a278 R14: 00007f7aa42817c0 R15: 00007f7aa42816f0

reedog117 commented 4 years ago

I found additional debug information, and I have a feeling it may be a PPGTT issue with newer processors. I'm covering this in #153.

elurex commented 3 years ago

I am having the same issue and I am using PVE 6.3-3

[ 2741.241926] gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xfffff80520c07ecf type 9
[ 2741.242392] gvt: vgpu 1: fail: spt 00000000d7d4221d guest entry 0xfffff80520c07ecf type 9
[ 2741.242855] gvt: vgpu 1: fail: shadow page 00000000d7d4221d guest entry 0xfffff80520c07ecf type 9.
[ 2741.243324] gvt: guest page write error, gpa 1513e9c78

and I am seeing a trace similar to the one posted by reedog117

hardwareadictos commented 3 years ago

Same situation here:

gvt: guest page write error, gpa 11795afb8
kernel: gvt: guest page write error, gpa 11795afb8
kernel: gvt: guest page write error, gpa 11795aff8
kernel: guest page write error, gpa 11795aff8
kernel: gvt: guest page write error, gpa 11795aff8
kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0x6d006100720067 type 8
kernel: gvt: vgpu 1: fail: shadow page 000000004db426e4 guest entry 0x6d006100720067 type 8
kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xcc0c81 type 9
kernel: gvt: vgpu 1: fail: spt 00000000f6915228 guest entry 0xcc0c81 type 9
kernel: gvt: vgpu 1: fail: shadow page 00000000f6915228 guest entry 0xcc0c81 type 9.
kernel: gvt: vgpu 1: fail to flush post shadow
kernel: gvt: vgpu 1: fail to dispatch workload, skip
kernel: gvt: vgpu(1) Invalid FORCE_NONPRIV write 2341 at offset 24d8
kernel: gvt: vgpu(1) Invalid FORCE_NONPRIV write 2351 at offset 24dc
gvt: vgpu(1) Invalid FORCE_NONPRIV write 10000d82 at offset 24e0
kernel: gvt: vgpu(1) Invalid FORCE_NONPRIV write 10064844 at offset 24e4
kernel: gvt: vgpu(1) Invalid FORCE_NONPRIV write 4000b118 at offset 24f0

The VM just freezes a few minutes after boot on Win 10.

tpressure commented 3 years ago

This is still an issue even with the most recent 5.10 kernel. I can see this issue on all Intel CPUs with an IGP from Gen6 to Gen10.

rugubara commented 3 years ago

I'm still seeing this on the 5.12.13 kernel. The guest continues to run, but the display is frozen.

июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804090
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: spt 00000000680cc782 guest entry 0xffffffffffffffff type 9
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: shadow page 00000000680cc782 guest entry 0xffffffffffffffff type 9.
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804090
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: spt 00000000680cc782 guest entry 0xffffffffffffffff type 9
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: shadow page 00000000680cc782 guest entry 0xffffffffffffffff type 9.
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804090
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: spt 00000000680cc782 guest entry 0xffffffffffffffff type 9
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: shadow page 00000000680cc782 guest entry 0xffffffffffffffff type 9.
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804090
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: spt 00000000680cc782 guest entry 0xffffffffffffffff type 9
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: shadow page 00000000680cc782 guest entry 0xffffffffffffffff type 9.
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804090
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804000
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804010
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804020
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804030
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804040
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804050
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804060
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804070
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804080
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804090
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b8040a0
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b8040b0
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b8040c0
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b8040d0
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b8040e0
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b8040f0
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804100
июн 25 22:57:03 PF16W6Y2 kernel: gvt: guest page write error, gpa 2b804108
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xf0f0f0f0f0f0f0f type 9
июн 25 22:57:03 PF16W6Y2 kernel: gvt: vgpu 1: fail: spt 00000000680cc782 guest entry 0xf0f0f0f0f0f0f0f type 9

TerrenceXu commented 2 years ago

Please try again with driver 30.0.100.9684 (https://downloadcenter.intel.com/download/30579/Intel-Graphics-Windows-DCH-Drivers). On our side it is stable with kernel 5.11 (5.12 has a regression, #188, and the bug fix patch has not been upstreamed yet).

patrykk commented 2 years ago

Hi. Please test my working solution: https://github.com/intel/gvt-linux/issues/188#issuecomment-955584215