intel / gvt-linux

Intel gvt-g with nvidia prime render offload causes several issues #162

Open grazzolini opened 4 years ago

grazzolini commented 4 years ago

I started using intel gvt-g a few months ago and, at the time, I was also using the nvidia prime render offload method for offloading to the nvidia card. I had issues getting gvt-g to work on any recent kernel (> 5.5), but I had success using the 5.4 lts branch on Arch Linux.

Very recently, however, with version 5.4.51 of the lts kernel, as well as the mesa 20.1.3 update, my VM with gvt-g would boot normally, but when X started it would hang, causing 100% CPU usage on one of the host's CPUs, with a blank screen.

Reverting to kernel 5.4.50 and the previous mesa didn't help. This is where a recent nvidia update, 450.57, comes in: it was responsible for this new behavior of making my VM hang.

During my investigation, I found that, when using nvidia prime render offload, the VM using the virtualized intel card would work up until the previously mentioned software versions, but it would not display the mouse pointer properly, and I could only see the boot from early userspace onward: no POST, no bootloader, nothing.

So I ran a few experiments using bbswitch, not only to keep the nvidia card powered off but also to prevent the nvidia modules from being loaded, making sure prime render offload is not used. That gave me garbled graphics, as in #152, so I added the workaround. Not only is the screen no longer garbled, but I can also see the POST messages.

Also, I'm able to run with gvt-g on the latest mainline kernel, 5.7.8, without any issues. As long as I'm not using the prime render offload method, everything works. I have no idea why this interferes with the virtualized card, but my bet is on some weird mesa/xorg issue.

grazzolini commented 4 years ago

I have done a new round of testing in light of recent mesa updates and the new 5.8 kernel. Here are the results:

When using nvidia prime render offload:

linux 5.8.0: The VM boots, no BIOS is displayed (with or without the MESA_LOADER_DRIVER_OVERRIDE workaround), and only the output from early userspace onward is shown. When trying to start X, though, qemu segfaults and the machine is stopped.

I've tested both kernel 5.8 and 5.4 on the guest and it seems the kernel that is running on the guest doesn't matter for reproducibility of this issue.

linux 5.4.57: The VM boots, no BIOS can be seen (with or without the MESA_LOADER_DRIVER_OVERRIDE workaround), and X starts, but then the machine hangs at 100% CPU and needs to be destroyed. Once again, the kernel running on the guest seems to play no part in what happens.

When using bbswitch + bumblebee:

linux 5.8.0: The VM boots and X can be started normally. Even the BIOS can be seen (with the MESA_LOADER_DRIVER_OVERRIDE workaround). But if the VM is stopped and the host is suspended, the VM cannot be started again; it stays stuck right after loading the initramfs (when it loads the i915 module). Again, the kernel being run on the guest plays no part in this. Tried both 5.8.0 and 5.4.57.

linux 5.4.57: It's the only combination that works flawlessly. To avoid garbled graphics, the MESA_LOADER_DRIVER_OVERRIDE workaround is needed. But the host can be suspended after the VM is stopped, and after resuming the host, the VM can be started again.

Suspending the host while the VM is powered on has never worked properly for me, regardless of kernel version or whether prime render offload is in use. Either the VM crashes when the host is suspended, or X crashes on the guest VM.
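
For readers unfamiliar with the workaround referred to above: MESA_LOADER_DRIVER_OVERRIDE is a Mesa environment variable that forces a particular driver to be loaded. The comments above do not spell out the value used, so the sketch below assumes `i965` (the older Intel driver, as opposed to `iris`), and `run_with_i965` is a hypothetical helper name. It only affects a process launched from this shell; a QEMU process started by libvirt would need the variable passed in through its own configuration.

```shell
#!/bin/sh
# Assumption: the workaround selects the older i965 driver instead of iris.
# run_with_i965 is a hypothetical helper, not part of Mesa or QEMU.
run_with_i965() {
    # The assignment applies only to the command being launched.
    MESA_LOADER_DRIVER_OVERRIDE=i965 "$@"
}

# Show the variable as the child process sees it.
run_with_i965 sh -c 'echo "driver override: $MESA_LOADER_DRIVER_OVERRIDE"'
```

Scoping the variable to a single command keeps the rest of the desktop session on its default driver, which makes it easier to compare behaviour with and without the override.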

rnd-ash commented 3 years ago

I can confirm issues with my i7 8750hq and GTX 1060 Maxq setup on my Razer blade 2018 (Kernel 5.8.14-arch1-1):

With the NVIDIA GPU enabled: when starting up the VM, my laptop screen goes black and all the external monitors change their positions. Also, all UI elements on both the VM and host become messed up when the mouse hovers over them.

grazzolini commented 3 years ago

> I can confirm issues with my i7 8750hq and GTX 1060 Maxq setup on my Razer blade 2018 (Kernel 5.8.14-arch1-1):
>
> With NVIDIA GPU enable: Starting up the VM my laptop screen goes black, and all the external monitors change their positions. Also, all UI elements on both the VM and host become messed up when the mouse hovers over them.

I never got this kind of issue; the VM wasn't able to affect the host. Then again, I didn't try again with newer kernels. It's harder to troubleshoot this, since the nvidia driver is a blob.

rnd-ash commented 3 years ago

Here is a link showing it actually messing up as the VM turns on: https://streamable.com/75f91k

No segfaults or anything else reported in dmesg.

It's even stranger: once the display crashes, the mouse can move my entire Xorg desktop to the left or right...?

TinaZhangZW commented 3 years ago

The dma-buf display provided by gvt-g aims to provide a way for the igfx or the CPU to access the guest framebuffer without any copy. It seems that this issue is caused by letting another vendor's GPU consume the gvt-g vGPU's framebuffer, which isn't supported by gvt-g.

grazzolini commented 3 years ago

@TinaZhangZW So, this is happening because that's precisely what the nvidia driver does in order to display things using the iGPU's (intel) framebuffer and, for some reason, it's also trying to use the vGPU one, causing the corruption? In that case, I think the fix should be done on nvidia's side. But is there really no way to prevent this from happening from the vGPU side?

TinaZhangZW commented 3 years ago

Yes, the information in the gvt-g vGPU's framebuffer can only be understood correctly by i915, as there are some vendor-specific formats which are only supported by i915. Since the guest treats the vGPU as a fully functional iGFX, there's no way to stop the guest from using those i915-specific formats.

I think the solution might be in host user space. If the consumer of the vGPU's framebuffer in host user space (e.g. the QEMU UI) first copies the vGPU's framebuffer into another framebuffer that has a general format (e.g. something like RGBA8888) and can be recognized by any GPU, then the problem could be solved. However, this will introduce an extra buffer copy.

rugubara commented 3 years ago

I've posted a note on the nvidia dev forum about this issue: https://forums.developer.nvidia.com/t/issues-with-igvt-g-and-virtual-machines/157923. Everybody's welcome to reply to let nvidia know I'm not a lone freak suffering from this.

rugubara commented 3 years ago

Recently I stumbled upon this issue with the nouveau drivers managing my optimus Quadro 2000 card. As long as the nouveau module is loaded, the behaviour is exactly the same: no boot animation, and the guest crashes as soon as the mouse pointer enters the window if the guest has initialized the intel igfx driver.

grazzolini commented 3 years ago

I haven't tested with nouveau. But if this is also happening with it, then the issue isn't in the nvidia proprietary driver. There might be some element common to both that interacts badly with the intel gvt-g framebuffer. My guess, from my investigations so far, is that if X is using two cards (I wouldn't be surprised if this also manifested with an AMD card as dGPU) and you create a vGPU and start a guest, the VM framebuffer is accessing/writing to a region of memory it shouldn't. We could try to get a full stacktrace from qemu, because on the more recent kernel the guest still segfaults immediately.

evelikov commented 3 years ago

@grazzolini sounds like there may be multiple bugs here:

Userspace: Visual changes when using MESA_LOADER_DRIVER_OVERRIDE - indicates a regression in the "iris" driver relative to the "i965" one. Capture an apitrace of the issue and report it to mesa

Kernel: 5.8.0 vs 5.4.57 regression (under BB) - consider bisecting the kernel to the commits which caused the breakage

nvidia and/or X stack: Please try nouveau, as you mentioned. Additionally having a) crash dump and b) gdb output for the 5.8.0 and 5.4.57 cases (respectively) will give devs some ideas what is happening.

Aside: bbswitch + bumblebee - for the sake of simplicity, ignore bbswitch and consider using only the rendering part of bumblebee (virtualgl or primus) while experimenting with both rendering options
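
A minimal sketch of how the crash dump and gdb output requested above might be collected on a systemd host, assuming `coredumpctl` and `gdb` are installed. `dump_all_backtraces` is a hypothetical helper name, and the PID and binary path in the usage line are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: extract a stored core dump by PID and print a backtrace
# for every thread, not only the one that received the signal.
dump_all_backtraces() {
    pid=$1    # PID of the crashed process, as shown by `coredumpctl list`
    exe=$2    # path to the binary that crashed
    core=$(mktemp) || return 1

    # Write the stored core dump to a regular file.
    coredumpctl dump "$pid" --output="$core" || { rm -f "$core"; return 1; }

    # Batch mode: run the given commands, then exit.
    gdb -batch \
        -ex 'info sharedlibrary' \
        -ex 'thread apply all bt' \
        "$exe" "$core"
    status=$?

    rm -f "$core"
    return "$status"
}

# Usage (placeholders):
#   dump_all_backtraces <PID> /usr/bin/qemu-system-x86_64
```

Listing the loaded shared libraries next to the per-thread backtraces can help tell whether a faulting address falls inside a known library or in an anonymous mapping.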

rugubara commented 3 years ago

Backtrace: I've compiled qemu with debug symbols and I'm quite surprised by this stack trace. The core is 666M. This is the crash with X11/nvidia-drivers: core.qemu-system-x86.77.ea9bb6ae1f524607977258d1919762ae.2153795.1607849262000000.gz

PF16W6Y2 /usr/src/linux # coredumpctl debug 2153795
           PID: 2153795 (qemu-system-x86)
           UID: 77 (qemu)
           GID: 77 (qemu)
        Signal: 11 (SEGV)
     Timestamp: Sun 2020-12-13 11:47:42 MSK (1min 19s ago)
  Command Line: /usr/bin/qemu-system-x86_64 -name guest=W10-UEFI-Personal,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-W10-UEFI-Personal/master-key.aes -blockdev {"driver":"file","filename":"/usr/share/edk2-ovmf/OVMF_CODE.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"} -blockdev {"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/W10-UEFI-Personal_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"} -machine pc-q35-3.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format,memory-backend=pc.ram -cpu qemu64,hv-time,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,hv-vpindex,hv-runtime,hv-synic,hv-stimer,hv-tlbflush,hv-ipi,hv-evmcs -m 4096 -object memory-backend-ram,id=pc.ram,size=4294967296 -overcommit mem-lock=off -smp 4,sockets=1,dies=1,cores=1,threads=4 -uuid e7a2eb46-e551-4a5f-a355-441b4fba9b2f -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=28,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=0 -boot menu=off,strict=on -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 -device pcie-root-port,port=0x8,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 -device pcie-root-port,port=0x9,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x1 -device pcie-root-port,port=0xa,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x2 -device pcie-pci-bridge,id=pci.6,bus=pci.5,addr=0x0 -device qemu-xhci,p2=15,p3=15,id=usb,bus=pcie.0,addr=0x5 -device 
virtio-scsi-pci,id=scsi0,num_queues=4,bus=pcie.0,addr=0x9 -device virtio-serial-pci,id=virtio-serial0,bus=pcie.0,addr=0x7 -blockdev {"driver":"host_device","filename":"/dev/zvol/fast/VM/storage/W10-UEFI-Personal/Data","aio":"io_uring","node-name":"libvirt-3-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-3-format","read-only":false,"discard":"unmap","detect-zeroes":"off","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-3-storage"} -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=3,device_id=drive-scsi0-0-0-3,drive=libvirt-3-format,id=scsi0-0-0-3,write-cache=on -blockdev {"driver":"host_device","filename":"/dev/zvol/fast/VM/storage/W10-UEFI-Personal/W10-1903","aio":"io_uring","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","detect-zeroes":"off","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"} -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=5,device_id=drive-scsi0-0-0-5,drive=libvirt-2-format,id=scsi0-0-0-5,bootindex=2,write-cache=on -blockdev {"driver":"file","filename":"/var/lib/libvirt/images/virtio-win.iso","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":true,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"} -device ide-cd,bus=ide.1,share-rw=on,drive=libvirt-1-format,id=sata0-0-1,write-cache=on -fsdev local,security_model=mapped,id=fsdev-fs0,path=/home/Media -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=h:,bus=pcie.0,addr=0xb -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:26:89:21,bus=pci.1,addr=0x0 -chardev pty,id=charserial0 -device 
isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -chardev spiceport,id=charchannel1,name=org.spice-space.webdav.0 -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.spice-space.webdav.0 -tpmdev emulator,id=tpm-tpm0,chardev=chrtpm -chardev socket,id=chrtpm,path=/run/libvirt/qemu/swtpm/1-W10-UEFI-Personal-swtpm.sock -device tpm-tis,tpmdev=tpm-tpm0,id=tpm0 -device usb-tablet,id=input2,bus=usb.0,port=1 -spice port=0,disable-ticketing,gl=on,rendernode=/dev/dri/renderD129,seamless-migration=on -device ich9-intel-hda,id=sound0,bus=pcie.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/e7a2eb46-e551-4a5f-a355-441b4fba9b2f,display=on,bus=pci.2,addr=0x0 -device virtio-balloon-pci,id=balloon0,bus=pcie.0,addr=0x8 -set device.hostdev0.x-igd-opregion=on -set device.hostdev0.romfile=/var/lib/libvirt/images/vbios_gvt_uefi.rom -set device.hostdev0.ramfb=on -set device.hostdev0.driver=vfio-pci-nohotplug -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
    Executable: /usr/bin/qemu-system-x86_64
 Control Group: /machine.slice/machine-qemu\x2d1\x2dW10\x2dUEFI\x2dPersonal.scope
          Unit: machine-qemu\x2d1\x2dW10\x2dUEFI\x2dPersonal.scope
         Slice: machine.slice
       Boot ID: ea9bb6ae1f524607977258d1919762ae
    Machine ID: 2c7f8e45cda78c341d153c0f5bc203b4
      Hostname: PF16W6Y2
       Storage: /var/lib/systemd/coredump/core.qemu-system-x86.77.ea9bb6ae1f524607977258d1919762ae.2153795.1607849262000000
       Message: Process 2153795 (qemu-system-x86) of user 77 dumped core.

GNU gdb (Gentoo 10.1 vanilla) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/qemu-system-x86_64...
Reading symbols from /usr/lib/debug//usr/bin/qemu-system-x86_64.debug...

warning: Can't open file anon_inode:[vfio-device] which was expanded to anon_inode:[vfio-device] during file-backed mapping note processing

warning: Can't open file /var/lib/libvirt/qemu/domain-1-W10-UEFI-Personal/.cache/mesa_shader_cache/index during file-backed mapping note processing

warning: Can't open file anon_inode:kvm-vcpu:3 which was expanded to anon_inode:kvm-vcpu:3 during file-backed mapping note processing

warning: Can't open file anon_inode:kvm-vcpu:2 which was expanded to anon_inode:kvm-vcpu:2 during file-backed mapping note processing

warning: Can't open file anon_inode:kvm-vcpu:1 which was expanded to anon_inode:kvm-vcpu:1 during file-backed mapping note processing

warning: Can't open file anon_inode:kvm-vcpu:0 which was expanded to anon_inode:kvm-vcpu:0 during file-backed mapping note processing

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing

warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing

warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing

warning: core file may not match specified executable file.
[New LWP 2153882]
[New LWP 2153880]
[New LWP 2153894]
[New LWP 2153885]
[New LWP 2153889]
[New LWP 2153886]
[New LWP 2153888]
[New LWP 2153914]
[New LWP 2153895]
[New LWP 2153890]
[New LWP 2153881]
[New LWP 2153887]
[New LWP 2153891]
[New LWP 2153795]
[New LWP 2153831]
[New LWP 2153883]
[New LWP 2153905]
[New LWP 2153897]
[New LWP 2153893]
[New LWP 2153884]
[New LWP 2153892]
[New LWP 2153918]
[New LWP 2153917]
[New LWP 2153879]
[New LWP 2153900]
[New LWP 2153898]
[New LWP 2153906]
[New LWP 2153896]
[New LWP 2153920]
[New LWP 2153913]
[New LWP 2154066]
[New LWP 2153902]
[New LWP 2153904]
[New LWP 2153901]
[New LWP 2153903]
[New LWP 2153899]
[New LWP 2153916]
[New LWP 2153915]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--c
Core was generated by `/usr/bin/qemu-system-x86_64 -name guest=W10-UEFI-Personal,debug-threads=on -S -'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f7330019a00 in ?? ()
[Current thread is 1 (Thread 0x7f7324a74640 (LWP 2153882))]
(gdb) bt
#0  0x00007f7330019a00 in  ()
#1  0x3a5a740e00000000 in  ()
#2  0x0000000000000000 in  ()

rugubara commented 3 years ago

another stack trace with X11/nouveau

PF16W6Y2 ~ # coredumpctl debug 24548
           PID: 24548 (qemu-system-x86)
           UID: 77 (qemu)
           GID: 77 (qemu)
        Signal: 11 (SEGV)
     Timestamp: Sun 2020-12-13 12:27:34 MSK (39s ago)
  Command Line: /usr/bin/qemu-system-x86_64 -name guest=W10-UEFI-Personal,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-W10-UEFI-Personal/master-key.aes -blockdev {"driver":"file","filename":"/usr/share/edk2-ovmf/OVMF_CODE.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"} -blockdev {"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/W10-UEFI-Personal_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"} -machine pc-q35-3.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format,memory-backend=pc.ram -cpu qemu64,hv-time,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,hv-vpindex,hv-runtime,hv-synic,hv-stimer,hv-tlbflush,hv-ipi,hv-evmcs -m 4096 -object memory-backend-ram,id=pc.ram,size=4294967296 -overcommit mem-lock=off -smp 4,sockets=1,dies=1,cores=1,threads=4 -uuid e7a2eb46-e551-4a5f-a355-441b4fba9b2f -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=29,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=0 -boot menu=off,strict=on -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 -device pcie-root-port,port=0x8,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 -device pcie-root-port,port=0x9,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x1 -device pcie-root-port,port=0xa,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x2 -device pcie-pci-bridge,id=pci.6,bus=pci.5,addr=0x0 -device qemu-xhci,p2=15,p3=15,id=usb,bus=pcie.0,addr=0x5 -device 
virtio-scsi-pci,id=scsi0,num_queues=4,bus=pcie.0,addr=0x9 -device virtio-serial-pci,id=virtio-serial0,bus=pcie.0,addr=0x7 -blockdev {"driver":"host_device","filename":"/dev/zvol/fast/VM/storage/W10-UEFI-Personal/Data","aio":"io_uring","node-name":"libvirt-3-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-3-format","read-only":false,"discard":"unmap","detect-zeroes":"off","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-3-storage"} -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=3,device_id=drive-scsi0-0-0-3,drive=libvirt-3-format,id=scsi0-0-0-3,write-cache=on -blockdev {"driver":"host_device","filename":"/dev/zvol/fast/VM/storage/W10-UEFI-Personal/W10-1903","aio":"io_uring","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","detect-zeroes":"off","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"} -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=5,device_id=drive-scsi0-0-0-5,drive=libvirt-2-format,id=scsi0-0-0-5,bootindex=2,write-cache=on -blockdev {"driver":"file","filename":"/var/lib/libvirt/images/virtio-win.iso","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":true,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"} -device ide-cd,bus=ide.1,share-rw=on,drive=libvirt-1-format,id=sata0-0-1,write-cache=on -fsdev local,security_model=mapped,id=fsdev-fs0,path=/home/Media -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=h:,bus=pcie.0,addr=0xb -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:26:89:21,bus=pci.1,addr=0x0 -chardev pty,id=charserial0 -device 
isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -chardev spiceport,id=charchannel1,name=org.spice-space.webdav.0 -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.spice-space.webdav.0 -tpmdev emulator,id=tpm-tpm0,chardev=chrtpm -chardev socket,id=chrtpm,path=/run/libvirt/qemu/swtpm/2-W10-UEFI-Personal-swtpm.sock -device tpm-tis,tpmdev=tpm-tpm0,id=tpm0 -device usb-tablet,id=input2,bus=usb.0,port=1 -spice port=0,disable-ticketing,gl=on,rendernode=/dev/dri/renderD129,seamless-migration=on -device ich9-intel-hda,id=sound0,bus=pcie.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/e7a2eb46-e551-4a5f-a355-441b4fba9b2f,display=on,bus=pci.2,addr=0x0 -device virtio-balloon-pci,id=balloon0,bus=pcie.0,addr=0x8 -set device.hostdev0.x-igd-opregion=on -set device.hostdev0.romfile=/var/lib/libvirt/images/vbios_gvt_uefi.rom -set device.hostdev0.ramfb=on -set device.hostdev0.driver=vfio-pci-nohotplug -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
    Executable: /usr/bin/qemu-system-x86_64
 Control Group: /machine.slice/machine-qemu\x2d2\x2dW10\x2dUEFI\x2dPersonal.scope
          Unit: machine-qemu\x2d2\x2dW10\x2dUEFI\x2dPersonal.scope
         Slice: machine.slice
       Boot ID: 8b23263623dd4832ac5a316482d087ba
    Machine ID: 2c7f8e45cda78c341d153c0f5bc203b4
      Hostname: PF16W6Y2
       Storage: /var/lib/systemd/coredump/core.qemu-system-x86.77.8b23263623dd4832ac5a316482d087ba.24548.1607851654000000
       Message: Process 24548 (qemu-system-x86) of user 77 dumped core.

GNU gdb (Gentoo 10.1 vanilla) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/qemu-system-x86_64...
Reading symbols from /usr/lib/debug//usr/bin/qemu-system-x86_64.debug...

warning: Can't open file anon_inode:[vfio-device] which was expanded to anon_inode:[vfio-device] during file-backed mapping note processing

warning: Can't open file /var/lib/libvirt/qemu/domain-2-W10-UEFI-Personal/.cache/mesa_shader_cache/index during file-backed mapping note processing

warning: Can't open file anon_inode:kvm-vcpu:3 which was expanded to anon_inode:kvm-vcpu:3 during file-backed mapping note processing

warning: Can't open file anon_inode:kvm-vcpu:2 which was expanded to anon_inode:kvm-vcpu:2 during file-backed mapping note processing

warning: Can't open file anon_inode:kvm-vcpu:1 which was expanded to anon_inode:kvm-vcpu:1 during file-backed mapping note processing

warning: Can't open file anon_inode:kvm-vcpu:0 which was expanded to anon_inode:kvm-vcpu:0 during file-backed mapping note processing

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing

warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing

warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing

warning: core file may not match specified executable file.
[New LWP 24647]
[New LWP 24650]
[New LWP 24668]
[New LWP 24548]
[New LWP 24657]
[New LWP 24658]
[New LWP 24656]
[New LWP 24659]
[New LWP 24661]
[New LWP 24662]
[New LWP 24570]
[New LWP 24660]
[New LWP 24663]
[New LWP 24665]
[New LWP 24651]
[New LWP 24645]
[New LWP 24644]
[New LWP 24654]
[New LWP 24646]
[New LWP 24648]
[New LWP 24652]
[New LWP 24664]
[New LWP 24653]
[New LWP 24655]
[New LWP 24666]
[New LWP 24667]
[New LWP 24669]
[New LWP 24670]
[New LWP 24671]
[New LWP 24678]
[New LWP 24679]
[New LWP 24680]
[New LWP 24681]
[New LWP 24683]
[New LWP 24682]
[New LWP 24695]
[New LWP 28397]
[New LWP 24649]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging-- 
Core was generated by `/usr/bin/qemu-system-x86_64 -name guest=W10-UEFI-Personal,debug-threads=on -S -'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f77e8009a00 in ?? ()
[Current thread is 1 (Thread 0x7f77d7fff640 (LWP 24647))]
(gdb) bt
#0  0x00007f77e8009a00 in  ()
#1  0x3a5a740e00000000 in  ()
#2  0x0000000000000000 in  ()

core.qemu-system-x86.77.8b23263623dd4832ac5a316482d087ba.24548.1607851654000000.gz

I have my qemu binary and the debug symbols directory (38M zipped). https://yadi.sk/d/ZP7ypvPIrrY2Bg

evelikov commented 3 years ago

The stack, despite the missing symbols, seems almost identical. So the odds of an Nvidia driver (related) bug are small.

If it were me, I would start with the kernel regression. It's a clear-cut, mechanical process to figure out (see "git help bisect" for details), plus it may be responsible for the qemu crash.

rugubara commented 3 years ago

I'm OK to bisect, but I don't have a good commit to start from. The v5.4 you mentioned earlier (from the official https://github.com/torvalds/linux.git) doesn't boot on my hardware. Which commit should I start from?

rugubara commented 3 years ago

I managed to build a bootable 5.2 kernel for my hardware and did a bisect. It landed on:

PF16W6Y2 /usr/src/linux # git bisect good
1b032ec1ecbce6047af7d11c9db432e237cb17d8 is the first bad commit
commit 1b032ec1ecbce6047af7d11c9db432e237cb17d8
Author: Joerg Roedel <jroedel@suse.de>
Date:   Wed Apr 29 15:37:12 2020 +0200

    iommu: Unexport iommu_group_get_for_dev()

    The function is now only used in IOMMU core code and shouldn't be used
    outside of it anyway, so remove the export for it.

    Signed-off-by: Joerg Roedel <jroedel@suse.de>
    Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Link: https://lore.kernel.org/r/20200429133712.31431-35-joro@8bytes.org
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

 drivers/iommu/iommu.c | 4 ++--
 include/linux/iommu.h | 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

I have the following hardware:

PF16W6Y2 /usr/src/linux # uname -a
Linux PF16W6Y2 5.7.0-rc3+ #22 SMP Sat Dec 19 14:18:50 MSK 2020 x86_64 Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz GenuineIntel GNU/Linux

System Information
    Manufacturer: LENOVO
    Product Name: 20M90019RT
    Version: ThinkPad P52
    Serial Number: PF16W6Y2
    UUID: a191484c-225d-11b2-a85c-e821bdf6066b
    Wake-up Type: Power Switch
    SKU Number: LENOVO_MT_20M9_BU_Think_FM_ThinkPad P52
    Family: ThinkPad P52

01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)

PF16W6Y2 /usr/src/linux # nvidia-settings --version

nvidia-settings: version 455.45.01
The NVIDIA X Server Settings tool.

This program is used to configure the NVIDIA Linux graphics driver. For more detail, please see the nvidia-settings(1) man page.

evelikov commented 3 years ago

Thanks for the bisection @rugubara. Can you confirm that this is the offending commit, by reverting it on top of 5.8?

I cannot see any in-tree users of the API (outside of the said file), plus it seems that the Nvidia driver does not use it either. Which makes me wonder about the failing kernel: is it vanilla from kernel.org, or are there some patches applied on top? Note: distributions often apply patches, even Arch does.
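
A minimal sketch of the revert test asked for above, assuming a clone of the kernel tree that contains the release tag. `revert_on_top` and the branch name it creates are hypothetical; the revert may stop with conflicts if later changes touched the same lines, in which case it has to be resolved by hand.

```shell
#!/bin/sh
# Hypothetical helper: create a scratch branch at a release tag and revert a
# single commit on top of it, leaving a tree that is ready to build and test.
revert_on_top() {
    tag=$1       # e.g. v5.8
    commit=$2    # e.g. the commit reported by the bisection
    git checkout -q -b "revert-test-$tag" "$tag" &&
    git revert --no-edit "$commit"
}

# Usage, run inside the kernel source tree:
#   revert_on_top v5.8 1b032ec1ecbce6047af7d11c9db432e237cb17d8
```

If the problem persists with the commit reverted, the bisection most likely converged on the wrong commit, for example because one of the intermediate results was marked incorrectly.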

grazzolini commented 3 years ago

@evelikov Yes, Arch does apply patches, but as far as I know, the only patch that's applied to the Arch kernel is to enable the sysctl for userns.

evelikov commented 3 years ago

> @evelikov Yes, Arch does apply patches, but as far as I know, the only patch that's applied to the Arch kernel is to enable the sysctl for userns.

The patch set varies across kernel releases. I'm not sh*tting on Arch (been using it for 10 years), but pointing out that nearly all distros patch it.

Can you confirm the bisection result on your end?

grazzolini commented 3 years ago

@evelikov I have not yet been able to patch a recent kernel to confirm the issue. Also, I needed to use a more recent kernel, so I'm not currently using gvt-g (I'm using egl-headless). But I might have some time next week and will give it a try.

rugubara commented 3 years ago

Thanks for the bisection @rugubara. Can you confirm that this is the offending commit, by reverting it on top of 5.8?

I cannot see any in-tree users of the API (outside of the said file), plus it seems that the Nvidia driver does not use it either. Which makes me wonder - the failing kernel - is it from vanilla from kernel.org, or there are some patches applied on top? Note: distributions often apply patches, even Arch does

I ran all tests on a vanilla kernel from kernel.org. I also tested v5.8 with this commit reverted, and qemu still crashes. So I did another bisect, which landed on:

PF16W6Y2 /usr/src/linux # git bisect log
git bisect start
# good: [3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162] Linux 5.7
git bisect good 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
# bad: [bcf876870b95592b52519ed4aafcf9d95999bc9c] Linux 5.8
git bisect bad bcf876870b95592b52519ed4aafcf9d95999bc9c
# good: [694b5a5d313f3997764b67d52bab66ec7e59e714] Merge tag 'arm-soc-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 694b5a5d313f3997764b67d52bab66ec7e59e714
# bad: [595a56ac1b0d5f0a16a89589ef55ffd35c1967a2] Merge tag 'linux-kselftest-kunit-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
git bisect bad 595a56ac1b0d5f0a16a89589ef55ffd35c1967a2
# good: [9fa88c5d3f5eae3e68ef20d226c3f13e21490668] hpfs: fix warning due to superfluous semicolon
git bisect good 9fa88c5d3f5eae3e68ef20d226c3f13e21490668
# skip: [80ef846e9909f22ccdc2a4a6d931266cecce8b2c] Merge tag 'staging-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect skip 80ef846e9909f22ccdc2a4a6d931266cecce8b2c
# good: [af5c2174ca6d9eb5d31a36fb057bbcf2aaac6f6c] iio: adc: at91-adc: Use devm_platform_ioremap_resource
git bisect good af5c2174ca6d9eb5d31a36fb057bbcf2aaac6f6c
# good: [f03c9b7884720973d1673fbb64f808897ca88a12] staging: fbtft: fb_st7789v: Initialize the Display
git bisect good f03c9b7884720973d1673fbb64f808897ca88a12
# good: [f558b8364e19f9222e7976c64e9367f66bab02cc] Merge tag 'driver-core-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
git bisect good f558b8364e19f9222e7976c64e9367f66bab02cc
# good: [20b0d06722169e6e66049c8fe6f1a48adffb79c6] Merge branch 'akpm' (patches from Andrew)
git bisect good 20b0d06722169e6e66049c8fe6f1a48adffb79c6
# bad: [23fc02e36e4f657af242e59175c891b27c704935] Merge tag 's390-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect bad 23fc02e36e4f657af242e59175c891b27c704935
# bad: [431275afdc7155415254aef4bd3816a1b8a2ead0] iommu: Check for deferred attach in iommu_group_do_dma_attach()
git bisect bad 431275afdc7155415254aef4bd3816a1b8a2ead0
# bad: [8a1d824625402b3ef3c3e5965663354ff0394d86] iommu/vt-d: Multiple descriptors per qi_submit_sync()
git bisect bad 8a1d824625402b3ef3c3e5965663354ff0394d86
# good: [c822b37cac48ea0e4c8202a42fdc480ace099b12] iommu/omap: Remove orphan_dev tracking
git bisect good c822b37cac48ea0e4c8202a42fdc480ace099b12
# bad: [6fc7020cf298aaec343df423746b44d99c6efaa5] iommu/vt-d: Apply per-device dma_ops
git bisect bad 6fc7020cf298aaec343df423746b44d99c6efaa5
# good: [cfcccbe8879f79bc9f8a162bcb482c74b8768094] iommu/amd: Fix variable "iommu" set but not used
git bisect good cfcccbe8879f79bc9f8a162bcb482c74b8768094
# good: [3a0ce12e3b8e3cb7d54569a42aec743cc93f4f0d] iommu/iova: Unify format of the printed messages
git bisect good 3a0ce12e3b8e3cb7d54569a42aec743cc93f4f0d
# bad: [327d5b2fee91c404a3956c324193892cf2cc9528] iommu/vt-d: Allow 32bit devices to uses DMA domain
git bisect bad 327d5b2fee91c404a3956c324193892cf2cc9528
# good: [ec9b40cffdb68c4ea1ebdcd1648ed6ce15c4449e] Merge tag 'v5.7-rc4' into core
git bisect good ec9b40cffdb68c4ea1ebdcd1648ed6ce15c4449e
# first bad commit: [327d5b2fee91c404a3956c324193892cf2cc9528] iommu/vt-d: Allow 32bit devices to uses DMA domain

Cannot do the verification with v5.8:

PF16W6Y2 /usr/src/linux # git apply -R offending.patch
error: drivers/iommu/intel-iommu.c: No such file or directory

WhyDoWeWonder commented 3 years ago

@rugubara, looking at the kernel history, drivers/iommu/intel-iommu.c was removed in 5.8. Could you try reversing the patch on top of 5.7 and see if that fixes it?
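
On the `git apply -R` failure: as far as I can tell the file was moved rather than deleted (mainline carries the VT-d driver as drivers/iommu/intel/iommu.c from 5.8 on), so a patch that names the old path cannot apply, whereas `git revert` goes through the merge machinery and can follow a rename. A minimal sketch of the difference in a throwaway repository; the paths and commits are made up for illustration and this is not the kernel tree:

```shell
#!/bin/sh
# Throwaway demo: all paths and commits here are hypothetical, not the kernel tree.
set -eu
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
git config user.email demo@example.com
git config user.name demo

# Base commit, then the "offending" change while the file still has its old path.
mkdir drv
printf 'line one\nline two\nline three\n' > drv/old-name.c
git add drv
git commit -q -m "base"
printf 'line one\nline two changed\nline three\n' > drv/old-name.c
git commit -q -a -m "offending change"
offending=$(git rev-parse HEAD)
git format-patch -1 --stdout "$offending" > "$demo/offending.patch"

# A later commit moves the file into a subdirectory.
mkdir drv/sub
git mv drv/old-name.c drv/sub/new-name.c
git commit -q -m "move file into subdirectory"

# Reverse-applying the patch fails: it still names the old path.
if git apply -R "$demo/offending.patch" 2>/dev/null; then
    apply_result=applied
else
    apply_result=failed
fi

# Reverting the commit is a three-way merge, so rename detection can kick in.
git revert --no-edit "$offending" >/dev/null
echo "git apply -R: $apply_result"
echo "after revert: $(sed -n 2p drv/sub/new-name.c)"
```

On the real tree the equivalent would be `git revert 327d5b2fee91` on top of v5.8, though it may well stop with conflicts given how much the VT-d code changed afterwards.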

rugubara commented 3 years ago

5.7 works OK by itself.

WhyDoWeWonder commented 3 years ago

By "5.7 works OK by itself", do you mean that GVT-g works with no issues with prime render offload on 5.7? Could you explain the current situation: what works and what doesn't on each kernel version?

rugubara commented 3 years ago

Please refer to the git bisect log above: https://github.com/intel/gvt-linux/issues/162#issuecomment-750175772 I tested whether the Windows 10 gvt-g guest, viewed over SPICE with virt-viewer, survives boot, login and normal shutdown without a crash while prime render offload is enabled with nvidia-drivers.

The vanilla 5.7 kernel passed. However, I wasn't able to see the boot animation in any test. I can see the boot animation when nvidia-drivers are not loaded.

evelikov commented 3 years ago

@WhyDoWeWonder welcome to the thread. Since there's a lot of information in here, I'd suggest re-reading it at least a couple of times - otherwise you're bound to get confused. I know I did :-)

@rugubara looking at the iommu changes upstream I do NOT think there's an easy way to revert. The next reasonable step I see is sending an email to the mailing list, CC-ing the maintainers and the commit author - quick summary with a link to this thread should suffice.

Edit: I think -> I do NOT think

rugubara commented 3 years ago

> @rugubara looking at the iommu changes upstream I do NOT think there's an easy way to revert. The next reasonable step I see is sending an email to the mailing list, CC-ing the maintainers and the commit author - quick summary with a link to this thread should suffice.

I did send an email to the list: https://lore.kernel.org/linux-iommu/CAP=18J5QjRpqix2eZgNgcUROPJXk_E0woE5J7DVT51eDGSfFAQ@mail.gmail.com/ However, it didn't seem to provoke any response.

evelikov commented 3 years ago

Hmm, no reply from upstream after 3 weeks. I have an idea to try over the weekend - fingers crossed it will get their attention a little better.

rugubara commented 3 years ago

I repeated my tests with 5.11.10 and nvidia-drivers-460.67. The guest no longer crashes, but the guest screen freezes while the mouse is inside the window. The mouse cursor is not visible, and the guest processes (blind) mouse clicks and key presses without any visual feedback. The window gets refreshed when the mouse is moved outside of it. I've uploaded a small video illustrating the window refresh delay: 2021-03-31 22-30-11.zip