virtio-win / kvm-guest-drivers-windows

Windows paravirtualized drivers for QEMU\KVM
https://www.linux-kvm.org/page/WindowsGuestDrivers
BSD 3-Clause "New" or "Revised" License

Windows Operating System Boot Issue, Windows Servers #1139

Open celovasquesjr opened 2 months ago

celovasquesjr commented 2 months ago

I am facing an issue where the Windows Server VM gets stuck on the "TIANO CORE" screen after a reboot. A simple reboot trigger, such as a "Windows Update", causes the problem. It doesn't happen consistently, and it's very difficult to reproduce.

To fix the issue and boot the OS, I have to use the poweroff force button and then start the VM again. After that, it boots normally.

Steps to reproduce the behavior are unclear since the issue doesn't occur consistently. It seems to happen randomly after certain reboot triggers, like Windows Updates.

The VM should reboot and load the OS without getting stuck on the "TIANO CORE" screen.


Could the Windows Update be interfering with any drivers, causing the VM not to boot? Is there a driver issue I should investigate?

(I have already checked logs from inside the VM and on KVM, but nothing useful was generated.)

Has anyone experienced this type of problem before?

YanVugenfirer commented 2 months ago

Looks like a UEFI BIOS failure, not a driver failure.

Can you post the QEMU command line?

celovasquesjr commented 2 months ago

Hello YanVugenfirer,

Here is the QEMU command line:

LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin \
HOME=/var/lib/libvirt/qemu/domain-280-i-96-880-VM \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-280-i-96-880-VM/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-280-i-96-880-VM/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-280-i-96-880-VM/.config \
/usr/bin/qemu-system-x86_64 \
-name guest=i-96-880-VM,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-280-i-96-880-VM/master-key.aes"}' \
-blockdev '{"driver":"file","filename":"/usr/share/OVMF/OVMF_CODE_4M.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}' \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/XXX-XXX-XXX-XXX-XXX.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"}' \
-machine pc-q35-6.2,usb=off,dump-guest-core=off,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format,memory-backend=pc.ram \
-accel kvm \
-cpu Icelake-Server,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,avx512ifma=on,sha-ni=on,rdpid=on,fsrm=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,mpx=off,intel-pt=off,hv-time=on \
-m 4096 \
-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":4294967296}' \
-overcommit mem-lock=off \
-smp 2,sockets=2,cores=1,threads=1 \
-uuid XXX-XXX-XXX-XXX-XXX \
-smbios 'type=1,manufacturer=Apache Software Foundation,product=CloudStack KVM Hypervisor,uuid=XXX-XXX-XXX-XXX-XXX' \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=595,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=localtime \
-no-shutdown \
-boot strict=on \
-device pcie-root-port,port=16,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=17,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=18,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=19,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=20,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=21,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \
-device pcie-pci-bridge,id=pci.7,bus=pci.1,addr=0x0 \
-device pcie-root-port,port=22,chassis=8,id=pci.8,bus=pcie.0,addr=0x2.0x6 \
-device qemu-xhci,id=usb,bus=pci.3,addr=0x0 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.4,addr=0x0 \
-object '{"qom-type":"secret","id":"libvirt-2-storage-auth-secret0","data":"XXX","keyid":"XXX","iv":"XXX","format":"base64"}' \
-blockdev '{"driver":"rbd","pool":"XXX","image":"XXX","server":[{"host":"XXX.XXX.XXX.XXX","port":"0"},{"host":"XXX.XXX.XXX.XXX","port":"0"},{"host":"XXX.XXX.XXX.XXX","port":"0"},{"host":"XXX.XXX.XXX.XXX","port":"0"},{"host":"XXX.XXX.XXX.XXX","port":"0"}],"user":"XXX","auth-client-required":["cephx","none"],"key-secret":"libvirt-2-storage-auth-secret0","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' \
-device virtio-blk-pci,bus=pci.5,addr=0x0,drive=libvirt-2-format,id=virtio-disk0,bootindex=2,write-cache=on,serial=f01bb52d5ad847f99170 \
-device ide-cd,bus=ide.3,id=sata0-0-3,bootindex=1 \
-netdev tap,fd=601,id=hostnet0,vhost=on,vhostfd=793 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=XXX:XXX:XXX:XXX:XXX:XXX,bus=pci.2,addr=0x0 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-chardev socket,id=charchannel0,fd=593,server=on,wait=off \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-audiodev '{"id":"audio1","driver":"none"}' \
-object '{"qom-type":"tls-creds-x509","id":"vnc-tls-creds0","dir":"/etc/pki/libvirt-vnc","endpoint":"server","verify-peer":true}' \
-vnc XXX.XXX.XXX.XXX:XXX,password=on,tls-creds=vnc-tls-creds0,audiodev=audio1 \
-device cirrus-vga,id=video0,bus=pcie.0,addr=0x1 \
-device i6300esb,id=watchdog0,bus=pci.7,addr=0x1 \
-watchdog-action none \
-device virtio-balloon-pci,id=balloon0,bus=pci.6,addr=0x0 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
char device redirected to /dev/pts/111 (label charserial0)

I replaced irrelevant information with 'XXX'.

celovasquesjr commented 2 months ago

Hi YanVugenfirer

Can you help me?

YanVugenfirer commented 2 months ago

@celovasquesjr Sorry, I am travelling to a conference. Might take some time. In any case, I think the issue is not related to drivers.

celovasquesjr commented 2 months ago

@YanVugenfirer Thank you! I would appreciate your feedback when you can. Sorry, we really don't know why this is happening.

xiagao commented 2 months ago

Hi @celovasquesjr, could you tell us the guest OS version, host kernel, qemu-kvm version, and virtio-win driver version? From your QEMU command line, the guest has only 2 vCPUs; could you increase that and try again?

celovasquesjr commented 2 months ago

Hi @xiagao,

The versions are as follows:

Guest version: So far, the issue has occurred on Windows Server 2019 and 2022
Host kernel: 5.15.0-107-generic
QEMU-KVM version: 6.2.0
VirtIO version: Virtio-win-guest-tools 0.1.229

Regarding the suggestion to increase the number of CPUs, I'd like to emphasize that the UEFI boot issue does not appear when the machine is shut down and powered back on; I'm able to boot the OS successfully afterward. The issue is not easy to reproduce consistently. I've encountered it a few times when machines rebooted overnight because of updates and scheduled tasks, and by the morning they were stuck on that screen. On other Windows machines, however, the problem has started happening more frequently, with a plain reboot being enough to trigger it. Still, as I mentioned, it's not easy to reproduce: sometimes it happens, sometimes it doesn't.

ybendito commented 2 months ago

@celovasquesjr I'd suggest running a test of automatic system reboots, probably with a randomized delay before each reboot, to understand whether the problem is related to the RBD disks. If the problem can be reproduced in such a test with RBD but not in a similar test with a local image, that would narrow down the source of the problem.
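
The randomized-delay reboot test suggested above could be scripted on the host roughly as follows. This is only a sketch under several assumptions: the host manages the VM through libvirt (`virsh`), the QEMU guest agent is installed and running in the Windows guest, and the helper names (`plan_delays`, `agent_alive`, `reboot_test`) as well as the timeout and polling values are made up for illustration. A guest that is still shutting down may answer one last ping, so the detection here is approximate.

```python
import random
import subprocess
import time


def plan_delays(cycles, min_delay, max_delay, seed=None):
    """Return one randomized pre-reboot delay (in seconds) per test cycle."""
    rng = random.Random(seed)
    return [rng.uniform(min_delay, max_delay) for _ in range(cycles)]


def agent_alive(domain, run=subprocess.run):
    """Return True if the QEMU guest agent inside the VM answers a ping."""
    result = run(
        ["virsh", "qemu-agent-command", domain, '{"execute":"guest-ping"}'],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def reboot_test(domain, delays, boot_timeout=300, poll=10,
                run=subprocess.run, sleep=time.sleep):
    """Reboot `domain` once per delay.

    Returns the index of the first cycle in which the guest agent did not
    come back within `boot_timeout` seconds, or None if every cycle booted.
    """
    for cycle, delay in enumerate(delays):
        sleep(delay)                                  # randomized idle time
        run(["virsh", "reboot", domain], check=True)  # ask libvirt to reboot the guest
        sleep(poll)                                   # give the guest time to go down
        waited = 0
        while not agent_alive(domain, run):
            if waited >= boot_timeout:
                return cycle                          # stop and leave the VM stuck for inspection
            sleep(poll)
            waited += poll
    return None
```

A run might look like `reboot_test("i-96-880-VM", plan_delays(100, 30, 600))`, once against a guest on RBD and once against a copy on a local image. The function stops at the first stuck cycle on purpose, so the hung firmware screen can be examined before anyone forces a power-off.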

xiagao commented 2 months ago

I also hit a similar issue on Win10 64-bit, reproducible in 15/99 attempts. @ybendito Could you have a look at the similar issue in Jira? I mentioned you there.
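
To put the 15/99 figure in perspective, here is a rough estimate of how many clean reboots are needed before "could not reproduce" becomes meaningful. It assumes each reboot fails independently with the same probability, which may not hold for this bug, and the function name `reboots_needed` is purely illustrative.

```python
import math


def reboots_needed(failure_rate, confidence=0.95):
    """Smallest number of reboots n such that at least one failure is
    expected with probability >= confidence, i.e. (1 - p)^n <= 1 - confidence.
    Assumes independent reboots with a constant per-reboot failure rate p."""
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


# Rate observed on the Win10 guest in this thread: 15 failures in 99 attempts.
print(reboots_needed(15 / 99))
```

At that rate about 19 reboots are enough to see the hang at least once with 95% probability. If the rate on Windows Server guests is much lower, a handful of manual reboots would not be expected to trigger it.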

celovasquesjr commented 2 months ago

@ybendito I will perform the tests as requested and get back to you shortly once I have a response.

@xiagao Did you run your tests on RBD disks?

xiagao commented 2 months ago

> @ybendito I will perform the tests as requested and get back to you shortly once I have a response.
>
> @xiagao Did you run your tests on RBD disks?

No, I didn't. My test was on the local host with a qcow2 file as the disk.

celovasquesjr commented 1 month ago

Hi guys

I couldn't reproduce the issue with either RBD or local disks.

Is there anything else I should check?

Could it be related to one specific Windows Update? I had issues when the machine automatically rebooted due to the Windows Update during the night.

However, the strange thing is that I also couldn't reproduce the issue by manually installing the updates and rebooting.

xiagao commented 1 month ago

> Hi guys
>
> I couldn't reproduce the issue with either RBD or local disks.
>
> Is there anything else I should check?
>
> Could it be related to one specific Windows Update? I had issues when the machine automatically rebooted due to the Windows Update during the night.
>
> However, the strange thing is that I also couldn't reproduce the issue by manually installing the updates and rebooting.

You could try Win10 64-bit guests, with either a local disk or RBD. There the issue can be reproduced after some repeated reboots, whereas on my side it is hard to reproduce on other Windows versions.