tenclass / mvisor-win-vgpu-driver

Implementation of OpenGL on windows guest virtual machine using Mesa/Virgl protocol.
GNU General Public License v3.0

BSOD when loading the driver: `SYSTEM_THREAD_EXCEPTION_NOT_HANDLED` #10

Closed w568w closed 1 month ago

w568w commented 1 month ago

1. Problem

The guest kernel bugchecks (BSOD) when loading mvisor-win-vgpu-driver. The bugcheck code is `SYSTEM_THREAD_EXCEPTION_NOT_HANDLED`, and the reported error is "Attempt to read from address 0000000000000008".

2. Steps to reproduce

  1. Have a computer running Arch Linux x86_64;
  2. Create an empty .qcow2 image with qemu-img: `qemu-img create -f qcow2 win.qcow2 80G`;
  3. Run `git clone https://github.com/tenclass/mvisor` and build build/mvisor following the instructions in the README;
  4. Get the Windows 10 22H2 ISO (full name: Windows 10 (consumer editions), version 22H2 (updated July 2024) (x64) - DVD (Chinese-Simplified)) from MSDN I Tell You.
    • Link: magnet:?xt=urn:btih:04c08aeaf5f6849b30cead6f722138d7ce1460c6&dn=zh-cn_windows_10_consumer_editions_version_22h2_updated_july_2024_x64_dvd_3245b006.iso&xl=7133401088
  5. Fill config/sample.yaml with the following content:

```yaml
name: Default configuration
base: q35.yaml

machine:
  memory: 4G
  vcpu: 4
  # Set vcpu thread priority value [-20, 19]
  # A higher value means a lower priority
  priority: 1
  # Turn on BIOS output and performance measurement
  debug: No
  # Turn on hypervisor to lower CPU usage (Hyper-V is used for Windows)
  hypervisor: Yes

objects:
  - name: cmos
    # gmtime for linux, localtime for windows
    rtc: localtime
  - class: qxl
  - class: spice-agent
  - class: usb-tablet
  - class: virtio-network
    mac: 00:50:00:11:22:33
    map: tcp:0.0.0.0:8022-:22
  - class: ata-cdrom
    image: /home/w568w/Downloads/win10.iso
  - class: ata-cdrom
    image: /home/w568w/Downloads/virtio-win-0.1.240.iso
  - class: virtio-block
    image: /home/w568w/win.qcow2
    snapshot: No
  - class: virtio-vgpu
    memory: 1G
    staging: No
    blob: No
    node: /dev/dri/renderD129
```

  6. Run `./build/mvisor -c config/sample.yaml -vnc 5900` and install Windows normally;
  7. Download the release from https://github.com/tenclass/mvisor-win-vgpu-driver/releases/tag/v1.0.0 and extract it in the Windows guest;
  8. Run install.bat with administrator privileges. While the kernel driver is being installed, the screen freezes immediately, the guest BSODs, and then it restarts. A .dmp file is written to C:/Windows/minidump.

3. Additional Information

I debugged a little with WinDbg and Ghidra, and I believe the error is caused by a broken Idrs[0].FreeIdList.

The failure is classified as NULL_CLASS_PTR_DEREFERENCE, and it seems that the instruction at vgpu.sys+0x3753 was trying to read an address near zero. That instruction is most likely in:

https://github.com/tenclass/mvisor-win-vgpu-driver/blob/45ab463193b68a4fef70707f3ade92800cc25c6f/kernelmode/vgpu/idr.c#L50-L63

According to the stack trace in the dump, this function is called by VirtioVgpuDeviceReleaseHardware in vgpu.c.

I checked the disassembled code of UnInitializeIdr:

  1. At +3748, `LEA RBX, [0x14000c1b0]` loads the address of the static variable Idrs[0].FreeIdList into RBX (i.e. RBX = &Idrs[0].FreeIdList), then jumps to +37af;
  2. At +37af, `MOV RAX, qword ptr [RBX]` reads the first quadword at that address into RAX, which should give RAX = Idrs[0].FreeIdList.Flink;
  3. Idrs[0].FreeIdList.Flink and &Idrs[0].FreeIdList are compared (they are 0x0 and 0x14000c1b0 respectively); since they differ, execution jumps to +3753;
  4. At +3753, `CMP qword ptr [RAX + 0x8], RBX` reads from address RAX + 0x8, i.e. Idrs[0].FreeIdList.Flink->Blink, i.e. 0x0000000000000008, and the exception occurs.

The pseudocode is:

```c
RBX = &Idrs[0].FreeIdList; // 0x14000c1b0
RAX = *RBX; // Idrs[0].FreeIdList.Flink, 0x0 (!!!)
if (RAX != RBX) { // check if Idrs[0].FreeIdList is not empty
    *(RAX + 8); // read Idrs[0].FreeIdList.Flink->Blink (to verify the list's consistency?), exception occurred!
}
```
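To make the faulting address concrete, here is a minimal user-space sketch of the same situation. It is not the driver's code: `ListEntry`, `init_list_head`, `checked_address`, and `list_is_walkable` are hypothetical stand-ins for the kernel's `LIST_ENTRY`, `InitializeListHead`, and the list consistency check. It assumes the list head was still zero-filled (never initialized) when the teardown ran, which is what the `Flink == 0x0` value above suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the kernel's LIST_ENTRY: two pointers, Flink first. */
typedef struct ListEntry {
    struct ListEntry *Flink;
    struct ListEntry *Blink;
} ListEntry;

/* Blink sits one pointer past Flink, which is offset 0x8 on x64. */
static_assert(offsetof(ListEntry, Blink) == sizeof(void *),
              "Blink must directly follow Flink");

/* Same effect as InitializeListHead: an empty list points at itself. */
static void init_list_head(ListEntry *head) {
    head->Flink = head;
    head->Blink = head;
}

/* Address that a check of head->Flink->Blink would read.
 * It is only computed here, never dereferenced. */
static uintptr_t checked_address(const ListEntry *head) {
    return (uintptr_t)head->Flink + offsetof(ListEntry, Blink);
}

/* Hypothetical guard: a zero-filled head was never initialized,
 * so it must not be walked. */
static bool list_is_walkable(const ListEntry *head) {
    return head->Flink != NULL && head->Blink != NULL;
}
```

For a zero-filled head, `checked_address` yields `0x8` on a 64-bit build, matching the address in the bugcheck, while a head that went through `init_list_head` has `Flink == &head` and is reported as empty by the `RAX != RBX` comparison.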

4. Logs and dumps

Windows minidump: 080524-4140-01.dmp

System Information:

Kernel: Linux 6.10.3-x64v3-xanmod1
DE: KDE Plasma 6.1.3
WM: KWin (Wayland)
GPU 1: AMD Radeon Vega Series / Radeon Vega Mobile Series [Integrated]
GPU 2: NVIDIA GeForce GTX 1650 Mobile / Max-Q [Discrete]

nooodles2023 commented 1 month ago

Strange. VirtioVgpuDeviceReleaseHardware is only called when the driver is unloading. I guess you have not switched your Windows 10 to test-signing mode, so Windows unloaded the driver automatically before the Idrs had been initialized. The v1.0.0 driver is a release build, so "ASSERT(Idrs[i].Initilaized);" has no effect.
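The point about release builds can be illustrated with a small user-space sketch. This is not the driver's code: `IdrSketch` and both functions are hypothetical, and standard C `assert` with `NDEBUG` stands in for the WDK `ASSERT` macro, which is likewise compiled out of non-debug builds. The sketch contrasts a debug-only assertion with a runtime check that also protects release builds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the driver's Idrs table. */
typedef struct IdrSketch {
    bool Initialized;
    int  FreeCount;
} IdrSketch;

/* Debug-only pattern: when NDEBUG is defined the assert vanishes,
 * so teardown proceeds even on an entry that was never set up. */
static void uninitialize_idr_asserting(IdrSketch *idr) {
    assert(idr->Initialized);
    idr->FreeCount = 0;
    idr->Initialized = false;
}

/* Runtime guard: returns true if teardown ran, false if it was skipped
 * because the entry was missing or never initialized. */
static bool uninitialize_idr_guarded(IdrSketch *idr) {
    if (idr == NULL || !idr->Initialized) {
        return false; /* nothing was set up, so there is nothing to free */
    }
    idr->FreeCount = 0;
    idr->Initialized = false;
    return true;
}
```

With the guarded variant, calling teardown on an entry that was never initialized is a no-op in every build configuration, so an early unload would not touch an uninitialized list.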

w568w commented 1 month ago

> I guess you have not switched your Windows 10 to test-signing mode

That cannot be the case. I did enable it with `bcdedit.exe /set testsigning on`.

If I hadn't, the driver would not even run! It would be blocked during installation, and nothing would happen.

> The v1.0.0 driver is a release build, so "ASSERT(Idrs[i].Initilaized);" has no effect.

Do you mean that I should compile the driver in debug mode myself?

nooodles2023 commented 1 month ago

The system may unload the driver because the guest has too little memory. Try allocating more memory.

w568w commented 1 month ago

> The system may unload the driver because the guest has too little memory. Try allocating more memory.

I increased it to 8 GB of RAM. No luck. :(


I found that my issue is similar to #5, and I managed to compile the driver myself.

Both situations, his and mine, are listed here:

#5:

  1. Install released driver with install.bat directly: BSOD
  2. Install released driver in QEMU: Code 39 (Driver Entry Point Not Found)
  3. Compile the driver himself: Unable to compile

Mine:

  1. Install released driver with install.bat directly: BSOD (SYSTEM_THREAD_EXCEPTION_NOT_HANDLED)
  2. Install released driver in QEMU: Nothing happens. Seems the driver is not loaded at all
  3. Compile the driver myself: Installs normally, but does not work

nooodles2023 commented 1 month ago

The driver only works with Mvisor! I guess the key problem is in the loading process: the Windows kernel decided that something was wrong with the driver, so it unloaded the driver, and that caused the BSOD. I will test it on 22H2 tonight.

w568w commented 1 month ago

Update: I suspected that Code 39 was caused by targeting too high a version of the NT kernel, so I lowered _NT_TARGET_VERSION to 19041 and recompiled the driver.

Now the kernel driver installs and (seemingly?) loads successfully, and the device's status becomes "Operating normally". But now I run into #2 as well, with the same `[5412] IOCTL_VIRTIO_VGPU_GET_CAPS failed=31` error.


I will close this issue, since the problem described here has been fixed. Further discussion will continue in #2. Thanks a lot! :+1:

Solution: rebuild the kernel driver with a lower _NT_TARGET_VERSION (i.e. 19041, which is the only choice in Visual Studio 2022) and reinstall it. During installation I hit the BSOD again, but after rebooting, the device started working normally.