virt-pvm / linux

Linux kernel source tree for PVM

PVM guest kernel fails to start with virtio-pmem rootfs #1

Open zhuangel opened 6 months ago

zhuangel commented 6 months ago

Description

Following the guide, I built the PVM host kernel and PVM guest kernel and started a VM via Cloud Hypervisor. I can boot the VM with the default configuration, but if I change the rootfs to a pmem device, the system fails to boot with the following messages:

[ 0.994008] Run /sbin/init as init process
[ 1.001957] Starting init: /sbin/init exists but couldn't execute it (error -14)
[ 1.003975] Run /etc/init as init process
[ 1.005183] EXT4-fs error (device pmem0p1): __ext4_find_entry:1684: inode #41: comm init: reading directory lblock 0
[ 1.008044] Starting init: /etc/init exists but couldn't execute it (error -5)
[ 1.010054] Run /bin/init as init process
[ 1.011261] EXT4-fs warning (device pmem0p1): dx_probe:822: inode #1553: lblock 0: comm init: error -5 reading directory block
[ 1.014336] Starting init: /bin/init exists but couldn't execute it (error -5)
[ 1.016247] Run /bin/sh as init process
[ 1.017334] EXT4-fs warning (device pmem0p1): dx_probe:822: inode #1553: lblock 0: comm init: error -5 reading directory block
[ 1.020439] Starting init: /bin/sh exists but couldn't execute it (error -5)
[ 1.022364] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[ 1.026248] CPU: 0 PID: 1 Comm: init Not tainted 6.7.0-rc6-virt-pvm-guest+ #55
[ 1.028268] Hardware name: Cloud Hypervisor cloud-hypervisor, BIOS 0
[ 1.030055] Call Trace:
[ 1.030879] <TASK>
[ 1.031556] dump_stack_lvl+0x43/0x60
[ 1.032689] panic+0x2b2/0x2d0
[ 1.033597] ? rest_init+0xc0/0xc0
[ 1.034551] kernel_init+0x112/0x120
[ 1.035626] ret_from_fork+0x2b/0x40
[ 1.036606] ? rest_init+0xc0/0xc0
[ 1.037537] ret_from_fork_asm+0x11/0x20
[ 1.038627] </TASK>
[ 1.039490] Kernel Offset: disabled
[ 1.040556] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance. ]---

Steps to reproduce

  1. Build the PVM host kernel and PVM guest kernel following the guide pvm-get-started-with-kata.md.

  2. Guest VM resources from the guide: cloud-hypervisor v37 and the VM image from the guide.

  3. Start VM and Create snapshot

     #Start VM failed with PMEM rootfs
     cloud-hypervisor.v37 \
       --api-socket ch.sock \
       --log-file vmm.log \
       --cpus boot=1 \
       --kernel vmlinux.virt-pvm-guest \
       --cmdline 'console=ttyS0 root=/dev/pmem0p1 rw clocksource=kvm-clock pti=off' \
       --memory size=1G,hugepages=off,shared=false,prefault=off \
       --pmem id=pmem_0,discard_writes=on,file=ubuntu-22.04-pvm-kata.raw \
       -v --console off --serial tty

     #Start VM success with Virtio Block rootfs
     cloud-hypervisor.v37 \
       --api-socket ch.sock \
       --log-file vmm.log \
       --cpus boot=1 \
       --kernel vmlinux.virt-pvm-guest \
       --cmdline 'console=ttyS0 root=/dev/vda1 rw clocksource=kvm-clock pti=off' \
       --memory size=1G,hugepages=off,shared=false,prefault=off \
       --disk id=disk_0,path=ubuntu-22.04-pvm-kata.raw \
       -v --console off --serial tty

  4. Configuration to work around the issue. After some investigation, I found that the failure is caused by the MMIO address space layout chosen by Cloud Hypervisor: the virtio-pmem device is assigned an MMIO range at the end of the physical address space, which exceeds the default PVM guest virtual address space and triggers the read failures above. After changing the CPU configuration to --cpus boot=1,max_phys_bits=43, I can boot the VM with the rootfs on virtio-pmem (see the adjusted invocation below).
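For reference, here is a sketch of the working invocation under this workaround; it is identical to the failing command in step 3 except for the --cpus option:

     # Workaround: cap guest physical address bits so the virtio-pmem MMIO
     # range stays within the PVM guest address space
     cloud-hypervisor.v37 \
       --api-socket ch.sock \
       --log-file vmm.log \
       --cpus boot=1,max_phys_bits=43 \
       --kernel vmlinux.virt-pvm-guest \
       --cmdline 'console=ttyS0 root=/dev/pmem0p1 rw clocksource=kvm-clock pti=off' \
       --memory size=1G,hugepages=off,shared=false,prefault=off \
       --pmem id=pmem_0,discard_writes=on,file=ubuntu-22.04-pvm-kata.raw \
       -v --console off --serial tty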

zhuangel commented 6 months ago

full boot log

bysui commented 6 months ago

Thanks for your debugging and report!

After upgrading my QEMU version (the old QEMU version I used for testing L0 was 5.2.0), I found the issue. I'm not sure why the old QEMU version limited the physical address bits to 40 when my host supports 46 bits. This limitation caused nested KVM-EPT to perform worse than nested KVM-PVM in my performance tests. :(
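As a sanity check, the width actually exposed to the L1 guest can be verified with generic tooling; this is only a sketch, not anything specific to this repo, and it assumes QEMU's standard host-phys-bits/phys-bits x86 CPU properties:

  # Inside the L1 guest: what physical address width did the VMM expose?
  grep -m1 'address sizes' /proc/cpuinfo
  # e.g. "address sizes : 40 bits physical, ..." with the old QEMU default
  # When launching L1 with QEMU, the host width can be passed through with
  #   -cpu host,host-phys-bits=on   (or pinned explicitly, e.g. -cpu host,phys-bits=46)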

Actually, I encountered a similar problem before when working on Dragonball (which is also based on Rust-VMM) and implementing exclusive guest support. I tried to override the CPUID information (leaf 0x80000008) in KVM, but it didn't work. It seems that Cloud Hypervisor doesn't use the CPUID information from KVM either. However, in PVM, we use SPT, so the smaller physical address bits for guests don't impact performance.
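For reference, leaf 0x80000008 can also be inspected directly from userspace; a minimal sketch assuming the cpuid(1) utility is installed (EAX bits 7:0 report the physical address bits, bits 15:8 the linear address bits):

  # Dump CPUID leaf 0x80000008 on one CPU; EAX[7:0] (MAXPHYADDR) is what a
  # VMM reading the raw instruction, rather than KVM's CPUID tables, would see.
  cpuid -1 -l 0x80000008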

Currently, the PVM guest is located in the upper half of the address space, which leaves insufficient room for the guest kernel address space and reduces the direct mapping area. I'm not sure why no warning is reported when hotplugging the pmem memory. As mentioned in the cover letter, we will relocate the guest kernel into the low half eventually, where there will be enough room.
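As a back-of-the-envelope illustration of why capping max_phys_bits helps, taking 46 host physical bits as an example (the width mentioned above) and assuming the MMIO window sits near the top of the guest physical address space as described earlier:

  # Size of the guest physical address space that the MMIO window is placed in
  echo "46-bit guest physical space: $(( (1 << 46) >> 40 )) TiB"   # 64 TiB
  echo "43-bit guest physical space: $(( (1 << 43) >> 40 )) TiB"   # 8 TiB
  # Capping max_phys_bits pulls the virtio-pmem MMIO range down to where the
  # reduced PVM direct mapping can still cover it.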

I spoke to my colleague who works on Kata Containers to inquire if I can pass the 'max_phys_bits' parameter to the runtime or if it's possible to use a block device instead of a pmem device for the rootfs. Unfortunately, it can't be implemented by modifying the configuration file directly, as it requires code modifications. Therefore, I will document this and release a new image with the modified runtime at a later time.

zhuangel commented 5 months ago

Thanks for confirming.

Cloud Hypervisor lays out the initial guest memory based on 'max_phys_bits' and the host physical address bits. This places the MMIO address space at the end of the physical address space, and the virtio-pmem device then allocates its MMIO address range from it.

My workaround is to limit the Cloud Hypervisor MMIO address space so that it stays inside the PVM direct mapping area, which avoids the panic caused by later file reads from the virtio-pmem device failing.
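To check where the virtio-pmem window actually landed relative to that limit, something like the following can be run inside the guest (a rough sketch; the exact /proc/iomem labels may differ):

  # Top of a 43-bit guest physical address space, for comparison
  printf 'max_phys_bits=43 top: 0x%x\n' $(( (1 << 43) - 1 ))
  # Device / persistent-memory regions registered by the guest kernel
  # (run as root, otherwise the addresses in /proc/iomem are masked to zero)
  sudo grep -iE 'persistent|pmem|virtio' /proc/iomem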

bysui commented 5 months ago

Actually, I believe the VMMs should use the CPUID information acquired from KVM to determine the number of guest physical bits. It is permissible to choose a smaller number of guest physical bits, but setting a larger number should be avoided. I'm not sure whether Cloud Hypervisor retrieves the number of host physical bits directly with the CPUID instruction. If it does, then there might be a problem: for example, it may not take into account the reductions in MAXPHYADDR caused by memory encryption, which can affect shadow paging.
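A quick way to compare what the host's raw CPUID reports against what a VMM ultimately exposes to its guests (KVM's own view is available to VMMs via the KVM_GET_SUPPORTED_CPUID ioctl); a minimal sketch:

  # On the host: the raw CPUID-derived width. This can overstate the usable
  # width when memory encryption reduces the effective MAXPHYADDR.
  grep -m1 'address sizes' /proc/cpuinfo
  # Inside a guest started by the VMM under test: the width the VMM chose.
  grep -m1 'address sizes' /proc/cpuinfo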