virt-pvm / linux

Linux kernel source tree for PVM

PVM guest kernel fails to start with virtio-pmem rootfs #1

Open zhuangel opened 6 months ago

zhuangel commented 6 months ago

Description

Following the guide, I built the PVM host kernel and PVM guest kernel and started a VM via Cloud Hypervisor. I can boot the VM with the default configuration, but if I change the rootfs to a pmem device, the system fails to boot with the following messages:

[ 0.994008] Run /sbin/init as init process
[ 1.001957] Starting init: /sbin/init exists but couldn't execute it (error -14)
[ 1.003975] Run /etc/init as init process
[ 1.005183] EXT4-fs error (device pmem0p1): __ext4_find_entry:1684: inode #41: comm init: reading directory lblock 0
[ 1.008044] Starting init: /etc/init exists but couldn't execute it (error -5)
[ 1.010054] Run /bin/init as init process
[ 1.011261] EXT4-fs warning (device pmem0p1): dx_probe:822: inode #1553: lblock 0: comm init: error -5 reading directory block
[ 1.014336] Starting init: /bin/init exists but couldn't execute it (error -5)
[ 1.016247] Run /bin/sh as init process
[ 1.017334] EXT4-fs warning (device pmem0p1): dx_probe:822: inode #1553: lblock 0: comm init: error -5 reading directory block
[ 1.020439] Starting init: /bin/sh exists but couldn't execute it (error -5)
[ 1.022364] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[ 1.026248] CPU: 0 PID: 1 Comm: init Not tainted 6.7.0-rc6-virt-pvm-guest+ #55
[ 1.028268] Hardware name: Cloud Hypervisor cloud-hypervisor, BIOS 0
[ 1.030055] Call Trace:
[ 1.030879] <TASK>
[ 1.031556] dump_stack_lvl+0x43/0x60
[ 1.032689] panic+0x2b2/0x2d0
[ 1.033597] ? rest_init+0xc0/0xc0
[ 1.034551] kernel_init+0x112/0x120
[ 1.035626] ret_from_fork+0x2b/0x40
[ 1.036606] ? rest_init+0xc0/0xc0
[ 1.037537] ret_from_fork_asm+0x11/0x20
[ 1.038627] </TASK>
[ 1.039490] Kernel Offset: disabled
[ 1.040556] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance. ]---

Steps to reproduce

  1. Build the PVM host kernel and PVM guest kernel following the guide pvm-get-started-with-kata.md.

  2. Guest VM resources from the guide: cloud-hypervisor v37 and the VM image from the guide.

  3. Start VM and Create snapshot

     #Start VM failed with PMEM rootfs
     cloud-hypervisor.v37 \
       --api-socket ch.sock \
       --log-file vmm.log \
       --cpus boot=1 \
       --kernel vmlinux.virt-pvm-guest \
       --cmdline 'console=ttyS0 root=/dev/pmem0p1 rw clocksource=kvm-clock pti=off' \
       --memory size=1G,hugepages=off,shared=false,prefault=off \
       --pmem id=pmem_0,discard_writes=on,file=ubuntu-22.04-pvm-kata.raw \
       -v --console off --serial tty

     #Start VM success with Virtio Block rootfs
     cloud-hypervisor.v37 \
       --api-socket ch.sock \
       --log-file vmm.log \
       --cpus boot=1 \
       --kernel vmlinux.virt-pvm-guest \
       --cmdline 'console=ttyS0 root=/dev/vda1 rw clocksource=kvm-clock pti=off' \
       --memory size=1G,hugepages=off,shared=false,prefault=off \
       --disk id=disk_0,path=ubuntu-22.04-pvm-kata.raw \
       -v --console off --serial tty

  4. Configuration to work around the issue. After some investigation, I found that the failure is caused by the MMIO address space layout chosen by Cloud Hypervisor: the virtio-pmem device is assigned an MMIO range at the end of the physical address space, which exceeds the default PVM guest virtual address space and triggers the read failures above. After changing the CPU configuration to --cpus boot=1,max_phys_bits=43, I can boot the VM with the rootfs on virtio-pmem (see the adjusted invocation below).
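For reference, here is a sketch of the working invocation under this workaround; it is identical to the failing command in step 3 except for the --cpus option:

     # Workaround: cap guest physical address bits so the virtio-pmem MMIO
     # range stays within the PVM guest address space
     cloud-hypervisor.v37 \
       --api-socket ch.sock \
       --log-file vmm.log \
       --cpus boot=1,max_phys_bits=43 \
       --kernel vmlinux.virt-pvm-guest \
       --cmdline 'console=ttyS0 root=/dev/pmem0p1 rw clocksource=kvm-clock pti=off' \
       --memory size=1G,hugepages=off,shared=false,prefault=off \
       --pmem id=pmem_0,discard_writes=on,file=ubuntu-22.04-pvm-kata.raw \
       -v --console off --serial tty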

zhuangel commented 6 months ago

full boot log

bysui commented 6 months ago

Thanks for your debugging and report!

After upgrading my QEMU version (the old QEMU version I used for testing L0 was 5.2.0), I found the issue. I'm not sure why the old QEMU version limited the physical address bits to 40 when my host supports 46 bits. This limitation caused nested KVM-EPT to perform worse than nested KVM-PVM in my performance tests. :(
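As a sanity check, the width actually exposed to the L1 guest can be verified with generic tooling; this is only a sketch, not anything specific to this repo, and it assumes QEMU's standard host-phys-bits/phys-bits x86 CPU properties:

  # Inside the L1 guest: what physical address width did the VMM expose?
  grep -m1 'address sizes' /proc/cpuinfo
  # e.g. "address sizes : 40 bits physical, ..." with the old QEMU default
  # When launching L1 with QEMU, the host width can be passed through with
  #   -cpu host,host-phys-bits=on   (or pinned explicitly, e.g. -cpu host,phys-bits=46)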

Actually, I encountered a similar problem before when working on Dragonball (which is also based on Rust-VMM) and implementing exclusive guest support. I tried to override the CPUID information (leaf 0x80000008) in KVM, but it didn't work. It seems that Cloud Hypervisor doesn't use the CPUID information from KVM either. However, in PVM, we use SPT, so the smaller physical address bits for guests don't impact performance.
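For reference, leaf 0x80000008 can also be inspected directly from userspace; a minimal sketch assuming the cpuid(1) utility is installed (EAX bits 7:0 report the physical address bits, bits 15:8 the linear address bits):

  # Dump CPUID leaf 0x80000008 on one CPU; EAX[7:0] (MAXPHYADDR) is what a
  # VMM reading the raw instruction, rather than KVM's CPUID tables, would see.
  cpuid -1 -l 0x80000008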

Currently, the PVM guest is located in the upper half of the address space, which leaves insufficient room for the guest kernel address space and reduces the direct mapping area. I'm not sure why no warning is reported when hotplugging the pmem memory. As mentioned in the cover letter, we will relocate the guest kernel into the low half eventually, where there will be enough room.
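As a back-of-the-envelope illustration of why capping max_phys_bits helps, taking 46 host physical bits as an example (the width mentioned above) and assuming the MMIO window sits near the top of the guest physical address space as described earlier:

  # Size of the guest physical address space that the MMIO window is placed in
  echo "46-bit guest physical space: $(( (1 << 46) >> 40 )) TiB"   # 64 TiB
  echo "43-bit guest physical space: $(( (1 << 43) >> 40 )) TiB"   # 8 TiB
  # Capping max_phys_bits pulls the virtio-pmem MMIO range down to where the
  # reduced PVM direct mapping can still cover it.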

I spoke to my colleague who works on Kata Containers to inquire if I can pass the 'max_phys_bits' parameter to the runtime or if it's possible to use a block device instead of a pmem device for the rootfs. Unfortunately, it can't be implemented by modifying the configuration file directly, as it requires code modifications. Therefore, I will document this and release a new image with the modified runtime at a later time.

zhuangel commented 5 months ago

Thanks for confirming.

Cloud Hypervisor lays out the initial guest memory based on 'max_phys_bits' and the host physical address bits. This places the MMIO address space at the end of the physical address space, and the virtio-pmem device then allocates its MMIO address range from it.

My workaround is to limit the Cloud Hypervisor MMIO address space so that it stays inside the PVM direct mapping area, which avoids the panic caused by later file reads from the virtio-pmem device failing.
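To check where the virtio-pmem window actually landed relative to that limit, something like the following can be run inside the guest (a rough sketch; the exact /proc/iomem labels may differ):

  # Top of a 43-bit guest physical address space, for comparison
  printf 'max_phys_bits=43 top: 0x%x\n' $(( (1 << 43) - 1 ))
  # Device / persistent-memory regions registered by the guest kernel
  # (run as root, otherwise the addresses in /proc/iomem are masked to zero)
  sudo grep -iE 'persistent|pmem|virtio' /proc/iomem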

bysui commented 5 months ago

Actually, I believe the VMMs should use the CPUID information acquired from KVM to determine the number of guest physical bits. It is permissible to choose a smaller number of guest physical bits, but setting a larger number should be avoided. I'm not sure whether Cloud Hypervisor retrieves the number of host physical bits directly with the CPUID instruction. If it does, then there might be a problem: for example, it may not take into account the reductions in MAXPHYADDR caused by memory encryption, which can affect shadow paging.
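A quick way to compare what the host's raw CPUID reports against what a VMM ultimately exposes to its guests (KVM's own view is available to VMMs via the KVM_GET_SUPPORTED_CPUID ioctl); a minimal sketch:

  # On the host: the raw CPUID-derived width. This can overstate the usable
  # width when memory encryption reduces the effective MAXPHYADDR.
  grep -m1 'address sizes' /proc/cpuinfo
  # Inside a guest started by the VMM under test: the width the VMM chose.
  grep -m1 'address sizes' /proc/cpuinfo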