virt-pvm / linux

Linux kernel source tree for PVM

PVM guest kernel hang on AMD virtual machine #3

Closed zhuangel closed 7 months ago

zhuangel commented 8 months ago

Description

Booting the demo VM on an AMD Zen 2 virtual machine (where PCID is disabled) hangs.

Steps to reproduce

Build the PVM host kernel and PVM guest kernel: following the guide pvm-get-started-with-kata.md, install the PVM host kernel in the AMD Zen 2 virtual machine.

Prepare the PVM VM resources from the guide: cloud-hypervisor v37 and the VM image from the guide.

Start the PVM VM on the AMD Zen 2 virtual machine:

cloud-hypervisor.v37 \
    --api-socket ch.sock \
    --log-file vmm.log \
    --cpus boot=1 \
    --kernel vmlinux.virt-pvm-guest \
    --cmdline 'console=ttyS0 root=/dev/vda1 rw clocksource=kvm-clock pti=off' \
    --memory size=1G,hugepages=off,shared=false,prefault=off \
    --disk id=disk_0,path=ubuntu-22.04-pvm-kata.raw \
    -v --console off --serial tty

The PVM VM hangs. There is no output on the serial console, and the VMM process CPU usage is very low (so it is not spinning in a dead loop). After enabling all KVM tracepoints, I found that emulation of an MSR read failed (index 0xc0011020, MSR_AMD64_LS_CFG) and PVM injected a #GP into the PVM VM. The PVM VM then hangs in early_fixup_exception because the CS is __KERNEL32_CS.

vcpu0-3676813 [000] d..1. 275659.054772: kvm_exit: vcpu 0 reason GP excp rip 0xffffd97f81051232 info1 0x000000000000000d info2 0x0000000000000000 intr_info 0x0000000d error_code 0x00000000
vcpu0-3676813 [000] ..... 275659.054774: kvm_emulate_insn: 0:ffffd97f81051232:0f 32 (prot64)
vcpu0-3676813 [000] ..... 275659.054774: kvm_msr: msr_read c0011020 = 0x0 (#GP)
vcpu0-3676813 [000] ..... 275659.054775: kvm_inj_exception: #GP (0x0)

Workaround

I am not sure why the CS is __KERNEL32_CS in early_fixup_exception. Maybe the PVM kernel needs a check similar to xen_pv_domain() so that it can finish the fixup process. I can work around the issue with the following fix:

--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -322,7 +322,7 @@ void __init early_fixup_exception(struct pt_regs *regs, int trapnr)
         * the 486 DX works this way.
         * Xen pv domains are not using the default __KERNEL_CS.
         */
-       if (!xen_pv_domain() && regs->cs != __KERNEL_CS)
+       if (!xen_pv_domain() && regs->cs != __KERNEL_CS && regs->cs != __KERNEL32_CS)
                goto fail;
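
For readers following along, here is a minimal user-space sketch of the CS check before and after the workaround. It is only a model of the condition in the diff, not kernel code: the function names are hypothetical, and the selector values are the x86-64 Linux definitions (GDT entry 1 and 2, each times 8).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86-64 Linux selector values: GDT index * 8, RPL 0. */
#define KERNEL32_CS 0x08u /* __KERNEL32_CS, GDT entry 1 */
#define KERNEL_CS   0x10u /* __KERNEL_CS,   GDT entry 2 */

/* Hypothetical model of the original check: true means the early
 * fixup may proceed, false means the code would "goto fail". */
static bool cs_ok_original(bool xen_pv, uint16_t cs)
{
    return xen_pv || cs == KERNEL_CS;
}

/* Hypothetical model of the check with the workaround applied. */
static bool cs_ok_workaround(bool xen_pv, uint16_t cs)
{
    return xen_pv || cs == KERNEL_CS || cs == KERNEL32_CS;
}

/* The reported case: not a Xen PV domain, pushed CS is __KERNEL32_CS. */
static void check_reported_case(void)
{
    assert(!cs_ok_original(false, KERNEL32_CS));  /* fails the check */
    assert(cs_ok_workaround(false, KERNEL32_CS)); /* fixup can run */
}
```

With the original condition, a #GP delivered with CS = 0x08 never reaches the exception table lookup, which matches the hang described above.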

bysui commented 8 months ago

Thanks for your debugging and report!

Sorry, I don't have an AMD Zen 2 in my environment. However, I can easily reproduce the issue on an Intel platform.

In PVM, only PVM mode is allowed to run on the hardware, while non-PVM mode is emulated. When transitioning from non-PVM mode to PVM mode, pvm->msr_star is set to the current CS selector in try_to_convert_to_pvm_mode(), and the CS pushed on the stack during kernel exception delivery is obtained from pvm->msr_star in do_pvm_supervisor_exception().

According to the Linux boot protocol, the CPU is already in 64-bit long mode when it reaches startup_64(), and the Rust VMM seems to set the CS selector to 1 * 8 for booting, which corresponds to __KERNEL32_CS in Linux. Therefore, pvm->msr_star is set to the value of __KERNEL32_CS. Although the later boot code updates the CS selector to __KERNEL_CS, we don't update pvm->msr_star in the pvm_set_segment() callback. As a result, until the guest sets up MSR_STAR, the pushed CS is always __KERNEL32_CS. Unfortunately, during our testing we never hit an unsupported MSR before the real event entry was set up.
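
The selector arithmetic can be checked with a small user-space sketch. It assumes the x86-64 Linux GDT layout (entry 1 for __KERNEL32_CS, entry 2 for __KERNEL_CS, entry 4 with RPL 3 for __USER32_CS) and the architectural STAR layout, where bits 47:32 hold the kernel CS selector and bits 63:48 the user base selector. The helper names are hypothetical and are not taken from the PVM source.

```c
#include <assert.h>
#include <stdint.h>

/* x86-64 Linux selector values: GDT index * 8, plus RPL 3 for user. */
#define KERNEL32_CS 0x08u /* __KERNEL32_CS, GDT entry 1 (the "1 * 8" above) */
#define KERNEL_CS   0x10u /* __KERNEL_CS,   GDT entry 2 */
#define USER32_CS   0x23u /* __USER32_CS,   GDT entry 4, RPL 3 */

/* Hypothetical helper: compose a STAR value from a kernel CS selector,
 * with the user base selector fixed to __USER32_CS. */
static uint64_t make_star(uint16_t kernel_cs)
{
    assert((kernel_cs & 3u) == 0); /* a kernel selector has RPL 0 */
    return ((uint64_t)kernel_cs << 32) | ((uint64_t)USER32_CS << 48);
}

/* Hypothetical helper: extract STAR[47:32], the CS that would be
 * pushed on kernel exception delivery in the description above. */
static uint16_t star_kernel_cs(uint64_t star)
{
    return (uint16_t)(star >> 32);
}
```

If STAR is built from the boot-time selector, star_kernel_cs() returns 0x08 (__KERNEL32_CS); built from the selector the guest loads later, it returns 0x10 (__KERNEL_CS).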

The fix may look like this:

diff --git a/arch/x86/kvm/pvm/pvm.c b/arch/x86/kvm/pvm/pvm.c
index 35bdcf9b977d..0f1af1c0e528 100644
--- a/arch/x86/kvm/pvm/pvm.c
+++ b/arch/x86/kvm/pvm/pvm.c
@@ -1318,6 +1318,9 @@ static void pvm_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int
                                goto invalid_change;
                        if (cpl == 0 && !var->l)
                                pvm->non_pvm_mode = true;
+                       if (cpl == 0 && !pvm->non_pvm_mode)
+                               pvm->msr_star = ((u64)pvm->segments[VCPU_SREG_CS].selector << 32) |
+                                               ((u64)__USER32_CS << 48);
                }
                break;
        case VCPU_SREG_LDTR:

However, I will discuss the issue with Lai Jiangshan later to come up with a formal fix.

Champ-Goblem commented 8 months ago

Hi @bysui, thank you for the AMD fixes. We have been trying PVM (from the pvm-fix branch) on GCP with the C3D machine type. Everything seemed fine when booting a pod in Kubernetes using PVM and Kata CLH. However, when we run a workload (for example Geekbench), the hypervisor crashes, and the host kernel logs the following at the time of the crash:

[ 2071.547461] kvm: kvm [10756]: vcpu0, guest rIP: 0xffffd97f818b33b9 Unhandled WRMSR(0xc0010007) = 0xffff

bysui commented 7 months ago

Hi @Champ-Goblem, thank you for trying and testing PVM. We're glad to receive your feedback. Sorry for the late reply; I have just returned from vacation.

I have tested Geekbench on my AMD machine using our VM images, but I wasn't able to reproduce the issue. The provided output is not a problem, as I can also see it in my environment. My assumption is that the cause may be related to wrmsrl_safe() during the initialization of the PMU during guest booting. Can you provide your host kernel config file and CPU information? I can try using them to reproduce the issue.

It is expected to see a panic and backtrace in the log when the host panics. If this is not the case, it indicates a bug in the host entries. You may try using Kdump to collect the dmesg log for further analysis.

The problem might be related to direct switching. You can try disabling direct switching with the following change:

diff --git a/arch/x86/kvm/pvm/pvm.c b/arch/x86/kvm/pvm/pvm.c
index 37e8a19bc064..fc1a16b8955d 100644
--- a/arch/x86/kvm/pvm/pvm.c
+++ b/arch/x86/kvm/pvm/pvm.c
@@ -2726,6 +2726,8 @@ static fastpath_t pvm_vcpu_run(struct kvm_vcpu *vcpu)
        if (pvm->host_debugctlmsr)
                update_debugctlmsr(0);

+       pvm->switch_flags |= SWITCH_FLAGS_NO_DS_CR3;
+
        pvm_vcpu_run_noinstr(vcpu);

        if (is_smod_befor_run != is_smod(pvm)) {

Does your C3D machine support PCID and INVPCID? If it does and the problem still persists after disabling direct switching, then the issue may be related to PCID management. You can try flushing the TLB on each VM entry with the following change:

diff --git a/arch/x86/kvm/pvm/pvm.c b/arch/x86/kvm/pvm/pvm.c
index 37e8a19bc064..f627ff0c0aa8 100644
--- a/arch/x86/kvm/pvm/pvm.c
+++ b/arch/x86/kvm/pvm/pvm.c
@@ -752,8 +752,8 @@ static void pvm_set_host_cr3_for_guest_with_host_pcid(struct vcpu_pvm *pvm)
        u64 hw_cr3 = root_hpa | host_pcid;
        u64 switch_host_cr3;

-       if (!flush)
-               hw_cr3 |= CR3_NOFLUSH;
+       // if (!flush)
+       //      hw_cr3 |= CR3_NOFLUSH;

In addition, you can use perf to trace the kvm_exit tracepoint and obtain the last exit reason for further analysis.
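
To illustrate what the second patch changes, here is a minimal user-space sketch of how a CR3 value with a PCID is composed. It relies on the architectural layout only (PCID in CR3[11:0] when CR4.PCIDE is set, bit 63 of the written value meaning "do not flush"). The function name and parameters are hypothetical and only mirror the variable names visible in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_PCID_MASK 0xFFFull     /* CR3[11:0]: PCID when CR4.PCIDE=1 */
#define CR3_NOFLUSH   (1ull << 63) /* bit 63: keep TLB entries for this PCID */

/* Hypothetical helper: combine a page-table root with a PCID and,
 * when no flush is requested, set the no-flush bit. */
static uint64_t build_hw_cr3(uint64_t root_hpa, uint16_t pcid, bool flush)
{
    uint64_t cr3;

    assert((root_hpa & CR3_PCID_MASK) == 0); /* root must be page aligned */
    assert(pcid <= CR3_PCID_MASK);           /* PCID is 12 bits wide */

    cr3 = root_hpa | pcid;
    if (!flush)
        cr3 |= CR3_NOFLUSH;
    return cr3;
}
```

Commenting out the `if (!flush)` branch in the patch corresponds to always calling this sketch with flush set to true, so every switch to that CR3 invalidates the TLB entries tagged with the PCID.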

Champ-Goblem commented 7 months ago

I hope you had a nice holiday! I have recreated the instance on GCP to check the cases you suggested; annoyingly, however, I can no longer reproduce the issue either.

I will post the kernel config here anyway: config-6.7.0-rc6+.txt

The details on the CPU of this instance:

vendor_id   : AuthenticAMD
cpu family  : 25
model       : 17
model name  : AMD EPYC 9B14
stepping    : 1
microcode   : 0xffffffff
cpu MHz     : 2599.998
cache size  : 1024 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 7
initial apicid  : 7
fpu     : yes
fpu_exception   : yes
cpuid level : 16
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
bugs        : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips    : 5199.99
TLB size    : 3584 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 52 bits physical, 57 bits virtual
power management:
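
On the PCID/INVPCID question: both `pcid` and `invpcid` appear in the flags line above. A minimal sketch of checking a flags string for a whole-word match follows. The helper name is hypothetical, and the whole-word test matters because `pcid` is also a suffix of `invpcid`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Separator test: the flags value is a whitespace-separated word list. */
static bool is_sep(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\0';
}

/* Hypothetical helper: true if `name` occurs in `flags` as a whole word. */
static bool has_cpu_flag(const char *flags, const char *name)
{
    size_t len;
    const char *p;

    assert(flags != NULL && name != NULL);
    len = strlen(name);
    if (len == 0)
        return false;

    for (p = strstr(flags, name); p != NULL; p = strstr(p + 1, name)) {
        bool starts_word = (p == flags) || is_sep(p[-1]);
        bool ends_word = is_sep(p[len]);

        if (starts_word && ends_word)
            return true;
    }
    return false;
}
```

For example, has_cpu_flag("fma cx16 pcid sse4_1", "pcid") is true, while has_cpu_flag("erms invpcid avx512f", "pcid") is false.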

zhuangel commented 7 months ago

Hi @bysui Thanks for the fix!

I have verified the fix in my test environment. After updating the host kernel to the pvm-fix branch, I can boot the PVM guest kernel, and MSR emulation failures do not trigger #GP injection into the guest kernel.

bysui commented 7 months ago

Hi @Champ-Goblem, I tried using your host configuration, but I couldn't reproduce the issue either. However, please note that the CPU on my AMD machine is older than yours. If you encounter the issue again in the future, we can open an issue to track it. We will also conduct more tests on the AMD platform.

Thank you for testing PVM.