lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org

Crash when computing IDT hash in old Xen PV guest #159

Closed solardiz closed 2 years ago

solardiz commented 2 years ago

We might not care, as the issue does not affect currently supported Qubes releases, but this might be a bug/shortcoming of ours affecting other systems as well, so documenting it here for later:

Trying to load LKRG on (no longer supported) Qubes 3.x paravirtualized kernels in a Xen VM, based on Linux 4.9.x or 4.14.x, instantly crashes in p_lkrg_fast_hash. The backtraces vary: sometimes they go from init_module, other times the crash occurs in interrupt context. The issue is present in current LKRG, and it's reproducible just the same on LKRG 0.0 (although to build versions 0.4 and below for this test, I had to add a missing #include line - fixed in 0.5+).

On Qubes OS 4.1 with a 5.10.x based kernel, also running paravirtualized, there's no issue - LKRG loads and appears to work just fine (except that it fails to locate lookup_fast, so LKRG won't enforce pCFI validation on 'lookup_fast').

For example, here's the log for current LKRG on a 4.9.x kernel:

[   99.197541] [p_lkrg] Loading LKRG...
[   99.197561] [p_lkrg] System does NOT support SMEP. LKRG can't enforce SMEP validation :(
[   99.197575] [p_lkrg] System does NOT support SMAP. LKRG can't enforce SMAP validation :(
[   99.200423] Freezing user space processes ... (elapsed 0.016 seconds) done.
[   99.691510] [p_lkrg] [kretprobe] register_kretprobe() for <ovl_create_or_link> failed! [err=-2]
[   99.691535] [p_lkrg] Can't hook 'ovl_create_or_link' function. This is expected if you are not using OverlayFS.
[   99.847945] BUG: unable to handle kernel paging request at ffff83035295e000
[   99.847968] IP: [<ffffffffc00ff30a>] p_lkrg_fast_hash+0x7a/0x210 [p_lkrg]
[   99.848005] PGD 62
[   99.848005] PUD 0
[   99.848005] Oops: 0001 [#1] SMP
[   99.848005] Modules linked in: p_lkrg(O+) fuse ip6table_filter ip6_tables xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack intel_rapl x86_pkg_temp_thermal coretemp xen_netfront crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcspkr dummy_hcd udc_core xenfs xen_blkback xen_privcmd u2mfn(O) xen_blkfront
[   99.848005] CPU: 0 PID: 2469 Comm: insmod Tainted: G           O    4.9.56-21.pvops.qubes.x86_64 #1
[   99.848005] task: ffff88008eb29d80 task.stack: ffffc900022e4000
[   99.848005] RIP: e030:[<ffffffffc00ff30a>]  [<ffffffffc00ff30a>] p_lkrg_fast_hash+0x7a/0x210 [p_lkrg]
[   99.848005] RSP: e02b:ffffc900022e7b20  EFLAGS: 00010087
[   99.848005] RAX: 76babb3a016df97c RBX: 0000000000000000 RCX: 0722faec0df5344b
[   99.848005] RDX: 66b0ad39167df362 RSI: 0000000000000000 RDI: ffff83035295e000
[   99.848005] RBP: ffffc900022e7b40 R08: 744d97897d86513e R09: ffff83035295e000
[   99.848005] R10: ffff88009d361400 R11: ffff83035295f000 R12: 1834f0ec13e3235f
[   99.848005] R13: ffffffffc0133000 R14: ffffffffc0128050 R15: 0000000000000001
[   99.848005] FS:  0000706f65f09700(0000) GS:ffff880015000000(0000) knlGS:ffff880015000000
[   99.848005] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[   99.848005] CR2: ffff83035295e000 CR3: 00000000ab34f000 CR4: 0000000000042660
[   99.848005] Stack:
[   99.848005]  ffff880015018ff0 a0d8c2d23dd47253 ffff88009d361400 0000000000000000
[   99.848005]  ffffc900022e7b90 ffffffffc0108a30 a0d8c2d23dd47253 ffff880015018f80
[   99.848005]  ffff88008eb29d80 ffff83035295e000 a0d8c2d23dd47253 0000000080050033
[   99.848005] Call Trace:
[   99.848005]  [<ffffffffc0108a30>] p_dump_x86_metadata+0x70/0x520 [p_lkrg]
[   99.848005]  [<ffffffffc0133000>] ? 0xffffffffc0133000
[   99.848005]  [<ffffffffc010912d>] p_dump_CPU_metadata+0xdd/0x1b0 [p_lkrg]
[   99.848005]  [<ffffffff8112191f>] generic_exec_single+0xaf/0x120
[   99.848005]  [<ffffffffc010a15f>] ? p_create_database+0x7f/0x460 [p_lkrg]
[   99.848005]  [<ffffffffc0109050>] ? p_uninstall_switch_idt_hook+0x80/0x80 [p_lkrg]
[   99.848005]  [<ffffffff81121a5c>] smp_call_function_single+0xcc/0x130
[   99.848005]  [<ffffffff8122081d>] ? __kmalloc+0x16d/0x210
[   99.848005]  [<ffffffffc010a1e6>] p_create_database+0x106/0x460 [p_lkrg]
[   99.848005]  [<ffffffffc013346c>] p_lkrg_register+0x46c/0x1000 [p_lkrg]
[   99.848005]  [<ffffffffc0133000>] ? 0xffffffffc0133000
[   99.848005]  [<ffffffff81002190>] do_one_initcall+0x50/0x190
[   99.848005]  [<ffffffff811bc5bf>] ? free_hot_cold_page+0x1af/0x2b0
[   99.848005]  [<ffffffff8121f37a>] ? kmem_cache_alloc_trace+0xea/0x1d0
[   99.848005]  [<ffffffff811b230f>] ? do_init_module+0x27/0x1ee
[   99.848005]  [<ffffffff811b2347>] do_init_module+0x5f/0x1ee
[   99.848005]  [<ffffffff81127ed6>] load_module+0x1766/0x1b30
[   99.848005]  [<ffffffff811248b0>] ? __symbol_put+0x60/0x60
[   99.848005]  [<ffffffff8112850f>] SYSC_finit_module+0xdf/0x110
[   99.848005]  [<ffffffff8112855e>] SyS_finit_module+0xe/0x10
[   99.848005]  [<ffffffff817e1037>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[   99.848005] Code: b8 73 65 74 79 62 64 65 74 48 b9 75 65 73 70 65 6d 6f 73 48 c1 e3 38 4c 31 c8 4d 31 c4 4c 31 ca 4c 31 c1 4c 39 df 74 61 49 89 f9 <4d> 8b 11 41 b8 02 00 00 00 4c 31 d0 48 01 d1 49 01 c4 48 c1 c2 
[   99.848005] RIP  [<ffffffffc00ff30a>] p_lkrg_fast_hash+0x7a/0x210 [p_lkrg]
[   99.848005]  RSP <ffffc900022e7b20>
[   99.848005] CR2: ffff83035295e000
Adam-pi3 commented 2 years ago

@solardiz Can you generate a kdump file and compile LKRG with debug build?

solardiz commented 2 years ago

Basically, the IDT is unreadable:

[ 2645.301096] [p_lkrg] <p_get_cpus> online[4] possible[4] present[4] active[4] nr_cpu_ids[4]
[ 2645.301127] [p_lkrg] <p_create_database> p_db.p_CPU_metadata_array[0xffff8800a7918a00] with requested size[448] = sizeof(p_CPU_metadata_hash_mem)[112] * p_db.p_cpu.p_nr_cpu_ids[4]
[ 2645.301333] [p_lkrg] <p_dump_IDT_MSR> CPU:[0] IDT => base[0xffff83035296f000] size[0x100] hash[0x0]
[ 2645.301363] [p_lkrg] Reading IDT 1 to verify data:
[ 2645.301388] BUG: unable to handle kernel paging request at ffff83035296f01c
[ 2645.301419] IP: [<ffffffffc013ae98>] p_dump_x86_metadata+0x4d8/0x510 [p_lkrg]

This is with the computation of the IDT hash excluded (otherwise it'd crash right there) and with loglevel=6 (so that we'd get that printout of the IDT base, etc., and a crash in the debugging code that follows). Without loglevel=6, it'd stay up for a little while, but would then see a changing CPU metadata hash (apparently because the IDT base address is part of what's hashed and it somehow keeps changing?)

solardiz commented 2 years ago

Apparently, the returned IDT base is in the hypervisor address range, not the guest kernel's, so it is understandably unreadable from the kernel.

https://googleprojectzero.blogspot.com/2017/04/pandavirtualization-exploiting-xen.html

Adam-pi3 commented 2 years ago

I'm not an expert on PV, but in full virtualization, sidt can be intercepted and its result modified by the hypervisor. Hyper-V can 'fake' the sidt result for user mode (for security reasons, to 'hide' an info leak), but the kernel must always be able to correctly set its own IDT page and be able to migrate it. Even in Xen PV (based on the link which you pasted), the kernel must be able to poke the IDT, so if this VA is shared with the hypervisor, it should still be visible (and read-only) to the kernel. In fact, for performance reasons it is very common to map a hypervisor page with the hypercall routine into the guest. I certainly think this should be investigated more - unless the PV Linux image is modified in such a way that sidt is rewritten to something else. Can we check the kernel image to see if there are special modifications for it?

solardiz commented 2 years ago

Intercepting sidt requires UMIP support (in the CPU and in Xen), which is possibly how/why LKRG just works on the newer system (Xen is documented to have added UMIP support between the versions used on these systems).

Apparently, in a PV kernel, IDT updates go via a hypercall. IDT address reads currently don't (or no longer do?), but they don't appear to be followed by reads of the actual IDT either (so LKRG differs from what Linux itself does there, unless I overlooked something). You can grep the Linux kernel sources for load_idt, store_idt, and HYPERVISOR_set_trap_table. BTW, in Linux around 4.10, there is:

./arch/x86/kernel/paravirt.c:   .store_idt = native_store_idt,
./arch/x86/include/asm/paravirt.h:static inline void store_idt(struct desc_ptr *dtr)
./arch/x86/include/asm/paravirt.h:      PVOP_VCALL1(pv_cpu_ops.store_idt, dtr);
./arch/x86/include/asm/paravirt_types.h:        void (*store_idt)(struct desc_ptr *);

These mentions of store_idt in *paravirt* are gone in current Linux.

solardiz commented 2 years ago

the IDT base address is part of what's hashed and it somehow keeps changing

Confirmed - on the affected system, each logical CPU's view of the IDT base alternates between all CPUs' addresses. This actually makes sense because the virtual CPUs aren't pinned to physical ones, and sidt isn't virtualized here.

solardiz commented 2 years ago

Proposed fix (tested, works):

+++ b/src/modules/database/arch/x86/p_x86_metadata.c
@@ -126,8 +126,16 @@ void p_dump_x86_metadata(void *_p_arg) {
     */
    p_arg[p_curr_cpu].p_size = P_X86_MAX_IDT;

+#if defined(CONFIG_X86_64) && defined(CONFIG_XEN_PVH)
+   if ((unsigned long)p_arg[p_curr_cpu].p_base >= 0xffff800000000000ULL &&
+       (unsigned long)p_arg[p_curr_cpu].p_base <= 0xffff87ffffffffffULL) {
+      p_arg[p_curr_cpu].p_base = 0;
+      p_arg[p_curr_cpu].p_size = 0;
+   }
+#endif
+
    p_arg[p_curr_cpu].p_hash = p_lkrg_fast_hash((unsigned char *)p_arg[p_curr_cpu].p_base,
-                                               (unsigned int)sizeof(p_idt_descriptor) * P_X86_MAX_IDT);
+                                               sizeof(p_idt_descriptor) * p_arg[p_curr_cpu].p_size);

 // DEBUG
 #ifdef P_LKRG_DEBUG
@@ -135,6 +143,7 @@ void p_dump_x86_metadata(void *_p_arg) {
           "<p_dump_IDT_MSR> CPU:[%d] IDT => base[0x%lx] size[0x%x] hash[0x%llx]\n",
           p_arg[p_curr_cpu].p_cpu_id,p_arg[p_curr_cpu].p_base,p_arg[p_curr_cpu].p_size,p_arg[p_curr_cpu].p_hash);

+   if (p_arg[p_curr_cpu].p_size)
    do {
       p_idt_descriptor *p_test;
solardiz commented 2 years ago

BTW, I still don't know whether this issue affects only old Xen, or also recent Xen on old CPUs (without UMIP). This distinction would only matter for us to document the issue/fix more specifically. Maybe @adrelanos has comments that would enable that?

solardiz commented 2 years ago

Also, I think Xen still supports 32-bit PV guests, but my fix above is only for 64-bit. We'd need to test for a different address range on 32-bit, or maybe disable IDT checking unconditionally when CONFIG_XEN_PVH is set but the kernel is 32-bit.

Adam-pi3 commented 2 years ago

Looks like it's exactly as I expected: all 'sidt' results are virtualized in PV, but our LKRG is not aware of it. Based on: https://wiki.xenproject.org/wiki/X86_Paravirtualised_Memory_Management

A Xen guest cannot access the Interrupt Descriptor Table (IDT) directly. Instead Xen maintains the IDT used by the physical hardware and provides guests with a completely virtual IDT. A guest writes entries to its virtual IDT using the [HYPERVISOR_set_trap_table](http://xenbits.xen.org/docs/unstable/hypercall/include,public,arch-x86,xen.h.html#Func_HYPERVISOR_set_trap_table) hypercall. This has the following prototype:
...
The entries of the trap_info struct correspond to the fields of a native IDT entry and each will be validated by Xen before it is used. The hypercall takes an array of traps terminated by an entry where address is zero.

Moreover, the same story applies to the GDT/LDT. In theory we could invoke a hypercall to get the necessary IDT information instead of not verifying it at all - but I'm not sure it isn't too much at this stage.