Open solardiz opened 3 years ago
Hm... I understand that bug happens when LKRG was doing initialization. However, from the logs it looks completely unrelated to LKRG:
2021-05-12T22:05:09.6384729Z [ 10.349523] #PF: supervisor instruction fetch in kernel mode
**2021-05-12T22:05:09.6385430Z [ 10.335978] RIP: 0010:schedule+0x0/0xc0**
2021-05-12T22:05:09.6386299Z [ 10.335978] Code: e9 63 fd ff ff 0f 0b e9 6f fe ff ff 0f 0b e9 b6 fe ff ff e8 e2 6f ff ff 48 c7 c7 70 9e 3a 9f e8 84 fd 4d ff 66 0f 1f 44 00 00 <e8> bb d7 6e 21 55 48 89 e5 41 54 53 65 4c 8b 24 25 c0 7b 01 00 49
2021-05-12T22:05:09.6387281Z [ 10.349523] #PF: error_code(0x0010) - not-present page
from the logs it looks completely unrelated to LKRG:
While the crash isn't in LKRG, maybe LKRG had done something that caused it.
Also, I now think the "BUG: kernel NULL pointer dereference, address: 0000000000000000" diagnostic is wrong. We could want to see why the kernel would think so given these circumstances - if it's in fact wrong, this could be something to fix in the kernel. Further two lines are more likely correct:
[ 10.332983] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 10.335978] #PF: supervisor instruction fetch in kernel mode
[ 10.335978] #PF: error_code(0x0010) - not-present page
The failing instruction, if the Code
lines are correct, is a relative CALL. It's also interesting that this CALL is immediately followed (if it were to return) by a function prologue (push %rbp
and so on). Does this perhaps look like an ftrace probe?
LKRG does hook schedule
. My best guess is it was in the process of hooking it (via kretprobe getting optimized to ftrace) while it was invoked by the kworker
thread. My guess is the hooking is non-atomic, making the code inconsistent/broken for a moment. If so, how do we deal with that? Freeze all tasks, including kernel threads, for the duration of our hooking (and of removal of hooks on LKRG unload, too)? We already do that for userspace processes, but we do it by calling the corresponding kernel functions to freeze/thaw. Are there equivalents that would cover kernel threads?
Hm... Originally, I was thinking it might be related to the other problem (tracepoint) but I think you might be right that it is related to FTRACE. Yes, first instruction should be FTRACE dummy NOP replaced by optimized KRETPROBE hook. However, hooking should(is) atomic and kernel sends IPI to all cores to synchronize the new changes between the cores.
hooking should(is) atomic
I doubt it. There's this:
* text_poke - Update instructions on a live kernel
* @addr: address to modify
* @opcode: source of the copy
* @len: length to copy
*
* Only atomic text poke/set should be allowed when not doing early patching.
* It means the size must be writable atomically and the address must be aligned
* in a way that permits an atomic write. It also makes sure we fit on a single
* page.
Of course, a size of 5 doesn't meet "the size must be writable atomically", and in fact, if I read the code right, ftrace doesn't even appear to call the above function, instead calling:
/**
* text_poke_early - Update instructions on a live kernel at boot time
* @addr: address to modify
* @opcode: source of the copy
* @len: length to copy
*
* When you use this code to patch more than one byte of an instruction
* you need to make sure that other CPUs cannot execute this code in parallel.
* Also no thread must be currently preempted in the middle of these
* instructions. And on the local CPU you need to be protected against NMI or
* MCE handlers seeing an inconsistent instruction while you patch.
*/
void __init_or_module text_poke_early(void *addr, const void *opcode,
size_t len)
{
unsigned long flags;
if (boot_cpu_has(X86_FEATURE_NX) &&
is_module_text_address((unsigned long)addr)) {
/*
* Modules text is marked initially as non-executable, so the
* code cannot be running and speculative code-fetches are
* prevented. Just change the code.
*/
memcpy(addr, opcode, len);
} else {
local_irq_save(flags);
memcpy(addr, opcode, len);
local_irq_restore(flags);
sync_core();
[...]
}
}
As you can see, "you need to make sure that other CPUs cannot execute this code in parallel". I don't see ftrace doing that.
ftrace does use a 5-byte NOP_ATOMIC5
single instruction (I think some kind of dummy LEA with dummy segment override prefix) to interchange it with the CALL instruction, but that on its own is not enough to ensure the code is updated atomically. (I think they could actually read 8 bytes, update 5 of them in a register, and then write 8 bytes with one instruction. With proper alignment, the write could be atomic. But they don't do that.)
I think that on a real CPU, we'd probably hit the problem when one hardware thread of a core installs (or removes) the ftrace hook and another thread on the same core runs the code being patched - then it's the same cache line, so even byte-size updates can propagate individually. On QEMU, we can probably hit the problem for any combination of logical CPUs, if it doesn't simulate the cache.
Hm... from my understanding we don't go to text_poke_early
but to text_poke_queue
:
static int __ref
ftrace_modify_code_direct(unsigned long ip, const char *old_code,
const char *new_code)
{
int ret = ftrace_verify_code(ip, old_code);
if (ret)
return ret;
/* replace the text with the new text */
if (ftrace_poke_late)
text_poke_queue((void *)ip, new_code, MCOUNT_INSN_SIZE, NULL);
else
text_poke_early((void *)ip, new_code, MCOUNT_INSN_SIZE);
return 0;
}
Before any modification in FTRACE happens, kernel must call:
int ftrace_arch_code_modify_prepare(void)
__acquires(&text_mutex)
{
/*
* Need to grab text_mutex to prevent a race from module loading
* and live kernel patching from changing the text permissions while
* ftrace has it set to "read/write".
*/
mutex_lock(&text_mutex);
ftrace_poke_late = 1;
return 0;
}
which forces ftrace_poke_late = 1
. I can see 1 corner case where it might not happens when a new module is coming (not fully formed yet). Similar problem like with the TRACEPOINT, but I added protection on self-loading (the bug is during self-load) to not perform any FTRACE operation, unless LKRG state is running. The corner case is in prepare_coming_module -> ftrace_module_enable
:
* so that we can modify the text.
*/
[1] if (ftrace_start_up)
ftrace_arch_code_modify_prepare();
...
int failed = __ftrace_replace_code(rec, 1);
...
when check at [1] fails we won't call it (ftrace_arch_code_modify_prepare
). Do you think that might happened?
Regardless of that corner case, text_poke_early
is problematic based on what you have analyzed / wrote. Maybe it is worth to send an email to the kernel / FTRACE owners?
We got something similar in https://github.com/openwall/lkrg/pull/94/checks?check_run_id=2632803616
[ OK ] Finished dracut cmdline hook.
Starting dracut pre-udev hook...
[ 9.274372] p_lkrg: loading out-of-tree module taints kernel.
[ 9.275583] p_lkrg: module verification failed: signature and/or required key missing - tainting kernel
[ 9.279985] [p_lkrg] Loading LKRG...
[ 9.280451] [p_lkrg] System does NOT support SMEP. LKRG can't enforce SMEP validation :(
[ 9.280879] [p_lkrg] System does NOT support SMAP. LKRG can't enforce SMAP validation :(
[ 9.355774] Freezing user space processes ... (elapsed 0.002 seconds) done.
[ 9.358815] OOM killer disabled.
[ 10.580500] [p_lkrg] [kretprobe] register_kretprobe() for <ovl_create_or_link> failed! [err=-2]
[ 10.581038] [p_lkrg] Can't hook 'ovl_create_or_link' function. This is expected if you are not using OverlayFS.
[ 10.720478] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 10.723729] #PF: supervisor instruction fetch in kernel mode
[ 10.723729] #PF: error_code(0x0010) - not-present page
[ 10.723729] PGD 0 P4D 0
[ 10.723729] Oops: 0010 [#1] SMP NOPTI
[ 10.723729] CPU: 1 PID: 29 Comm: kcompactd0 Tainted: G OE 5.13.0-051300rc2daily20210520-generic #202105192229
[ 10.723729] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 10.723729] RIP: 0010:0x0
[ 10.723729] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 10.723729] RSP: 0018:ffffad3a40103e28 EFLAGS: 00000082
[ 10.723729] RAX: 0000000000000000 RBX: ffff96ee01358000 RCX: 000000001c000001
[ 10.723729] RDX: 000000001c000000 RSI: 0000000000000246 RDI: ffff96ee3ed1cc00
[ 10.723729] RBP: ffffad3a40103e80 R08: 0000000000000001 R09: ffff96ee3ffdcd60
[ 10.723729] R10: ffff96ee3ffdcd60 R11: 0000000000000009 R12: 00000000fffee57e
[ 10.723729] R13: ffffad3a40103e38 R14: ffffad3a40103ea8 R15: ffff96ee3ffdcd58
[ 10.723729] FS: 0000000000000000(0000) GS:ffff96ee3ed00000(0000) knlGS:0000000000000000
[ 10.723729] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 10.723729] CR2: ffffffffffffffd6 CR3: 000000003544a000 CR4: 00000000000006e0
[ 10.723729] Call Trace:
[ 10.723729] elfcorehdr_read+0x40/0x40
[ 10.723729] ? __bpf_trace_tick_stop+0x10/0x10
[ 10.723729] kcompactd+0x1aa/0x380
[ 10.723729] ? wait_woken+0x80/0x80
[ 10.723729] kthread+0x12f/0x150
[ 10.723729] ? kcompactd_do_work+0x250/0x250
[ 10.723729] ? __kthread_bind_mask+0x70/0x70
[ 10.723729] ret_from_fork+0x22/0x30
[ 10.723729] Modules linked in: p_lkrg(OE+) dm_mirror dm_region_hash dm_log virtio_rng autofs4
[ 10.723729] CR2: 0000000000000000
[ 10.723729] ---[ end trace cc94046c356af151 ]---
ABORT
[ 10.723729] RIP: 0010:0x0
[ 10.723729] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 10.723729] RSP: 0018:ffffad3a40103e28 EFLAGS: 00000082
[ 10.723729] RAX: 0000000000000000 RBX: ffff96ee01358000 RCX: 000000001c000001
[ 10.723729] RDX: 000000001c000000 RSI: 0000000000000246 RDI: ffff96ee3ed1cc00
[ 10.723729] RBP: ffffad3a40103e80 R08: 0000000000000001 R09: ffff96ee3ffdcd60
[ 10.723729] R10: ffff96ee3ffdcd60 R11: 0000000000000009 R12: 00000000fffee57e
[ 10.723729] R13: ffffad3a40103e38 R14: ffffad3a40103ea8 R15: ffff96ee3ffdcd58
[ 10.723729] FS: 0000000000000000(0000) GS:ffff96ee3ed00000(0000) knlGS:0000000000000000
[ 10.723729] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 10.723729] CR2: ffffffffffffffd6 CR3: 000000003544a000 CR4: 00000000000006e0
[ 10.723729] Kernel panic - not syncing: Fatal exception
[ 10.723729] Kernel Offset: 0x4800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 10.723729] ACPI MEMORY or I/O RESET_REG.
@Adam-pi3 I'm not sure I'm reading the code right, but it appears that ftrace_poke_late
is only set briefly and then reset back to 0, and maybe it is never even attempted to be set (ftrace_arch_code_modify_prepare
never called) when k[ret]probes are set (optimized to ftrace).
One more: https://github.com/openwall/lkrg/runs/4485495918?check_suite_focus=true
[ 7.811314] [p_lkrg] Loading LKRG...
[ 7.812317] [p_lkrg] System does NOT support SMEP. LKRG can't enforce SMEP validation :(
[ 7.813514] [p_lkrg] System does NOT support SMAP. LKRG can't enforce SMAP validation :(
[ 7.823854] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 7.827086] OOM killer disabled.
[ 8.779960] [p_lkrg] [kretprobe] register_kretprobe() for <ovl_create_or_link> failed! [err=-22]
[ 8.781100] [p_lkrg] Trying to find ISRA / CONSTPROP name for <ovl_create_or_link>
[ 9.018359] [p_lkrg] [kretprobe] register_kretprobe() for ovl_create_or_link failed and ISRA / CONSTPROP version not found!
[ 9.019783] [p_lkrg] Can't hook 'ovl_create_or_link' function. This is expected if you are not using OverlayFS.
[ 9.133262] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 9.134221] #PF: supervisor instruction fetch in kernel mode
[ 9.134895] #PF: error_code(0x0010) - not-present page
[ 9.135573] PGD 0 P4D 0
[ 9.135966] Oops: 0010 [#1] SMP NOPTI
[ 9.135966] CPU: 0 PID: 10 Comm: rcu_sched Tainted: G OE 5.8.0-25-generic #26-Ubuntu
[ 9.135966] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 9.135966] RIP: 0010:schedule+0x0/0xc0
[ 9.135966] Code: e9 63 fd ff ff 0f 0b e9 6f fe ff ff 0f 0b e9 b6 fe ff ff e8 e2 6f ff ff 48 c7 c7 70 9e 9a 9a e8 84 fd 4d ff 66 0f 1f 44 00 00 <e8> bb 07 25 26 55 48 89 e5 41 54 53 65 4c 8b 24 25 c0 7b 01 00 49
[ 9.135966] RSP: 0018:ffffacb20005fe08 EFLAGS: 00000286
[ 9.135966] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 9.135966] RDX: 000000000dc00000 RSI: 0000000000000246 RDI: 0000000000000246
[ 9.135966] RBP: ffffacb20005fe68 R08: 0000000000000001 R09: 0000000000000040
[ 9.135966] R10: ffffffff9b0697c0 R11: 0000000000000003 R12: 00000000fffee376
[ 9.135966] R13: ffffacb20005fe18 R14: 0000000000000002 R15: 0000000000000001
[ 9.135966] FS: 0000000000000000(0000) GS:ffff92777ec00000(0000) knlGS:0000000000000000
[ 9.135966] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.135966] CR2: 0000000000000000 CR3: 0000000035500000 CR4: 00000000000006f0
[ 9.135966] Call Trace:
[ 9.135966] ? schedule_timeout+0x8b/0x160
[ 9.135966] ? __next_timer_interrupt+0xe0/0xe0
ABORT
[ 9.135966] rcu_gp_fqs_loop+0xe9/0x2c0
[ 9.135966] rcu_gp_kthread+0x8d/0xf0
[ 9.135966] kthread+0x12f/0x150
[ 9.135966] ? rcu_gp_init+0x470/0x470
[ 9.135966] ? __kthread_bind_mask+0x70/0x70
[ 9.135966] ret_from_fork+0x22/0x30
[ 9.135966] Modules linked in: p_lkrg(OE+) dm_mirror dm_region_hash dm_log virtio_rng autofs4
[ 9.135966] CR2: 0000000000000000
[ 9.135966] ---[ end trace b59798badd78379a ]---
[ 9.135966] RIP: 0010:schedule+0x0/0xc0
[ 9.135966] Code: e9 63 fd ff ff 0f 0b e9 6f fe ff ff 0f 0b e9 b6 fe ff ff e8 e2 6f ff ff 48 c7 c7 70 9e 9a 9a e8 84 fd 4d ff 66 0f 1f 44 00 00 <e8> bb 07 25 26 55 48 89 e5 41 54 53 65 4c 8b 24 25 c0 7b 01 00 49
[ 9.135966] RSP: 0018:ffffacb20005fe08 EFLAGS: 00000286
[ 9.135966] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 9.135966] RDX: 000000000dc00000 RSI: 0000000000000246 RDI: 0000000000000246
[ 9.135966] RBP: ffffacb20005fe68 R08: 0000000000000001 R09: 0000000000000040
[ 9.135966] R10: ffffffff9b0697c0 R11: 0000000000000003 R12: 00000000fffee376
[ 9.135966] R13: ffffacb20005fe18 R14: 0000000000000002 R15: 0000000000000001
[ 9.135966] FS: 0000000000000000(0000) GS:ffff92777ec00000(0000) knlGS:0000000000000000
[ 9.135966] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.135966] CR2: 0000000000000000 CR3: 0000000035500000 CR4: 00000000000006f0
[ 9.135966] Kernel panic - not syncing: Fatal exception
[ 9.135966] Kernel Offset: 0x18600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 9.135966] ACPI MEMORY or I/O RESET_REG.
As seen in https://github.com/openwall/lkrg/runs/2570683885