lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org

aarch64: LKRG: FATAL: Can't hook 'arch_jump_label_transform' #351

Open vt-alt opened 1 month ago

vt-alt commented 1 month ago

lkrg suddenly became broken on 6.6.45, 6.10.4, and 6.1.104 (all stable kernels released today).

[00:00:19] ++ insmod lkrg.ko log_level=3
[00:00:19] [    1.032612] lkrg: loading out-of-tree module taints kernel.
[00:00:19] [    1.033327] lkrg: module verification failed: signature and/or required key missing - tainting kernel
[00:00:19] [    1.044763] LKRG: ALIVE: Loading LKRG
[00:00:19] [    1.045508] Freezing user space processes
[00:00:19] [    1.047050] Freezing user space processes completed (elapsed 0.001 seconds)
[00:00:19] [    1.047809] OOM killer disabled.
[00:00:19] [    1.051342] LKRG: ISSUE: [kretprobe] register_kretprobe() for <ovl_dentry_is_whiteout> failed! [err=-2]
[00:00:19] [    1.052404] LKRG: ISSUE: Can't hook 'ovl_dentry_is_whiteout'. This is expected when OverlayFS is not used.
[00:00:19] [    1.054388] LKRG: FATAL: [kretprobe] register_kretprobe() for <arch_jump_label_transform> failed! [err=-2]
[00:00:19] [    1.055477] LKRG: FATAL: Can't hook 'arch_jump_label_transform'
[00:00:19] [    1.056140] LKRG: FATAL: Can't register CPU architecture specific metadata
[00:00:19] [    1.056873] LKRG: FATAL: Can't create database
[00:00:19] [    1.057340] LKRG: DYING: Not loading LKRG (initialization failed)
[00:00:19] [    1.295735] WARNING: Flushing system-wide workqueues will be prohibited in near future.
[00:00:19] [    1.296641] CPU: 7 PID: 141 Comm: insmod Tainted: G           OE      6.6.45-un-def-alt1 #1
[00:00:19] [    1.297570] Hardware name: linux,dummy-virt (DT)
[00:00:19] [    1.298095] Call trace:
[00:00:19] [    1.298378]  dump_backtrace+0xa0/0x148
[00:00:19] [    1.298812]  show_stack+0x20/0x48
[00:00:19] [    1.299188]  dump_stack_lvl+0x48/0x78
[00:00:19] [    1.299603]  dump_stack+0x18/0x30
[00:00:19] [    1.299986]  __warn_flushing_systemwide_wq+0x24/0x48
[00:00:19] [    1.300542]  p_exploit_detection_exit+0x48/0xe0 [lkrg]
[00:00:19] [    1.301148]  p_lkrg_register+0x1a8/0xff8 [lkrg]
[00:00:19] [    1.301670]  do_one_initcall+0x4c/0x320
[00:00:19] [    1.302111]  do_init_module+0x60/0x228
[00:00:19] [    1.302534]  load_module+0x23a0/0x24b0
[00:00:19] [    1.302964]  init_module_from_file+0x90/0xf8
[00:00:19] [    1.303442]  __arm64_sys_finit_module+0x168/0x328
[00:00:19] [    1.303973]  invoke_syscall+0x78/0x128
[00:00:19] [    1.304397]  el0_svc_common.constprop.0+0x48/0x130
[00:00:19] [    1.304938]  do_el0_svc+0x24/0x50
[00:00:19] [    1.305315]  el0_svc+0x48/0x190
[00:00:19] [    1.305672]  el0t_64_sync_handler+0x140/0x150
[00:00:19] [    1.306167]  el0t_64_sync+0x194/0x198
[00:00:19] [    1.306693] WARNING: Flushing system-wide workqueues will be prohibited in near future.
[00:00:19] [    1.307610] CPU: 7 PID: 141 Comm: insmod Tainted: G           OE      6.6.45-un-def-alt1 #1
[00:00:19] [    1.308536] Hardware name: linux,dummy-virt (DT)
[00:00:19] [    1.309057] Call trace:
[00:00:19] [    1.309338]  dump_backtrace+0xa0/0x148
[00:00:19] [    1.309766]  show_stack+0x20/0x48
[00:00:19] [    1.310141]  dump_stack_lvl+0x48/0x78
[00:00:19] [    1.310555]  dump_stack+0x18/0x30
[00:00:19] [    1.310941]  __warn_flushing_systemwide_wq+0x24/0x48
[00:00:19] [    1.311495]  p_offload_cache_delete+0x24/0x68 [lkrg]
[00:00:19] [    1.312073]  p_lkrg_register+0x1b0/0xff8 [lkrg]
[00:00:19] [    1.312594]  do_one_initcall+0x4c/0x320
[00:00:19] [    1.313031]  do_init_module+0x60/0x228
[00:00:19] [    1.313453]  load_module+0x23a0/0x24b0
[00:00:19] [    1.313882]  init_module_from_file+0x90/0xf8
[00:00:19] [    1.314360]  __arm64_sys_finit_module+0x168/0x328
[00:00:19] [    1.314893]  invoke_syscall+0x78/0x128
[00:00:19] [    1.315315]  el0_svc_common.constprop.0+0x48/0x130
[00:00:19] [    1.315858]  do_el0_svc+0x24/0x50
[00:00:19] [    1.316233]  el0_svc+0x48/0x190
[00:00:19] [    1.316589]  el0t_64_sync_handler+0x140/0x150
[00:00:19] [    1.317081]  el0t_64_sync+0x194/0x198
[00:00:19] [    1.322821] OOM killer enabled.
[00:00:19] [    1.323177] Restarting tasks ... done.
[00:00:19] [    1.323696] printk: console [lkrg0] disabled
[00:00:19] insmod: ERROR: could not insert module lkrg.ko: Network dropped connection on reset
[00:00:19] [    1.364901] reboot: Power down
solardiz commented 1 month ago

Thank you for reporting this @vt-alt!

https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.10.4 has this:

commit bea081b0d453f04cff2ffde7265a230065959901
Author: Will Deacon <will@kernel.org>
Date:   Wed Jul 31 14:36:01 2024 +0100

    arm64: jump_label: Ensure patched jump_labels are visible to all CPUs

    [ Upstream commit cfb00a35786414e7c0e6226b277d9f09657eae74 ]

    Although the Arm architecture permits concurrent modification and
    execution of NOP and branch instructions, it still requires some
    synchronisation to ensure that other CPUs consistently execute the newly
    written instruction:

     >  When the modified instructions are observable, each PE that is
     >  executing the modified instructions must execute an ISB or perform a
     >  context synchronizing event to ensure execution of the modified
     >  instructions

    Prior to commit f6cc0c501649 ("arm64: Avoid calling stop_machine() when
    patching jump labels"), the arm64 jump_label patching machinery
    performed synchronisation using stop_machine() after each modification,
    however this was problematic when flipping static keys from atomic
    contexts (namely, the arm_arch_timer CPU hotplug startup notifier) and
    so we switched to the _nosync() patching routines to avoid "scheduling
    while atomic" BUG()s during boot.

    In hindsight, the analysis of the issue in f6cc0c501649 isn't quite
    right: it cites the use of IPIs in the default patching routines as the
    cause of the lockup, whereas stop_machine() does not rely on IPIs and
    the I-cache invalidation is performed using __flush_icache_range(),
    which elides the call to kick_all_cpus_sync(). In fact, the blocking
    wait for other CPUs is what triggers the BUG() and the problem remains
    even after f6cc0c501649, for example because we could block on the
    jump_label_mutex. Eventually, the arm_arch_timer driver was fixed to
    avoid the static key entirely in commit a862fc2254bd
    ("clocksource/arm_arch_timer: Remove use of workaround static key").

    This all leaves the jump_label patching code in a funny situation on
    arm64 as we do not synchronise with other CPUs to reduce the likelihood
    of a bug which no longer exists. Consequently, toggling a static key on
    one CPU cannot be assumed to take effect on other CPUs, leading to
    potential issues, for example with missing preempt notifiers.

    Rather than revert f6cc0c501649 and go back to stop_machine() for each
    patch site, implement arch_jump_label_transform_apply() and kick all
    the other CPUs with an IPI at the end of patching.

    Cc: Alexander Potapenko <glider@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Marc Zyngier <maz@kernel.org>
    Fixes: f6cc0c501649 ("arm64: Avoid calling stop_machine() when patching jump labels")
    Signed-off-by: Will Deacon <will@kernel.org>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: Marc Zyngier <maz@kernel.org>
    Link: https://lore.kernel.org/r/20240731133601.3073-1-will@kernel.org
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

The referenced mainline commit is:

commit cfb00a35786414e7c0e6226b277d9f09657eae74
Author: Will Deacon <will@kernel.org>
Date:   Wed Jul 31 14:36:01 2024 +0100

    arm64: jump_label: Ensure patched jump_labels are visible to all CPUs

    (message body and trailers identical to the backport quoted above, without the "[ Upstream commit ]" line and Sasha Levin's sign-off)

diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
index 4e753908b801..a0a5bbae7229 100644
--- a/arch/arm64/include/asm/jump_label.h
+++ b/arch/arm64/include/asm/jump_label.h
@@ -13,6 +13,7 @@
 #include <linux/types.h>
 #include <asm/insn.h>

+#define HAVE_JUMP_LABEL_BATCH
 #define JUMP_LABEL_NOP_SIZE            AARCH64_INSN_SIZE

 #define JUMP_TABLE_ENTRY(key, label)                   \
diff --git a/arch/arm64/kernel/jump_label.c b/arch/arm64/kernel/jump_label.c
index faf88ec9c48e..f63ea915d6ad 100644
--- a/arch/arm64/kernel/jump_label.c
+++ b/arch/arm64/kernel/jump_label.c
@@ -7,11 +7,12 @@
  */
 #include <linux/kernel.h>
 #include <linux/jump_label.h>
+#include <linux/smp.h>
 #include <asm/insn.h>
 #include <asm/patching.h>

-void arch_jump_label_transform(struct jump_entry *entry,
-                              enum jump_label_type type)
+bool arch_jump_label_transform_queue(struct jump_entry *entry,
+                                    enum jump_label_type type)
 {
        void *addr = (void *)jump_entry_code(entry);
        u32 insn;
@@ -25,4 +26,10 @@ void arch_jump_label_transform(struct jump_entry *entry,
        }

        aarch64_insn_patch_text_nosync(addr, insn);
+       return true;
+}
+
+void arch_jump_label_transform_apply(void)
+{
+       kick_all_cpus_sync();
 }

So it looks like the arch_jump_label_transform symbol is gone, also in mainline (but we didn't notice because our CI mainline test is for x86_64 only).

#define HAVE_JUMP_LABEL_BATCH disables usage of this symbol by kernel/jump_label.c.

solardiz commented 1 month ago

Interestingly, #define HAVE_JUMP_LABEL_BATCH is also defined on x86, and has been since 2019. But on x86 both kinds of symbols are defined, so our hooking succeeds. Are we hooking the right function, though, one that's actually used where we care? Probably we are: also since 2019, starting with 29867f6ecdd487940454433b6c6b04529412b808, we additionally hook arch_jump_label_transform_apply.

So should we simply stop hooking arch_jump_label_transform on newer kernels? Only on aarch64 or also on x86?

@Adam-pi3 Please take a look.

vt-alt commented 1 month ago

also in mainline (but we didn't notice because our CI mainline test is for x86_64 only).

BTW, GitHub Actions got arm64 runners in public beta: "These runners are available to our customers on our GitHub Team and Enterprise Cloud plans. We expect to begin offering Arm runners for open source projects by the end of the year."

vt-alt commented 1 month ago

Is there any work on this? Because in ALT building a kernel depends on all modules being buildable, I will be forced to temporarily remove the LKRG kernel module to allow kernel updates, which users expect every week. This is normal and non-permanent (and not threatening), but in the meantime LKRG will not be tested there. Perhaps LKRG users will choose not to update the kernel until the module reappears.

solardiz commented 1 month ago

Yes, I'll experiment with this myself some more and likely propose a fix unless @Adam-pi3 beats me to it.

solardiz commented 1 month ago

Looks like I won't fix this without @Adam-pi3's involvement, at least not as the quick fix I was hoping for. The main reason is that our hook for arch_jump_label_transform_apply checks for specific x86 opcodes, so enabling this hook is not sufficient: we also need to add aarch64-specific checks to the function.

A smaller issue is that, while on current kernels we don't seem to need to hook arch_jump_label_transform at all, on some older kernels with batch support (so between 2019 and now) it could also be invoked via arch_jump_label_transform_static. It looks like this was the case until this commit:

commit 7e6b9db27de9f69a705c1a046d45882c768e16c3
Author: Ard Biesheuvel <ardb@kernel.org>
Date:   Wed Jun 15 17:41:42 2022 +0200

    jump_label: make initial NOP patching the special case

which includes:

+++ b/kernel/jump_label.c
@@ -332,17 +332,13 @@ static int __jump_label_text_reserved(struct jump_entry *iter_start,
        return 0;
 }

-/*
- * Update code which is definitely not currently executing.
- * Architectures which need heavyweight synchronization to modify
- * running code can override this to make the non-live update case
- * cheaper.
- */
-void __weak __init_or_module arch_jump_label_transform_static(struct jump_entry *entry,
-                                           enum jump_label_type type)
+#ifndef arch_jump_label_transform_static
+static void arch_jump_label_transform_static(struct jump_entry *entry,
+                                            enum jump_label_type type)
 {
-       arch_jump_label_transform(entry, type);
+       /* nothing to do on most architectures */
 }
+#endif

So maybe our logic should be a bit trickier than disabling the hooking of arch_jump_label_transform if batch mode is supported.

OTOH, I think we can simplify this check in p_arch_jump_label_transform_apply.h:

#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,3,0) || \
   (defined(RHEL_RELEASE_CODE) && RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(8, 2))

to:

#ifdef HAVE_JUMP_LABEL_BATCH

instead of complicating it with a different kernel version check on aarch64 now.

Adam-pi3 commented 1 month ago

So should we simply stop hooking arch_jump_label_transform on newer kernels? Only on aarch64 or also on x86?

Some distros with older kernels didn't switch to the BATCH hooking, but other old kernels did. It's a mixed scenario, so we still need to keep both pieces of logic. I'm not sure whether distros backported HAVE_JUMP_LABEL_BATCH; maybe we can simplify the check with this macro, interesting.

The main reason why not is our hook for arch_jump_label_transform_apply checks for specific x86 opcodes

It looks like the ARM version of batch hooking is much simpler than the x86 one. I guess we can try hooking arch_jump_label_transform_queue for ARM batch mode and run the same logic as for the arch_jump_label_transform hook. I will try to set up my Raspberry Pi and play with it when I have time. The problem is that I'm going to Poland next weekend, which may affect my schedule. However, from the analysis it looks like what I described should work.

solardiz commented 1 month ago

try hooking arch_jump_label_transform_queue for ARM batch mode and run the same logic as for arch_jump_label_transform hook.

I haven't yet looked into the details of this, but my understanding was that queue would have been too early, which is why you chose to hook apply instead. And I doubt that our hooking for batch mode support should differ between archs, because Linux's own usage of these functions does not differ between archs except through HAVE_JUMP_LABEL_BATCH.

But maybe on aarch64 we can in fact run the same logic as for transform on transform_apply.

Adam-pi3 commented 1 month ago

I was surprised, but on x86 that's the case, while on ARM the logic is a bit different. Queue does the memory modification:

bool arch_jump_label_transform_queue(struct jump_entry *entry,
                     enum jump_label_type type)
{
...
    aarch64_insn_patch_text_nosync(addr, insn);
...
}

where:

int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
{
...
    ret = aarch64_insn_write(tp, insn);
...
}

where:

int __kprobes aarch64_insn_write(void *addr, u32 insn)
{
    return __aarch64_insn_write(addr, cpu_to_le32(insn));
}

and real modification is here:

static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
{
    void *waddr = addr;
    unsigned long flags = 0;
    int ret;

    raw_spin_lock_irqsave(&patch_lock, flags);
    waddr = patch_map(addr, FIX_TEXT_POKE0);

    ret = copy_to_kernel_nofault(waddr, &insn, AARCH64_INSN_SIZE);

    patch_unmap(FIX_TEXT_POKE0);
    raw_spin_unlock_irqrestore(&patch_lock, flags);

    return ret;
}

So copy_to_kernel_nofault makes the changes, while apply on ARM synchronizes the cores via a memory barrier and a cross-CPU call:

void arch_jump_label_transform_apply(void)
{
    kick_all_cpus_sync();
}

where:

/**
 * kick_all_cpus_sync - Force all cpus out of idle
 *
 * Used to synchronize the update of pm_idle function pointer. It's
 * called after the pointer is updated and returns after the dummy
 * callback function has been executed on all cpus. The execution of
 * the function can only happen on the remote cpus after they have
 * left the idle function which had been called via pm_idle function
 * pointer. So it's guaranteed that nothing uses the previous pointer
 * anymore.
 */
void kick_all_cpus_sync(void)
{
    /* Make sure the change is visible before we kick the cpus */
    smp_mb();
    smp_call_function(do_nothing, NULL, 1);
}
EXPORT_SYMBOL_GPL(kick_all_cpus_sync);

One problem may occur when, on an SMP machine, one core is running the verification routine while another does the modification and the cores are not yet in sync. However, not updating the database after the queue would certainly be wrong, so we should do it there. We need to play with it and see whether this memory barrier is really needed (in theory, we can invoke the synchronization ourselves).

Adam-pi3 commented 4 weeks ago

@vt-alt does ALT Linux support Raspberry Pi 4? I'm trying to set up a testing environment, and Ubuntu works fine, but their PPA kernels are broken: they release linux-headers-generic, which depends on 'linux-headers', which they didn't publish: https://kernel.ubuntu.com/mainline/v6.10.4/

dpkg: dependency problems prevent configuration of linux-headers-6.10.4-061004-generic:
 linux-headers-6.10.4-061004-generic depends on linux-headers-6.10.4-061004; however:
  Package linux-headers-6.10.4-061004 is not installed.

If ALT Linux has a pre-baked Raspberry Pi image which I can use, and I can then pull the desired kernel version with dev/headers packages to build LKRG, it would make my life a lot easier.

vt-alt commented 4 weeks ago

@vt-alt does ALT linux support Raspberry Pi 4?

I am told we have full support. For example, any .img.xz from this list https://nightly.altlinux.org/sisyphus-aarch64/tested/ can be written to an SD card, etc. There is also "Simply Linux", but I would suggest Regular-s from the link above, as it's based on Sisyphus, which is our "unstable" repository with everything latest. I am also trying to get mainline kernels into Sisyphus, but that's not finished yet. Linux kernel packages differ in that we don't install them with apt-get but with the update-kernel tool. So, as of today, you would install 6.11-rc4 with update-kernel -t 6.11 --headers.

About Ubuntu mainline packages: we already use them successfully in CI scripts. Perhaps you should just install the linux-headers .deb on the same command line as the linux-image package itself.

Adam-pi3 commented 4 weeks ago

I created a test patch (https://github.com/lkrg-org/lkrg/pull/352) implementing the fix/logic described here. It looks like it works. However, I would NOT merge this PR yet. @vt-alt, can you run some tests on your end too? @solardiz, when @vt-alt also confirms it works, I guess we can merge it and see if any new problems pop up. I tested the fix by running it on a Raspberry Pi 4 with two parallel screens:

screen 1: run while true; do echo 1 > /sys/kernel/debug/tracing/events/enable && echo 0 > /sys/kernel/debug/tracing/events/enable; done which enforces JUMP_LABEL batch mode
screen 2: run while true; do sysctl lkrg.trigger=1; sleep 1; done to enforce integrity validation

I let it run for a few hours and didn't see any problems so far... Probably we need a bit more comprehensive tests anyway.

vt-alt commented 3 weeks ago

Thanks. I will test it today.

vt-alt commented 3 weeks ago

I tested the fix from refs/pull/352/head with the test loops as described (from 13:41:31 to 15:55:12 UTC, when my xcpu limit was reached), and no problems appeared.

Thanks much!

solardiz commented 3 weeks ago

I was surprised but on x86 that's the case but on ARM the logic is a bit different. Queue does the memory modification:

OK, but do we have to do things differently? Wouldn't it be more correct and safer to hook apply rather than queue? Just like the rest of the kernel uses these arch-specific functions without such knowledge of their internals, maybe we shouldn't rely on it either. This would probably also avoid the potential issue you described with us needing the memory barrier: if we hook the return from apply, the barrier will have already occurred.

solardiz commented 3 weeks ago

However, do not updating database after the queue will certainly be wrong so we should do it there.

Why would postponing this to apply be wrong? Our verification is supposed to run after we've acquired a lock anyway, does the kernel acquire the same lock before jump label modifications? When does it release the lock - perhaps after apply?