lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org

LKRG causes high CPU on kworker when kernel is built with CONFIG_KFENCE=y #79

Open morfikov opened 3 years ago

morfikov commented 3 years ago

There's a new feature in kernel 5.12 called kfence. When it's enabled with the following config (default):

CONFIG_HAVE_ARCH_KFENCE=y
CONFIG_KFENCE=y
CONFIG_KFENCE_STATIC_KEYS=y
CONFIG_KFENCE_SAMPLE_INTERVAL=100
CONFIG_KFENCE_NUM_OBJECTS=255
CONFIG_KFENCE_STRESS_TEST_FAULTS=0

there are no issues with the system. It works more or less the same way as before. But when the LKRG module is loaded, the kworker/0:2-events process starts to consume a high amount of CPU. On my system it's about 10% overall, or about 20-25% of a single CPU core.

According to the kernel doc, kfence is a low-overhead sampling-based memory safety error detector and is designed to be enabled in production kernels.

solardiz commented 3 years ago

As I recall, kfence was seen in a backtrace in one of the crashes at https://github.com/openwall/lkrg/runs/2693541239, which makes me wonder whether Adam's addition of some cpu_relax(); calls today possibly addressed this issue too. @morfikov Can you re-test? Thanks!

morfikov commented 3 years ago

No, that didn't help.

Adam-pi3 commented 3 years ago

@morfikov Can you dump the stack for the process[es] with high CPU usage? You can do it via cat /proc/PID/stack

If you can provide more than one process's stack, that would be useful.
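A loop like the following could grab them all in one go (a sketch: kworker naming varies, reading /proc/PID/stack requires root, and the file is often empty for idle threads):

```shell
# Dump the kernel stack of every kworker thread (run as root).
for pid in $(ps -eo pid,comm --no-headers | awk '$2 ~ /^kworker/ {print $1}'); do
    echo "== PID $pid =="
    cat "/proc/$pid/stack" 2>/dev/null
done
```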

morfikov commented 3 years ago

There's only one process that utilizes a lot of CPU when LKRG is loaded:

PID USER     GROUP     PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
152 root     root      20   0       0      0      0 R   7.6   0.0   1:45.35 kworker/0:2+events 

Sometimes the file is empty, but when it has content, it looks like this:

# cat /proc/152/stack
[<0>] worker_thread+0xc6/0x3b0
[<0>] kthread+0x138/0x160
[<0>] ret_from_fork+0x22/0x30
Adam-pi3 commented 3 years ago

It looks unrelated to LKRG. LKRG's verification routine runs on an unbound queue, e.g.:

root        7357  2.9  0.0      0     0 ?        I    22:05   0:00 [kworker/u256:6-events_unbound]
root        7359  4.6  0.0      0     0 ?        I    22:05   0:00 [kworker/u256:7-events_unbound]
root        7493  1.6  0.0      0     0 ?        I    22:05   0:00 [kworker/u256:8-events_unbound]
root        7502  0.6  0.0      0     0 ?        I    22:05   0:00 [kworker/u256:10-events_unbound]

Also, from the stack trace it looks like something ran on the events queue (not unbound) but already finished its work, and nothing is there anymore. Sometimes the kernel keeps such processes around if the queue runs relatively often (an optimization to avoid constantly creating and destroying the process). If LKRG were executing and hadn't finished its work, you would see something similar to this on the stack trace:

[<0>] p_check_integrity+0xe4/0x1120 [p_lkrg]

Can you check if you have any unbound queue workers in the process list? Maybe under KFENCE, when LKRG sends an IPI, the algorithm is somehow slower. However, it doesn't look like anything is wrong from the LKRG perspective.

morfikov commented 3 years ago

The only difference in my system when this process consumes ~10% CPU is that LKRG is loaded. The kernel is the same, the config is the same, and the system state is the same. Also, one of my friends confirmed that when kfence is enabled and the lkrg module is loaded, the process instantly starts causing high CPU usage on his machine too. So this issue is definitely related to LKRG somehow.

Can you check if you have any unbound queue in the process list?

Usually there are one or two of the unbound processes. Sometimes ps can't see any of them, but when it does, the PIDs are the same:

#  ps aux | grep -i unbound
root          31  0.1  0.0      0     0 ?        I    07:32   0:00 [kworker/u8:1-events_unbound]
root          32  0.1  0.0      0     0 ?        I    07:32   0:00 [kworker/u8:2-events_unbound]
root          33  0.0  0.0      0     0 ?        I    07:32   0:00 [kworker/u8:3-events_unbound]
root         167  0.4  0.0      0     0 ?        I    07:32   0:01 [kworker/u8:5-events_unbound]
root         170  0.0  0.0      0     0 ?        I    07:32   0:00 [kworker/u8:7-events_unbound]
Adam-pi3 commented 3 years ago

Can you dump the stack for each of the events_unbound processes when you see that issue?

Adam-pi3 commented 3 years ago

The good news is that I think I know what's going on. The bad news is that we need to debate whether we can do anything here...

I've compiled the kernel with the same options you mentioned and was able to repro the issue. In my case, the problematic process was the kworker associated with power management(?). At first, I thought the problem might be related to the random events which LKRG uses to make verification unpredictable. However, when I disabled them, the problem was still there. I was able to catch fragments of the stack trace of the problematic kworker, and I saw this:

[<0>] preempt_schedule_thunk+0x16/0x18
[<0>] kprobe_busy_end+0x33/0x40
[<0>] trampoline_handler+0x68/0x70
[<0>] kretprobe_trampoline+0x25/0x50

kprobe_busy_end is called when the process is being preempted / rescheduled while an unfinished KPROBE handler is associated with it. This was weird, since there shouldn't be such overwhelming activity, so I started to dig in more. I read up on how KFENCE really works, and the documentation says this:

Guarded allocations are set up based on the sample interval. After expiration of the sample interval, the next allocation through the main allocator (SLAB or SLUB) returns a guarded allocation from the KFENCE object pool (allocation sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To “gate” a KFENCE allocation through the main allocator’s fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE.

Static branches use *_JUMP_LABEL under the hood. This means that KFENCE must constantly overwrite the kernel .text section, modifying the kernel itself on every sample interval, which is set to:

CONFIG_KFENCE_SAMPLE_INTERVAL=100

I ran LKRG in debug mode and yes... I saw in the logs a flood of info messages from LKRG's JUMP_LABEL engine:

[Mon May 31 00:02:13 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:13 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:13 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!

I dug in more and saw that KFENCE constantly modifies the following functions:

[Mon May 31 00:10:51 2021] [0xffffffff814a674e] [__kmalloc_track_caller+0x14e/0x420]
[Mon May 31 00:10:51 2021] [0xffffffff814a2d83] [memcg_slab_post_alloc_hook+0x13/0x500]
[Mon May 31 00:10:51 2021] [0xffffffff814a82de] [kmem_cache_free+0x11e/0x6a0]
[Mon May 31 00:10:51 2021] [0xffffffff814cf94c] [uncharge_batch+0xfc/0x250]
[Mon May 31 00:10:51 2021] [0xffffffff814d1df6] [obj_cgroup_charge_pages+0x106/0x160]

I'm using the SLUB allocator, but similar logic applies to any SL*B. In the code, KFENCE ends up here:

static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
{
#ifdef CONFIG_KFENCE_STATIC_KEYS
    if (static_branch_unlikely(&kfence_allocation_key))
#else
    if (unlikely(!atomic_read(&kfence_allocation_gate)))
#endif
        return __kfence_alloc(s, size, flags);
    return NULL;
}

If CONFIG_KFENCE_STATIC_KEYS is not defined, KFENCE shouldn't constantly modify the .text section (however, I didn't test it). So that's the problem... KFENCE constantly modifies the kernel itself, LKRG is hooked into the JUMP_LABEL engine and updates the hash of the newly (legitimately) modified kernel, and we see this problem.

I'm not really sure there is any solution for this, since everything works as designed. The problem is that KFENCE constantly modifies the kernel itself (a weird decision from the security point of view) and LKRG does the job it is supposed to do. What can we do here?

  1. One solution might be that anyone who wants to use KFENCE must accept such a perf problem (not the greatest solution).
  2. Another suggestion might be to disable kernel integrity checking if KFENCE is enabled and benefit only from the process integrity functionality.
  3. There is the kfence.sample_interval boot parameter, which could be raised so that the kernel isn't modified so often...
  4. .. ?

I'm not sure what else we can do... @solardiz what do you think?
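For option 3, a sketch of how the interval could be raised via the boot command line (the GRUB file path and the 2000 ms value are illustrative, not a recommendation):

```
# /etc/default/grub (illustrative): raise the KFENCE sample interval
# from the default 100 ms to 2000 ms, cutting the jump-label patch
# rate (and thus LKRG's re-hashing) by 20x; 0 disables KFENCE.
GRUB_CMDLINE_LINUX_DEFAULT="quiet kfence.sample_interval=2000"
```

On kernels where the parameter is runtime-writable, the same knob should also be exposed as /sys/module/kfence/parameters/sample_interval.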

solardiz commented 3 years ago

It's not just performance overhead. I think there's little value in LKRG's kernel .text integrity checking if LKRG constantly accepts updated hashes, because this means it'd also accept (update the hash for) a concurrent malicious update. So perhaps LKRG's default should be to disable kernel integrity checking when CONFIG_KFENCE_STATIC_KEYS is enabled, and to log a warning message about that.

Alternatively, we could look into excluding JUMP_LABEL related changes from the hash values, so that the same hash value would remain valid, but this is tricky to do on architectures with variable instruction size and would by design provide only partial integrity checking. An advantage, though, is that if this would be the only mode we'd support then we could completely drop our hooks/updates on JUMP_LABEL modifications, so we'd replace rather than add to LKRG's logic. I'm just thinking out loud here, I don't actually recommend this.

solardiz commented 3 years ago

disable kernel integrity checking when CONFIG_KFENCE_STATIC_KEYS and to log a warning message about that.

If we check for CONFIG_KFENCE_STATIC_KEYS at LKRG compile time, then we should also have a compile-time #warning.

solardiz commented 3 years ago

disable kernel integrity checking when CONFIG_KFENCE_STATIC_KEYS

BTW, it's a pity we don't currently have a mode where LKRG would check the integrity of the most important kernel globals, but not of the code. We can fairly easily introduce this mode for SELinux state checking. We might want to also check sys_call_table there, and maybe a few other key structures.

morfikov commented 3 years ago

What can we do here?

1. One solution might be that anyone who wants to use KFENCE must
    accept such a perf problem (not the greatest solution).

I think this is off the table, because distro kernels will probably enable this feature by default since, as they say, "it's designed to be enabled in production kernels."

melver commented 2 years ago

FYI - https://git.kernel.org/torvalds/c/4f612ed3f748962cbef1316ff3d323e2b9055b6e , which is also in 5.15-stable. I expect most distro kernels to eventually disable the static keys / jump labels mode.