morfikov opened this issue 3 years ago
As I recall, kfence was seen in a backtrace in one of the crashes at https://github.com/openwall/lkrg/runs/2693541239 - and this makes me wonder whether Adam's addition of some cpu_relax() calls today possibly addressed this issue too? @morfikov Can you re-test? Thanks!
No, that didn't help.
@morfikov Can you dump the stack for the process(es) with high CPU usage? You can do it via:
cat /proc/PID/stack
If you can provide more than one process's stack, it would be useful.
There's only one process that utilizes a lot of CPU when LKRG is loaded:
PID USER GROUP PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
152 root root 20 0 0 0 0 R 7.6 0.0 1:45.35 kworker/0:2+events
Sometimes the file is empty, but when it has some content, it looks like this:
# cat /proc/152/stack
[<0>] worker_thread+0xc6/0x3b0
[<0>] kthread+0x138/0x160
[<0>] ret_from_fork+0x22/0x30
It looks unrelated to LKRG. LKRG's verification routine runs on an unbound queue, like:
root 7357 2.9 0.0 0 0 ? I 22:05 0:00 [kworker/u256:6-events_unbound]
root 7359 4.6 0.0 0 0 ? I 22:05 0:00 [kworker/u256:7-events_unbound]
root 7493 1.6 0.0 0 0 ? I 22:05 0:00 [kworker/u256:8-events_unbound]
root 7502 0.6 0.0 0 0 ? I 22:05 0:00 [kworker/u256:10-events_unbound]
Also, from the stack trace it looks like something ran on the events queue (not unbound), but it finished its work and nothing is there anymore. Sometimes the kernel keeps such worker processes around if the queue runs relatively often (an optimization to avoid constantly creating and destroying the process).
If LKRG were executing and hadn't finished its work, you would see something similar to this on the stack trace:
[<0>] p_check_integrity+0xe4/0x1120 [p_lkrg]
Can you check if you have any unbound queue in the process list? Maybe under KFENCE, when LKRG sends an IPI, the algorithm is somehow slower. However, it doesn't look like there is anything wrong from the LKRG perspective.
The only difference in my system when this process consumes ~10% CPU is that LKRG is loaded. The kernel is the same, the config is the same, and the system state is also the same. Also, one of my friends confirmed that when kfence is enabled and the lkrg module is loaded, the process instantly causes high CPU usage on his machine too. So this issue is definitely related to LKRG somehow.
Can you check if you have any unbound queue in the process list?
Usually there are one or two of the unbound processes. Sometimes ps can't see any of them, but when it does, the PIDs are the same:
# ps aux | grep -i unbound
root 31 0.1 0.0 0 0 ? I 07:32 0:00 [kworker/u8:1-events_unbound]
root 32 0.1 0.0 0 0 ? I 07:32 0:00 [kworker/u8:2-events_unbound]
root 33 0.0 0.0 0 0 ? I 07:32 0:00 [kworker/u8:3-events_unbound]
root 167 0.4 0.0 0 0 ? I 07:32 0:01 [kworker/u8:5-events_unbound]
root 170 0.0 0.0 0 0 ? I 07:32 0:00 [kworker/u8:7-events_unbound]
Can you dump the stack for each of the events_unbound processes when you see that issue?
The good news is I think I know what's going on. The bad news is that we would need to debate if we can do anything here...
I've compiled the kernel with the same options which you mentioned and I was able to repro the issue. In my case, the problematic one was the kworker process associated with power management(?). At first, I thought the problem might be related to the random events which LKRG uses to make verification unpredictable. However, when I disabled them, the problem was still there. I was able to catch fragments of the stack trace of the problematic kworker and I saw this:
[<0>] preempt_schedule_thunk+0x16/0x18
[<0>] kprobe_busy_end+0x33/0x40
[<0>] trampoline_handler+0x68/0x70
[<0>] kretprobe_trampoline+0x25/0x50
kprobe_busy_end is called when the process is being preempted / rescheduled while there is an unfinished kprobe handler associated with it. This was weird, since there shouldn't be such overwhelming activity, so I started to dig deeper. I started to read how KFENCE really works and found this in the documentation:
Guarded allocations are set up based on the sample interval. After expiration of the sample interval, the next allocation through the main allocator (SLAB or SLUB) returns a guarded allocation from the KFENCE object pool (allocation sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To “gate” a KFENCE allocation through the main allocator’s fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE.
Static branches use *_JUMP_LABEL under the hood. This means that KFENCE must constantly overwrite the kernel .text section and modify the kernel itself, once per sample interval, which is set to:
CONFIG_KFENCE_SAMPLE_INTERVAL=100
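For context, here is a minimal sketch of the static-keys mechanism this relies on. The key and helper names below are made up for illustration (KFENCE's real key is kfence_allocation_key, visible in the kfence_alloc() snippet further down); the point is that every enable/disable rewrites the branch sites in the kernel .text:

#include <linux/jump_label.h>
#include <linux/slab.h>

/* Illustrative names only - not KFENCE's actual code. */
static DEFINE_STATIC_KEY_FALSE(example_gate);

static void *example_guarded_alloc(size_t size) { return kzalloc(size, GFP_KERNEL); }
static void *example_normal_alloc(size_t size)  { return kzalloc(size, GFP_KERNEL); }

static void *example_alloc(size_t size)
{
        /* Compiled as a NOP (key disabled) or a JMP (key enabled) in .text. */
        if (static_branch_unlikely(&example_gate))
                return example_guarded_alloc(size);
        return example_normal_alloc(size);
}

static void example_sample_tick(void)
{
        /*
         * Each enable/disable patches every branch site in .text. With
         * KFENCE this happens roughly once per sample interval, and that
         * is exactly the legal self-modification LKRG's JUMP_LABEL hook
         * keeps reacting to.
         */
        static_branch_enable(&example_gate);
        /* ... the next allocation takes the guarded path ... */
        static_branch_disable(&example_gate);
}

The sample interval is in milliseconds, so with the default value of 100 the .text gets rewritten many times per second.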
I ran LKRG in debug mode and yes... I saw in the logs a flood of info messages from LKRG's JUMP_LABEL engine:
[Mon May 31 00:02:13 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:13 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:13 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] New modifications => 5
[Mon May 31 00:02:14 2021] [p_lkrg] [JUMP_LABEL <batch mode>] Updating kernel core .text section hash!
I dug in more and saw that KFENCE constantly modifies the following functions:
[Mon May 31 00:10:51 2021] [0xffffffff814a674e] [__kmalloc_track_caller+0x14e/0x420]
[Mon May 31 00:10:51 2021] [0xffffffff814a2d83] [memcg_slab_post_alloc_hook+0x13/0x500]
[Mon May 31 00:10:51 2021] [0xffffffff814a82de] [kmem_cache_free+0x11e/0x6a0]
[Mon May 31 00:10:51 2021] [0xffffffff814cf94c] [uncharge_batch+0xfc/0x250]
[Mon May 31 00:10:51 2021] [0xffffffff814d1df6] [obj_cgroup_charge_pages+0x106/0x160]
I'm using the SLUB allocator, but similar logic applies to any SL*B. In the code, KFENCE ends up here:
static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
{
#ifdef CONFIG_KFENCE_STATIC_KEYS
        if (static_branch_unlikely(&kfence_allocation_key))
#else
        if (unlikely(!atomic_read(&kfence_allocation_gate)))
#endif
                return __kfence_alloc(s, size, flags);
        return NULL;
}
If CONFIG_KFENCE_STATIC_KEYS is not defined, KFENCE shouldn't constantly modify the .text section (however, I didn't test it).
So that's the problem... KFENCE constantly modifies the kernel itself, LKRG is hooked into the JUMP_LABEL engine and updates the hash of the newly (legally) modified kernel, and we see this problem.
I'm not really sure if there is any solution for that, since everything works as designed. The problem is that KFENCE constantly modifies the kernel itself (which is a weird decision from the security point of view) and LKRG does the job it is supposed to do. What can we do here?
1. One of the solutions might be that if someone wants to use KFENCE, they must accept such a perf problem (not the greatest solution).
2. The kfence.sample_interval boot parameter might be changed to not modify the kernel so often (see the example below)...
I'm not sure what else we can do... @solardiz what do you think?
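For reference on option 2 (based on the KFENCE documentation; I haven't measured how much this helps in practice): a longer interval on the kernel command line makes the .text patching correspondingly rarer, and setting it to 0 disables KFENCE at boot entirely, e.g.:
kfence.sample_interval=10000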
It's not just performance overhead. I think there's little value in LKRG's kernel .text integrity checking if LKRG constantly accepts updated hashes, because this means that it'd also accept (update the hash for) a concurrent malicious update. So perhaps LKRG's default should be to disable kernel integrity checking when CONFIG_KFENCE_STATIC_KEYS is enabled, and to log a warning message about that.
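Roughly something like this at initialization time (a minimal sketch only; the flag and message below are placeholders, not LKRG's actual identifiers):

#include <linux/kernel.h>

/* Placeholder for LKRG's real knob controlling kernel .text hashing. */
static int p_example_text_check_enabled = 1;

static void p_example_kfence_policy(void)
{
#ifdef CONFIG_KFENCE_STATIC_KEYS
        /*
         * KFENCE in static-keys mode legally rewrites .text every sample
         * interval, so constantly re-accepting new hashes costs CPU and
         * adds little security value. Default to off and warn.
         */
        p_example_text_check_enabled = 0;
        pr_warn("p_lkrg: CONFIG_KFENCE_STATIC_KEYS is enabled; "
                "disabling kernel .text integrity checking by default\n");
#endif
}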
Alternatively, we could look into excluding JUMP_LABEL related changes from the hash values, so that the same hash value would remain valid, but this is tricky to do on architectures with variable instruction size and would by design provide only partial integrity checking. An advantage, though, is that if this would be the only mode we'd support then we could completely drop our hooks/updates on JUMP_LABEL modifications, so we'd replace rather than add to LKRG's logic. I'm just thinking out loud here, I don't actually recommend this.
disable kernel integrity checking when CONFIG_KFENCE_STATIC_KEYS is enabled, and to log a warning message about that
If we check for CONFIG_KFENCE_STATIC_KEYS at LKRG compile time, then we should also have a compile-time #warning.
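Something like this, assuming the kernel's generated config headers are visible to the module build (they are for out-of-tree builds); the message wording is just an example:

#include <linux/kconfig.h>

#ifdef CONFIG_KFENCE_STATIC_KEYS
#warning "CONFIG_KFENCE_STATIC_KEYS is enabled: KFENCE will legally rewrite the kernel .text on every sample interval, which conflicts with LKRG's kernel integrity checking"
#endif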
disable kernel integrity checking when CONFIG_KFENCE_STATIC_KEYS is enabled
BTW, it's a pity we don't currently have a mode where LKRG would check the integrity of the most important kernel globals, but not of the code. We can fairly easily introduce this mode for SELinux state checking. We might want to also check sys_call_table there, and maybe a few other key structures.
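For the sys_call_table part, a rough sketch of what such a globals-only check could look like. Everything here is illustrative: the names are hypothetical, LKRG has its own symbol resolution and hashing, and resolving sys_call_table (it is not exported) is left out:

#include <linux/kernel.h>
#include <linux/crc32.h>
#include <asm/unistd.h>         /* NR_syscalls (arch-provided) */

/* Hypothetical names; how the address gets resolved is out of scope here. */
static const unsigned long *p_example_sct;      /* resolved sys_call_table */
static u32 p_example_sct_crc;

static u32 p_example_hash_sct(void)
{
        return crc32(0, p_example_sct, NR_syscalls * sizeof(p_example_sct[0]));
}

static void p_example_snapshot_globals(void)
{
        p_example_sct_crc = p_example_hash_sct();
}

static bool p_example_verify_globals(void)
{
        /*
         * Unlike .text under KFENCE's static keys, sys_call_table has no
         * legal reason to change at runtime, so the reference hash stays
         * valid and a mismatch is always suspicious.
         */
        return p_example_hash_sct() == p_example_sct_crc;
}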
What can we do here?
1. One of the solutions might be that if someone wants to use KFENCE, they must accept such a perf problem (not the greatest solution).
I think this is off the table, because distro kernels will probably enable this feature by default, since, as they say, "it's designed to be enabled in production kernels."
FYI - https://git.kernel.org/torvalds/c/4f612ed3f748962cbef1316ff3d323e2b9055b6e , which is also in 5.15-stable. I expect most distro kernels to eventually disable the static keys / jump labels mode.
There's a new feature in kernel 5.12 called kfence. When it's enabled with its default config, there are no issues with the system; it works more or less in the same way as before. But when the LKRG module is loaded, the kworker/0:2-events process starts to consume a high amount of CPU. In my system it's about 10%, or about 20-25% of a single CPU core. According to the kernel doc, kfence "is a low-overhead sampling-based memory safety error detector" and "is designed to be enabled in production kernels".