lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org
Other
402 stars 72 forks source link

kretprobes.maxactive should depend on number of cores #244

Closed solardiz closed 1 year ago

solardiz commented 1 year ago

Description

@redplait observed that we set kprobes' maxactive to 40 unconditionally, whereas the actual number could perhaps more reasonably depend on the number of cores. He made this commit in a private fork, but just wouldn't make a pull request on his own - so I am doing it for him.

How Has This Been Tested?

Relying on CI only.

solardiz commented 1 year ago

Just force-pushed a minor coding style change to closer match the rest of the source file's.

solardiz commented 1 year ago

The get_kprobe_maxactive function is misplaced in src/modules/database/p_database.c - I'm going to move it to src/modules/exploit_detection/syscalls/p_install.c.

This commit removes a lot of instances of .maxactive = 40,. A concern is whether we do the dynamic assignment for all of those now. I just did a quick check:

$ git show | fgrep -c 'maxactive = 40'
41
$ grep -r ^GENERATE_INSTALL_FUNC . | wc -l
41

This looks consistent to me.

A number of hard-coded 40's remain:

src/modules/database/FTRACE/p_ftrace_enable_sysctl/p_ftrace_enable_sysctl.c:    .maxactive = 40,
src/modules/database/FTRACE/p_ftrace_modify_all_code/p_ftrace_modify_all_code.c:    .maxactive = 40,
src/modules/database/JUMP_LABEL/p_arch_jump_label_transform_apply/p_arch_jump_label_transform_apply.c:    .maxactive = 40,
src/modules/database/JUMP_LABEL/p_arch_jump_label_transform/p_arch_jump_label_transform.c:    .maxactive = 40,
src/modules/database/arch/x86/p_switch_idt/p_switch_idt.c:    .maxactive = 40,
src/modules/database/TRACEPOINT/p_arch_static_call_transform/p_arch_static_call_transform.c:    .maxactive = 40,
src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_override_sync/p_ovl_override_sync.c:   p_ovl_override_sync_kretprobe.maxactive = 40;

Perhaps these also need to be dealt with. And perhaps that would be a reason to make the get_kprobe_maxactive function non-static.

solardiz commented 1 year ago

The get_kprobe_maxactive function is misplaced in src/modules/database/p_database.c - I'm going to move it to src/modules/exploit_detection/syscalls/p_install.c.

Done in another force-push.

solardiz commented 1 year ago

A number of hard-coded 40's remain

Noticed that @redplait had fixed that as part of another commit, but along with unrelated changes in the same commit - will have to extract just the right ones.

solardiz commented 1 year ago

CI is happy about the first commit (20 successful checks). I'm about to add the other, which will then need more work on it.

solardiz commented 1 year ago

I've just added:

commit dcc92ecc0752f9679d680ed8898e09ec11b4d9ab
Author: redp <redplait@gmail.com>
Date:   Wed Oct 19 18:33:52 2022 +0300

    add get_kprobe_maxactive to remaining kprobes

which contains about half of the changes from @redplait's original commit:

commit 104e30daa75a1a362e719b5755032e222acba234
Author: redp <redplait@gmail.com>
Date:   Wed Oct 19 18:33:52 2022 +0300

    add db_kprobes_add/del & get_kprobe_maxactive to remaining kprobes

One issue with that commit is it seems to forget to add usage of get_kprobe_maxactive in src/modules/database/TRACEPOINT/p_arch_static_call_transform/p_arch_static_call_transform.c, even though it added the function prototype to there.

Another issue is the prototype should probably be in a header file rather than added to each C file individually.

We probably also want to remove the initial .maxactive = 40, lines from these files, just like the first commit removed them from most files.

Finally, CI is already unhappy about this commit on many (but not all) distros - I don't yet know why (it was happy about @redplait's original as part of its tree) - will take a look.

solardiz commented 1 year ago

Also, src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_override_sync/p_ovl_override_sync.c isn't yet taken care of.

solardiz commented 1 year ago

CI is already unhappy about this commit on many (but not all) distros - I don't yet know why

That was obvious - I forgot to remove the static.

solardiz commented 1 year ago

Curiously, in this hack we don't set maxactive at all (keeping it at 0), yet it somehow works? -

   /*
    * Linux kernel 5.7+ no longer exports the kallsyms_lookup_name symbol for
    * use from modules.  We reuse the workaround originally introduced in the
    * LTTng module to access that symbol anyway.
    */
   memset(&p_kprobe, 0, sizeof(p_kprobe));
   p_kprobe.pre_handler = p_tmp_kprobe_handler;
   p_kprobe.symbol_name = "kallsyms_lookup_name";
   if ( (p_ret = register_kprobe(&p_kprobe)) < 0) {
solardiz commented 1 year ago

Used this to verify every instance of register_kretprobe is now preceded by assignment to maxactive:

fgrep -wr -B1 register_kretprobe src/
solardiz commented 1 year ago

One issue with that commit is it seems to forget to add usage of get_kprobe_maxactive in src/modules/database/TRACEPOINT/p_arch_static_call_transform/p_arch_static_call_transform.c, even though it added the function prototype to there.

Another issue is the prototype should probably be in a header file rather than added to each C file individually.

We probably also want to remove the initial .maxactive = 40, lines from these files, just like the first commit removed them from most files.

Also, src/modules/exploit_detection/syscalls/override/overlayfs/p_ovl_override_sync/p_ovl_override_sync.c isn't yet taken care of.

I took care of all of these. Also, prefixed the function name with p_ as that's our active convention for now for LKRG's own symbols, especially for symbols that are not local to a file.

Added my Co-authored-by tags to these commits as they're now significantly different from @redplait's originals.

Here's a new concern: is p_db.p_cpu.possible_CPUs already initialized before we install the first kretprobe? Need to check.

solardiz commented 1 year ago

Here's a new concern: is p_db.p_cpu.possible_CPUs already initialized before we install the first kretprobe? Need to check.

And the answer is no! Need to fix the code somehow...

solardiz commented 1 year ago

CI is happy again, but let's not merge this until we've fixed the uninitialized "possible CPUs" count. In my testing, it's 0 for the first 37 kretprobes installed by LKRG on loading, and is only properly initialized for the last 3 (total 40 in this test).

solardiz commented 1 year ago

the actual number could perhaps more reasonably depend on the number of cores.

BTW, I am not 100% convinced this is sound reasoning, and that the number should exactly match the number of logical CPUs, although it does sound like a fine hypothesis.

solardiz commented 1 year ago

Let's just call num_possible_cpus() instead of using p_db.p_cpu.possible_CPUs. Worked right on my test system (with a debugging printk added), force-pushed, now waiting for CI.

solardiz commented 1 year ago

Should have read the kernel documentation first. Here it is:

While the probed function is executing, its return address is
stored in an object of type kretprobe_instance.  Before calling
register_kretprobe(), the user sets the maxactive field of the
kretprobe struct to specify how many instances of the specified
function can be probed simultaneously.  register_kretprobe()
pre-allocates the indicated number of kretprobe_instance objects.

For example, if the function is non-recursive and is called with a
spinlock held, maxactive = 1 should be enough.  If the function is
non-recursive and can never relinquish the CPU (e.g., via a semaphore
or preemption), NR_CPUS should be enough.  If maxactive <= 0, it is
set to a default value.  If CONFIG_PREEMPT is enabled, the default
is max(10, 2*NR_CPUS).  Otherwise, the default is NR_CPUS.

It's not a disaster if you set maxactive too low; you'll just miss
some probes.  In the kretprobe struct, the nmissed field is set to
zero when the return probe is registered, and is incremented every
time the probed function is entered but there is no kretprobe_instance
object available for establishing the return probe.

and the code (also in our oldest supported kernels):

        /* Pre-allocate memory for max kretprobe instances */
        if (rp->maxactive <= 0) {
#ifdef CONFIG_PREEMPTION
                rp->maxactive = max_t(unsigned int, 10, 2*num_possible_cpus());
#else
                rp->maxactive = num_possible_cpus();
#endif
        }

This is right inside register_kretprobe.

Perhaps we don't need to duplicate nor override this logic (introduce our minimum of 40), and can simply pass 0 for maxactive?

(BTW, this and some other writes to rp fields mean we can't set the struct to const anyway.)

solardiz commented 1 year ago

If CONFIG_PREEMPT is enabled, the default is max(10, 2*NR_CPUS).

I wonder why they make it at least 10. Does this mean 2*NR_CPUS is sometimes insufficient? If so, can it possibly be sometimes insufficient on higher-CPU-count systems as well?

solardiz commented 1 year ago

They're now introducing the minimum of 10 for non-CONFIG_PREEMPT as well:

https://lore.kernel.org/all/20221025100117.18667-1-wuqiang.matt@bytedance.com/T/

Default value of maxactive is set as num_possible_cpus() for nonpreemptable
systems. For a 2-core system, only 2 kretprobe instances would be allocated
in default, then these 2 instances for execve kretprobe are very likely to
be used up with a pipelined command.

This is our indication that num_possible_cpus() is sometimes insufficient, but then is 10 or 40 or 2*num_possible_cpus() always sufficient? Doubtful!

Anyway, since their (old) default is now known not to be good enough (and is about to change), perhaps we should in fact be overriding it - we just need to revise what we override it with to be at least as good as the kernel's new default.

solardiz commented 1 year ago

While at it, interesting data+patch on kretprobe scalability: https://github.com/mattwuq/kretprobe https://lore.kernel.org/all/20210830173324.32507-1-wuqiang.matt@bytedance.com/T/

solardiz commented 1 year ago

Now trying:

/*
 * Old Linux default was max(10, 2*num_possible_cpus()) with CONFIG_PREEMPTION,
 * num_possible_cpus() otherwise.  New is max(10, num_possible_cpus()) without
 * CONFIG_PREEMPTION.  LKRG's old default was 40.  Let's use max of these all.
 */
   return max_t(unsigned int, 40, 2*num_possible_cpus());
solardiz commented 1 year ago

I think I'm too quick to refer to Linux's new behavior in that comment. I've just sent a follow-up to the thread on LKML, which might affect what the accepted new behavior actually will be.

solardiz commented 1 year ago

Current understanding: with preemption and/or functions yielding execution on their own (e.g., while waiting for I/O), there isn't a reasonable maximum value we could use that would be guaranteed to be sufficient.

However, switching from kretprobes to kprobes where we can would side-step the issue. So maybe we should? (Not in this PR.)

Right now, we consistently use kretprobes for all of our hooks (except the temporary hook on kallsyms_lookup_name), even when we only hook function entry, where we could have used simple kprobes instead. However, we'd still have many uses of kretprobes, so we wouldn't fully avoid the issue.

solardiz commented 1 year ago

[PATCH v2] kprobes: kretprobe events missing on 2-core KVM guest on LKML accepted my suggestion, and I've just revised the comment in here to avoid inconsistency with likely near-future Linux kernel behavior.