landlock-lsm / linux

Linux kernel - See Landlock issues
https://git.kernel.org/pub/scm/linux/kernel/git/mic/linux.git/
Other
35 stars 10 forks source link

Performance impact measurement tool #24

Open l0kod opened 9 months ago

l0kod commented 9 months ago

We need tooling to measure the performance impact of kernel changes. Until now, we used simple scripts to get an idea of the worse case scenarios, and manual profiling.

It would be useful to have a way to measure in a predictive way the overhead of sandboxing on a set of scenarios (e.g. path walk) with synthetic benchmarks. Several tracing kernel features might be interesting: tracepoints, eBPF...

We might need kernel changes such as tracepoints.

This measurement tool should be simple to use manually and by a CI (see landlock-test-tools). It would be useful to measure performance improvement changes (#1 and #9).

See seccomp_benchmark.c and other benchmark programs in tools/testing/selftests.

sm1ling-knight commented 7 months ago

I don't think that adding tracepoints is a good solution here. We want to measure the performance impact of kernel changes, i.e. to have information about overhead of Landlock hooks for every required syscall and kernel method (e.g. file_open(), udp_sendmsg()) after every change connected with possible performance overhead. Such a tracepoint would be useless for users.

Possibly, it would be better to use kprobes. However, they have some issues compared to tracepoints:

What do you think?

[1] https://lore.kernel.org/all/20230111065930.1494-1-cuiyunhui@bytedance.com/

We can obtain the information with kretprobe, but as we know, kprobe gets
the result by trappig in an exception, which loses performance compared
to tracepoint.
...
  kr:
Time per request: 43.168 [ms] (mean)
Time per request: 0.043 [ms] (mean, across all concurrent requests)

tracepoint:
Time per request: 41.004 [ms] (mean)
Time per request: 0.041 [ms] (mean, across all concurrent requests
l0kod commented 7 months ago

I think kprobes/kretprobes would be fine from a performance point of view, but that may not be enough to identify requests to specific kernel objects (e.g. the performance impact of opening a specific file). That may not be an issue for benchmark if we can scope the probes to only the benchmarked process though.

I think tracepoints are useful when we need to increment a counter or print variables according to code within a function or with a specific semantic (not necessarily passed as function argument).

The kernel API change and especially the Landlock changes should not be a big issue because the benchmark tool would mostly live with the kernel source code and could be part of the Kselftests and be updated with the kernel code.

gnoack commented 7 months ago

Hello @sm1ling-knight -- just for coordination; did you start working on this bug already?

sm1ling-knight commented 7 months ago

Hello @sm1ling-knight -- just for coordination; did you start working on this bug already?

Hello! Yep, I'm currently working on it.

l0kod commented 7 months ago

Good! We should explicit say when we started or want to work on a task to avoid duplicated work. :wink:

sm1ling-knight commented 7 months ago

@l0kod, do you have any cases of what events would be useful to trace in Landlock? I currently have an idea of tracing landlock_unmask_layers() in order to view frequency of access checks.

I also want to add CPU profiling to measure overhead of Landlock changes. What do you think?

l0kod commented 7 months ago

@l0kod, do you have any cases of what events would be useful to trace in Landlock? I currently have an idea of tracing landlock_unmask_layers() in order to view frequency of access checks.

I expect landlock_unmask_layers() to be very quick to execute, but the number of calls looks like an interesting metric. Same for landlock_find_rule(). Increasing the depth of a path will increase the number of calls and the time taken by these slower functions: is_access_to_paths_allowed() and collect_domain_accesses(). Being able to group the number of landlock_unmask_layers() and landlock_find_rule() calls per caller should help.

On top of that, timing each LSM hook (including the Landlock hook implementations) called per syscall should give a good overview of Landlock's impact on the kernel's execution time.

I also want to add CPU profiling to measure overhead of Landlock changes. What do you think?

That would be very useful!

l0kod commented 7 months ago

For reference, Flame Graphs are a visualization tool (relying on kernel features such as perf) useful to see a performance impact and to debug, but not to quickly compare two implementations nor to measure according to a context (e.g., depth of a file path) in an automatic (and deterministic as much as possible) way.

gnoack commented 6 months ago

In my mind, the percentage of time spent in Landlock would be an interesting metric to capture; In particular, the entry point functions are interesting, which are hit during program execution; that is: (a) the three syscalls, and (b) the LSM hooks which Landlock overrides.

This is already recorded when doing perf record with -g. The following invocation is using the landlock-restrict tool which is part of the go-landlock library, to sandbox the ls command:

perf record --freq=max -g landlock-restrict -3 -rw / -- /bin/ls

It might be wiser to replace /bin/ls with something more long running. With ls, it was difficult to find a sufficiently high number of traces with our functions in them, because it only runs so briefly. I attempted something like find / -exec file '{}' \;, but this is admittedly a bit unrealistic and extreme, and might also spend a lot of time in execve().

The result can then be plotted with Brendan Gregg's Flame graph scripts; I'm using this:

cat ~/bin/,flame
#!/bin/sh

while getopts ":s" opt; do
    case $opt in
        s)
            CMDPREFIX=sudo
    esac
done

FG="${HOME}/git/FlameGraph"
$CMDPREFIX perf script | "${FG}/stackcollapse-perf.pl" | "${FG}/flamegraph.pl" > /tmp/flame.svg
chromium /tmp/flame.svg

or it can be analyzed with perf report, where it is a bit easier to find the numbers.

This is just how far I got with the obvious part. I figured that perf's perf.data output file has all the useful data already, when using sampling. Open question are:

sm1ling-knight commented 6 months ago

I don't think that perf is suitable for our case. Execution time of hooks may be too short for perf to provide sufficient accuracy.

On my system execution of 'file_open' hook takes about 4e-5 seconds (in average) for sandboxed find / command. This hook takes about 40% of the execution time of the openat syscall.

I got these values using bpftrace tool and implementing lsm_hook_enter/lsm_hook_exit tracepoints that wrap each lsm hook.

#define call_void_hook(FUNC, ...)               \
    do {                            \
        struct security_hook_list *P;           \
                                \
        hlist_for_each_entry(P, &security_hook_heads.FUNC, list) { \
            trace_lsm_hook_enter(P->lsmid->id, #FUNC);      \
            P->hook.FUNC(__VA_ARGS__);      \
            trace_lsm_hook_exit(P->lsmid->id, #FUNC);       \
        }                       \
    } while (0)

Tracepoints were used to gather total execution time and total number of calls for 'file_open' hook and openat syscall. You can see bpftrace script similar to the one I used for gathering at the end of the message

It seems that sampling frequency should be at least on the order of 10^-5 seconds for adequate accuracy.

From Brendan Gregg's blog: -F 99: sample at 99 Hertz (samples per second). I'll sometimes sample faster than this (up to 999 Hertz), but that also costs overhead. 99 Hertz should be negligible. Also, the value '99' and not '100' is to avoid lockstep sampling, which can produce skewed results.

It turns out that the sampling frequency that does not involve overhead is on the order of 10^-2. Using perf would be unacceptable in order to measure time taken by hook accurately enough. Moreover, using perf involves problem with tracing static symbols in fs.c, net.c (such as hooks) and extracting CPU percentages (as you mentioned).

I suggest using bpftrace tool for profiling. It seems like quite convenient way to measure execution time for our case with almost zero overhead. Bpftrace allows to add gathering instructions for every required tracing blob (tracepoint, kprobe, uprobe..).

To measure time in this way, following steps are required:

  1. Create tracepoints that I described at the beginning (lsm_hook_enter/lsm_hook_exit)
  2. Write a bpftrace script that will gather the total time spent by the workload in landlock hook and in desirable syscalls. The script might look something like this:
    
    tracepoint:security:lsm_hook_enter / args->lsmid == 110 /
    {
    @lsm_start[tid, str(args->func)] = nsecs;
    }

tracepoint:security:lsm_hook_exit / args->lsmid == 110 / { @ns_lsm[tid, str(args->func)] += (nsecs - @lsm_start[tid, str(args->func)]); @count_lsm[tid, str(args->func)]++; delete(@lsm_start[tid, str(args->func)]); }

tracepoint:syscalls:sys_enter_openat { @sys_openat_start[tid] = nsecs; }

tracepoint:syscalls:sys_exit_openat { @ns_sys_openat[tid] += (nsecs - @sys_openat_start[tid]); @count_sys_openat[tid]++; delete(@sys_openat_start[tid]); }

// Shouldn't be called, since script is halted after workload end of execution. interval:s:300 { exit(); }

l0kod commented 6 months ago

Because we can change the Landlock/kernel code, I don't think we should be limited by any sampling frequency.

Your bpftrace script looks good! I guess we should not need more dependency than this tool.

  • I would love to find out how to extract these CPU percentages directly from perf.data in a scriptable way.

We indeed need all benchmarks to be simply run, eventually by a CI.

However, it would help to also be able to create a flamegraph from these benchmarks, but that could be another iteration of the patches.

  • I would like to have a collection of a few realistic scenarios which we can profile.

    • Idea: Maybe something like https://justine.lol/make/ (creates a new sandbox per compiler invocation, and can be made to run reasonably long; or just override CC= with a sandboxed GCC and use GNU make)
    • Idea: The xzdec tool is using Landlock; maybe we can find a realistic use case where this or another compression library gets invoked?
    • ...and maybe also some more artificial scenarios in the mix.

We should have both microbenchmarks and real-world benchmarks. It looks like bpftrace scripts should be good for both cases.

The main intensive and real-world scenarios that come to mind are:

sm1ling-knight commented 6 months ago

Hello! Can i send the implementation as landlock-test-tools pull request? (when it's ready)

l0kod commented 6 months ago

Hello! Can i send the implementation as landlock-test-tools pull request? (when it's ready)

As for the seccomp benchmark tool, it would help to keep similar tools in the kernel source tree e.g., to easily run performance tests with a kernel CI, and to add reference to performance improvements in commit messages. Moreover, the benchmark scripts may need to evolve along with future kernel versions (e.g. if an LSM hooks change).

I think higher level tooling (e.g. generate flame graphs, compare performance of two kernel versions) would make sense in landlock-test-tools though.

sm1ling-knight commented 6 months ago

Hello! Can i send the implementation as landlock-test-tools pull request? (when it's ready)

As for the seccomp benchmark tool, it would help to keep similar tools in the kernel source tree e.g., to easily run performance tests with a kernel CI, and to add reference to performance improvements in commit messages. Moreover, the benchmark scripts may need to evolve along with future kernel versions (e.g. if an LSM hooks change).

Do you agree that adding tracepoints lsm_hook_enter/lsm_hook_exit to kernel source is rational? Initially, I thought that they would be presented as some kind of *.patch file in landlock-test-tools.

l0kod commented 5 months ago

Do you agree that adding tracepoints lsm_hook_enter/lsm_hook_exit to kernel source is rational? Initially, I thought that they would be presented as some kind of *.patch file in landlock-test-tools.

Yes, that looks reasonable to me.

sm1ling-knight commented 5 months ago

Btw, i decided to try bcc to solve this issue. It's a bit more verbose compared to bpftrace, but it makes it easier to work with command line arguments and custom output (for manual usage). I think it doesn't require a lot of dependencies, so I'll try to make the first patch using bcc. It should be easy to switch into bpftrace, if something goes wrong.

l0kod commented 5 months ago

FYI, patches fixing an issue in perf with Landlock and BTF: