lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org

High CPU load on Debian 12 VM caused by LKRG #301

Open gnd opened 5 months ago

gnd commented 5 months ago

Hello,

we recently upgraded some of our VMs to Debian 12. They are used to run PHP 8.2 for some web apps. However, as soon as we recompiled LKRG against the new kernel and started it, we noticed the CPU reaching 100% very fast. This leads to machine lockups, and the apps become slow and unresponsive.

We tried many things, but LKRG seems to be the issue. Once it is started, the load reaches 100% very fast; once it is turned off, the load falls back to normal within a minute. We run LKRG on dozens of machines, but only the ones running Debian 12 AND PHP have this issue. Older Debian machines with PHP and LKRG have no problem, nor do machines that do not run PHP workloads.

We tried fiddling with the module's parameters, e.g. setting lkrg.profile_validate via sysctl all the way down to 0, but this didn't help.
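For reference, a minimal sketch of the runtime tuning described above (parameter names as exposed by the lkrg module; requires the module to be loaded and root privileges, so this is illustrative only):

```shell
# Lower LKRG's validation aggressiveness at runtime (assumes the lkrg
# kernel module is loaded; writes require root).
sysctl lkrg.profile_validate=0            # weakest validation profile
sysctl -a 2>/dev/null | grep '^lkrg\.'    # inspect all current lkrg knobs
```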

We also tried checking out older LKRG releases and running them, with the same result (specifically commit 7db7483880bf4fb5e3e1046ad688d7d66c2f0ed8). In the current state we can't run LKRG, even though we would like to have it :(

Do you have any ideas what might be wrong, or how we could help you debug this issue? Thanks!

Attached is a screenshot from Grafana, showing the effect of enabling and disabling LKRG three times in a row.

[Screenshot: lkrg_load]

solardiz commented 5 months ago

Thank you for reporting this @gnd. My main two guesses as to what could be causing this are:

  1. Too frequent kernel integrity verification, which LKRG by default performs not only periodically, but also on "random events". However, if you did in fact try lowering lkrg.profile_validate all the way to 0 and that didn't help, this guess is ruled out. You may want to double-check, though, by setting lkrg.kint_validate to a lower value (it should be sufficient to lower it from 3 to 2, but you can also try 0).

  2. Too frequent updates of the kernel's code. The kernel uses self-modifying code for so-called "jump labels", and LKRG keeps track of that. In fact, currently LKRG does so even when lkrg.kint_validate is 0, so that you'd be able to switch from 0 to non-0 later. Maybe we need to add a mode where such tracking is also disabled, or just disable it at 0 and either don't allow switching to non-0 without LKRG reload or perform hash recalculation when switching from 0 to non-0. Maybe we also need to add a way to update hashes to reflect a "jump label" change quickly, without full recalculation, although for that we'd have to use weaker hashing or a large number of hashes (e.g., one hash per 4 KiB).

Per your analysis so far, this is more likely issue 2 above.
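The per-4 KiB hashing idea from guess 2 can be illustrated entirely in userspace (a sketch with stand-in data, not LKRG code): split an image into pages, hash each page separately, and after a one-byte "patch" only the affected page's hash needs recomputation.

```shell
# Userspace sketch of per-page hashing: a single-byte patch (like a
# jump-label flip) invalidates only one 4 KiB page hash, not the whole
# image. Uses GNU coreutils; image.bin is stand-in data, not kernel text.
head -c 65536 /dev/zero > image.bin            # 16 "pages" of zeroes
split -b 4096 -d image.bin page_               # page_00 .. page_15
sha256sum page_* > hashes.before
printf '\001' | dd of=image.bin bs=1 seek=123 conv=notrunc 2>/dev/null
split -b 4096 -d image.bin page_               # re-split the patched image
sha256sum page_* > hashes.after
diff hashes.before hashes.after || true        # only page_00's hash differs
```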

It's puzzling that PHP causes this. It's also puzzling that a "jump label" would presumably be switching back and forth - normally, these are only switched once or very infrequently (on changes to kernel runtime configuration via sysctl or such). This could indicate a minor kernel bug, where what was meant to be an optimization ended up the other way around, since even without LKRG updating the kernel code has some performance cost.
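One way to test the jump-label hypothesis is to count kernel text patches while the workload runs. A hedged sketch follows: the traced symbol names are assumptions that vary by kernel version and architecture, and this needs root with tracefs mounted.

```shell
# Count self-modifying-code events for ~10 s via ftrace (run as root).
# text_poke* are x86 helpers involved in jump-label patching; the exact
# symbols available in set_ftrace_filter differ between kernels.
cd /sys/kernel/debug/tracing
echo 'text_poke*' > set_ftrace_filter
echo function > current_tracer
sleep 10
grep -c text_poke trace        # a rough count of patch events in 10 s
echo nop > current_tracer
echo > set_ftrace_filter
```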

Adam-pi3 commented 5 months ago

Is it possible to see the list of all processes while you have such a spike of CPU usage? If the problem is related to JUMP_LABEL, we should see spikes related to kernel worker threads.

gnd commented 5 months ago

Hello, unfortunately, if you mean the number of kworker processes, their number remained the same. Here is a log:

# ps -ef|grep kworker|grep -v grep|wc -l
37
# systemctl start lkrg; sleep 240; ps -ef|grep kworker|grep -v grep|wc -l
39
# w
 10:28:33 up 5 days, 14:28,  3 users,  load average: 143.80, 74.00, 30.48
# systemctl stop lkrg
solardiz commented 5 months ago

I think Adam meant not the number of those processes, but whether they're the ones actively running on CPU (e.g. per top) during the load spikes. Anyway, you show that the number of kworker processes is way lower than the load average, suggesting that there are many other processes in running state. It would be helpful to see the output of ps axo pid,pcpu,stat,time,wchan:30,comm k -pcpu during one of those load spikes.
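To catch this during a spike, something like the following loop (a sketch; the interval and iteration count are arbitrary) snapshots the top CPU consumers together with their wait channels:

```shell
# Snapshot the top CPU consumers (sorted by %CPU, descending) a few
# times, so the spike can be correlated with specific processes/WCHANs.
for i in 1 2 3; do
    date
    ps axo pid,pcpu,stat,time,wchan:30,comm k -pcpu | head -15
    sleep 2
done > ps_spike.log 2>&1
```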

gnd commented 5 months ago

Hello, attached are two files: one from before enabling LKRG, the second from after LKRG was enabled, when the load reached > 100.

ps_before.txt ps_after.txt

solardiz commented 5 months ago

Thanks @gnd. This is puzzling. We really need the WCHAN field to hopefully figure it out. I don't know why exactly it is empty for you, but perhaps you need to run ps with greater privileges?
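A few hedged guesses for why WCHAN might be blank (the setting names below are common hardening knobs, not confirmed for this system): kernel pointer restrictions or a hidepid /proc mount can censor that field for unprivileged readers.

```shell
# Possible culprits for an empty WCHAN column (check as root):
cat /proc/sys/kernel/kptr_restrict         # 1 or 2 can hide kernel symbols
mount | grep -E 'proc .*hidepid' || true   # hidepid= limits /proc visibility
```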

gnd commented 5 months ago

This might be because of some custom sysctl settings... let me check.