lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org
Other
402 stars 72 forks source link

RHEL 9.1 _stext hash changed unexpectedly #256

Closed solardiz closed 1 year ago

solardiz commented 1 year ago

On a machine at Equinix Metal running:

[root@c3-small-x86-01 x]# cat /etc/redhat-release
Rocky Linux release 9.1 (Blue Onyx)
[root@c3-small-x86-01 x]# uname -mrs
Linux 5.14.0-162.6.1.el9_1.0.1.x86_64 x86_64

testing LKRG with kint_enforce=1, I got:

[  660.318386] LKRG: ALIVE: Loading LKRG
[  660.349204] Freezing user space processes ... (elapsed 0.001 seconds) done.
[  660.357611] OOM killer disabled.
[  660.508989] LKRG: ISSUE: [kretprobe] register_kretprobe() for <ovl_dentry_is_whiteout> failed! [err=-2]
[  660.518402] LKRG: ISSUE: Can't hook 'ovl_dentry_is_whiteout'. This is expected when OverlayFS is not used.
[  660.696901] LKRG: ALIVE: LKRG initialized successfully
[  660.702084] OOM killer enabled.
[  660.705266] Restarting tasks ... done.
[  691.292074] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[  691.301987] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly

However, then the system just works with that message not repeating so far (in another 20 minutes after the above).

solardiz commented 1 year ago

After the experiment in #257, the two ALERTs above now keep re-appearing even after I no longer run the trigger and the load is back to nearly zero.

solardiz commented 1 year ago

Tried to reproduce this with log_level=4, but no luck in an hour so far.

solardiz commented 1 year ago

Just reproduced this with log_level=4 on a different machine (Rocky Linux 9, 5.14.0-162.6.1.el9_1.0.1.x86_64, bare metal AMD EPYC, 48 logical CPUs, running one instance of while :; do sysctl lkrg.trigger=1; sleep 0.1; done):

[ 3738.820363] LKRG: ALIVE: System is clean
[ 3738.869775] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 3738.874094] LKRG: WATCH: Removing pid 34161
[ 3738.882441] LKRG: WATCH: Updating pid 34162
[ 3738.887287] LKRG: WATCH: Inserting pid 34162
[ 3738.892411] LKRG: WATCH: Updating pid 34162
[ 3738.898163] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 3738.898191] LKRG: WATCH: Removing pid 34162
[ 3738.908920] LKRG: WATCH: Hash of CPU metadata expected 0xacc9aeef40f6cd27 vs. actual 0xacc9aeef40f6cd27
[ 3738.909728] LKRG: WATCH: Updating pid 34163
[ 3738.919667] LKRG: WATCH: Exception table hash expected 0xf067711f70a14b5d vs. actual 0xf067711f70a14b5d
[ 3738.924593] LKRG: WATCH: Inserting pid 34163
[ 3738.939979] LKRG: WATCH: Updating pid 34163
[ 3738.942151] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 3738.952476] LKRG: WATCH: _stext hash expected 0x4b5f369d905dd2a7 vs. actual 0x4eefed501cc4902f
[ 3738.966447] LKRG: WATCH: _rodata hash expected 0x36899303d29adcac vs. actual 0x36899303d29adcac
[ 3738.975923] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 3738.984350] LKRG: WATCH: Hash from 'module list' => [0xcf9e04d7ae808c32]
[ 3738.991677] LKRG: WATCH: Hash from 'module kobj(s)' => [0xaf78a46fbcaf97a3]
[ 3738.999389] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 3739.015043] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 3739.045672] LKRG: WATCH: Removing pid 34163
[ 3739.050900] LKRG: WATCH: Updating pid 34164
[ 3739.055720] LKRG: WATCH: Inserting pid 34164
[ 3739.060921] LKRG: WATCH: Updating pid 34164
[ 3739.066778] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 3739.066802] LKRG: WATCH: Removing pid 34164
[ 3739.077555] LKRG: WATCH: Hash of CPU metadata expected 0xacc9aeef40f6cd27 vs. actual 0xacc9aeef40f6cd27
[ 3739.078375] LKRG: WATCH: Updating pid 34165
[ 3739.088293] LKRG: WATCH: Exception table hash expected 0xf067711f70a14b5d vs. actual 0xf067711f70a14b5d
[ 3739.093240] LKRG: WATCH: Inserting pid 34165
[ 3739.108737] LKRG: WATCH: Updating pid 34165
[ 3739.110835] LKRG: WATCH: _stext hash expected 0x4b5f369d905dd2a7 vs. actual 0x4b5f369d905dd2a7
[ 3739.127940] LKRG: WATCH: _rodata hash expected 0x36899303d29adcac vs. actual 0x36899303d29adcac
[ 3739.137459] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 3739.145825] LKRG: WATCH: Hash from 'module list' => [0xcf9e04d7ae808c32]
[ 3739.153394] LKRG: WATCH: Hash from 'module kobj(s)' => [0xaf78a46fbcaf97a3]
[ 3739.161061] LKRG: ALIVE: System is clean

In the above snippet, _stext hash temporarily became different from what's expected, then got reverted to what's expected. Looks like the same happened a few more times later, until finally the difference became permanent:

[root@c2-medium-x86-01 ~]# dmesg -s 9999999 | fgrep ALERT | head
[ 3933.213500] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 3933.270925] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 3994.090625] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 3994.148058] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 3998.304149] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 3998.361582] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 3998.472820] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 3998.530448] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 3998.641604] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 3998.699276] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly

3933 is when this was still temporary, so no more alerts until 3994 (thus, for a minute). Apparently, 3994 was still temporary as well. But after 3998 it became permanent, continuing further like this:

[ 4027.013784] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 4027.125154] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 4027.182779] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 4027.293954] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 4027.351607] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 4027.462991] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 4027.520631] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 4027.631852] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 4027.689607] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly

and after I interrupt the while command, like this:

[ 4314.534828] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 4314.588780] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 4329.894976] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 4329.949044] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 4345.254455] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 4345.309174] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 4360.614425] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 4360.668523] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly

so detected on every check, which are at the usual 15-second intervals, and in detail:

[ 4468.082792] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 4468.106103] LKRG: WATCH: Hash of CPU metadata expected 0xacc9aeef40f6cd27 vs. actual 0xacc9aeef40f6cd27
[ 4468.116322] LKRG: WATCH: Exception table hash expected 0xf067711f70a14b5d vs. actual 0xf067711f70a14b5d
[ 4468.133828] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 4468.141411] LKRG: WATCH: _stext hash expected 0x4b5f369d905dd2a7 vs. actual 0x4eefed501cc4902f
[ 4468.155456] LKRG: WATCH: _rodata hash expected 0x36899303d29adcac vs. actual 0x36899303d29adcac
[ 4468.164894] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 4468.173245] LKRG: WATCH: Hash from 'module list' => [0xcf9e04d7ae808c32]
[ 4468.180637] LKRG: WATCH: Hash from 'module kobj(s)' => [0xaf78a46fbcaf97a3]
[ 4468.188265] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly

It's still the same hash values (_stext hash expected 0x4b5f369d905dd2a7 vs. actual 0x4eefed501cc4902f) as were seen at 3738.

Perhaps we don't handle some jump label modification, which is sometimes temporary and sometimes permanent (maybe it's some kind of self-optimization the kernel makes).

solardiz commented 1 year ago

maybe it's some kind of self-optimization the kernel makes

This would also explain why I wasn't able to re-trigger the problem after unloading and reloading LKRG (to enable log_level=4) on a non-rebooted system - perhaps the optimization was already in place and wouldn't change further. This time, I had enabled log_level=4 before the problem was triggered for the first time in this system's uptime.

Adam-pi3 commented 1 year ago

@solardiz can you repro this issue with P_LKRG_JUMP_LABEL_STEXT_DEBUG compilation? You can uncomment it here: https://github.com/lkrg-org/lkrg/blob/main/src/modules/print_log/p_lkrg_print_log.h#L35

solardiz commented 1 year ago

@Adam-pi3 Somehow P_LKRG_JUMP_LABEL_STEXT_DEBUG does not result in any additional messages, at least with the default log level (which you said elsewhere would be enough with that setting):

[  600.848611] LKRG: ALIVE: Loading LKRG
[  600.880004] Freezing user space processes ... (elapsed 0.001 seconds) done.
[  600.889658] OOM killer disabled.
[  601.038426] LKRG: ISSUE: [kretprobe] register_kretprobe() for <ovl_dentry_is_whiteout> failed! [err=-2]
[  601.048251] LKRG: ISSUE: Can't hook 'ovl_dentry_is_whiteout'. This is expected when OverlayFS is not used.
[  601.233602] LKRG: ALIVE: LKRG initialized successfully
[  601.239227] OOM killer enabled.
[  601.242799] Restarting tasks ... done.
[  616.735937] LKRG: STATE: Enabling 'heartbeat'
[  631.899464] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[  631.921665] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[  632.113761] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[  632.137019] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[  632.153992] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[  632.177410] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[  647.262557] LKRG: ALIVE: System is clean
[  662.260741] LKRG: ALIVE: System is clean
[  662.277481] LKRG: ALIVE: System is clean
solardiz commented 1 year ago

With log_level=4:

[ 1017.490309] LKRG: ALIVE: System is clean
[ 1017.490321] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 1017.494930] LKRG: WATCH: Inserting pid 51029
[ 1017.495002] LKRG: WATCH: Updating pid 51029
[ 1017.499493] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 1017.499502] LKRG: WATCH: Removing pid 51028
[ 1017.506598] LKRG: WATCH: Updating pid 51029
[ 1017.511239] LKRG: WATCH: Updating pid 51030
[ 1017.518428] LKRG: WATCH: Updating pid 51029
[ 1017.521852] LKRG: WATCH: Hash of CPU metadata expected 0xdc35e1e9ad37a9c3 vs. actual 0xdc35e1e9ad37a9c3
[ 1017.521860] LKRG: WATCH: Exception table hash expected 0x77d60aac450b9323 vs. actual 0x77d60aac450b9323
[ 1017.522896] LKRG: WATCH: Inserting pid 51030
[ 1017.527572] LKRG: WATCH: Removing pid 51029
[ 1017.528036] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 1017.534472] LKRG: WATCH: Updating pid 51030
[ 1017.536650] LKRG: WATCH: copy[0xf] vs now[0xffffffe9] offset[9029634 | 0x89c802] symbol[dev_queue_xmit_nit]
[ 1017.536651] LKRG: WATCH: copy[0x1f] vs now[0xffffff81] offset[9029635 | 0x89c803] symbol[dev_queue_xmit_nit]
[ 1017.536653] LKRG: WATCH: copy[0x44] vs now[0x0] offset[9029636 | 0x89c804] symbol[dev_queue_xmit_nit]
[ 1017.536663] LKRG: WATCH: copy[0x66] vs now[0xffffffeb] offset[9039821 | 0x89efcd] symbol[netif_rx_internal]
[ 1017.536663] LKRG: WATCH: copy[0xffffff90] vs now[0x39] offset[9039822 | 0x89efce] symbol[netif_rx_internal]
[ 1017.536680] LKRG: WATCH: copy[0xf] vs now[0xffffffe9] offset[9054862 | 0x8a2a8e] symbol[__netif_receive_skb_core.constprop.0]
[ 1017.536681] LKRG: WATCH: copy[0x1f] vs now[0xffffffde] offset[9054863 | 0x8a2a8f] symbol[__netif_receive_skb_core.constprop.0]
[ 1017.536681] LKRG: WATCH: copy[0x44] vs now[0x2] offset[9054864 | 0x8a2a90] symbol[__netif_receive_skb_core.constprop.0]
[ 1017.536687] LKRG: WATCH: copy[0x66] vs now[0xffffffeb] offset[9059748 | 0x8a3da4] symbol[netif_receive_skb]
[ 1017.536688] LKRG: WATCH: copy[0xffffff90] vs now[0x38] offset[9059749 | 0x8a3da5] symbol[netif_receive_skb]
[ 1017.536690] LKRG: WATCH: copy[0xf] vs now[0xffffffe9] offset[9060663 | 0x8a4137] symbol[netif_receive_skb_list_internal]
[ 1017.536691] LKRG: WATCH: copy[0x1f] vs now[0xffffffaf] offset[9060664 | 0x8a4138] symbol[netif_receive_skb_list_internal]
[ 1017.536691] LKRG: WATCH: copy[0x44] vs now[0x1] offset[9060665 | 0x8a4139] symbol[netif_receive_skb_list_internal]
[ 1017.539037] LKRG: WATCH: Updating pid 51031
[ 1017.542200] LKRG: WATCH: _stext hash expected 0x4616d9c67c146959 vs. actual 0xf6343c558d7b8e90
[ 1017.546252] LKRG: WATCH: _rodata hash expected 0x9dd37fb1edec5ce vs. actual 0x9dd37fb1edec5ce
[ 1017.549430] LKRG: WATCH: Inserting pid 51031
[ 1017.553931] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 1017.558497] LKRG: WATCH: Updating pid 51031
[ 1017.562915] LKRG: WATCH: Hash from 'module list' => [0x59dbf96a4c98278d]
[ 1017.567423] LKRG: WATCH: Updating pid 51031
[ 1017.577130] LKRG: WATCH: Hash from 'module kobj(s)' => [0xb620a746161a0522]
[ 1017.586886] LKRG: WATCH: Updating pid 51031
[ 1017.591437] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 1017.596759] LKRG: WATCH: Updating pid 51032
[ 1017.643891] LKRG: WATCH: Removing pid 51030
[ 1017.647509] LKRG: WATCH: Inserting pid 51032
[ 1017.657662] LKRG: WATCH: Updating pid 51033
[ 1017.669311] LKRG: WATCH: Updating pid 51032
[ 1017.681183] LKRG: WATCH: Inserting pid 51033
[ 1017.692805] LKRG: WATCH: Removing pid 51032
[ 1017.702827] LKRG: WATCH: Updating pid 51033
[ 1017.713932] LKRG: WATCH: Updating pid 51034
[ 1017.856998] LKRG: WATCH: Inserting pid 51034
[ 1017.861717] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 1017.861759] LKRG: WATCH: Removing pid 51033
[ 1017.861953] LKRG: WATCH: Updating pid 51034
[ 1017.862719] LKRG: WATCH: Updating pid 51035
[ 1017.862720] LKRG: WATCH: Inserting pid 51035
[ 1017.862803] LKRG: WATCH: Updating pid 51035
[ 1017.866792] LKRG: WATCH: Removing pid 51035
[ 1017.867112] LKRG: WATCH: Updating pid 51036
[ 1017.867113] LKRG: WATCH: Inserting pid 51036
[ 1017.867250] LKRG: WATCH: Updating pid 51036
[ 1017.869733] LKRG: WATCH: Removing pid 51036
[ 1017.871672] LKRG: WATCH: Hash of CPU metadata expected 0xdc35e1e9ad37a9c3 vs. actual 0xdc35e1e9ad37a9c3
[ 1017.871678] LKRG: WATCH: Exception table hash expected 0x77d60aac450b9323 vs. actual 0x77d60aac450b9323
[ 1017.872874] LKRG: WATCH: Updating pid 51037
[ 1017.877664] LKRG: WATCH: Removing pid 51034
[ 1017.878021] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 1017.881819] LKRG: WATCH: Inserting pid 51037
[ 1017.886820] LKRG: WATCH: Updating pid 51038
[ 1017.886840] LKRG: WATCH: copy[0xf] vs now[0xffffffe9] offset[9029634 | 0x89c802] symbol[dev_queue_xmit_nit]
[ 1017.886841] LKRG: WATCH: copy[0x1f] vs now[0xffffff81] offset[9029635 | 0x89c803] symbol[dev_queue_xmit_nit]
[ 1017.886842] LKRG: WATCH: copy[0x44] vs now[0x0] offset[9029636 | 0x89c804] symbol[dev_queue_xmit_nit]
[ 1017.886854] LKRG: WATCH: copy[0x66] vs now[0xffffffeb] offset[9039821 | 0x89efcd] symbol[netif_rx_internal]
[ 1017.886855] LKRG: WATCH: copy[0xffffff90] vs now[0x39] offset[9039822 | 0x89efce] symbol[netif_rx_internal]
[ 1017.886872] LKRG: WATCH: copy[0xf] vs now[0xffffffe9] offset[9054862 | 0x8a2a8e] symbol[__netif_receive_skb_core.constprop.0]
[ 1017.886873] LKRG: WATCH: copy[0x1f] vs now[0xffffffde] offset[9054863 | 0x8a2a8f] symbol[__netif_receive_skb_core.constprop.0]
[ 1017.886874] LKRG: WATCH: copy[0x44] vs now[0x2] offset[9054864 | 0x8a2a90] symbol[__netif_receive_skb_core.constprop.0]
[ 1017.886880] LKRG: WATCH: copy[0x66] vs now[0xffffffeb] offset[9059748 | 0x8a3da4] symbol[netif_receive_skb]
[ 1017.886881] LKRG: WATCH: copy[0xffffff90] vs now[0x38] offset[9059749 | 0x8a3da5] symbol[netif_receive_skb]
[ 1017.886883] LKRG: WATCH: copy[0xf] vs now[0xffffffe9] offset[9060663 | 0x8a4137] symbol[netif_receive_skb_list_internal]
[ 1017.886884] LKRG: WATCH: copy[0x1f] vs now[0xffffffaf] offset[9060664 | 0x8a4138] symbol[netif_receive_skb_list_internal]
[ 1017.886885] LKRG: WATCH: copy[0x44] vs now[0x1] offset[9060665 | 0x8a4139] symbol[netif_receive_skb_list_internal]
[ 1017.891323] LKRG: WATCH: Updating pid 51037
[ 1017.892422] LKRG: WATCH: _stext hash expected 0x4616d9c67c146959 vs. actual 0xf6343c558d7b8e90
[ 1017.895709] LKRG: WATCH: Inserting pid 51038
[ 1017.896533] LKRG: WATCH: _rodata hash expected 0x9dd37fb1edec5ce vs. actual 0x9dd37fb1edec5ce
[ 1017.896534] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 1017.896539] LKRG: WATCH: Hash from 'module list' => [0x59dbf96a4c98278d]
[ 1017.896544] LKRG: WATCH: Hash from 'module kobj(s)' => [0xb620a746161a0522]
[ 1017.896544] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 1018.000820] LKRG: WATCH: Removing pid 51037
[ 1018.009074] LKRG: WATCH: Updating pid 51038
[ 1018.010689] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 1018.021008] LKRG: WATCH: Updating pid 51039
[ 1018.034119] LKRG: WATCH: Removing pid 51038
[ 1018.043581] LKRG: WATCH: Inserting pid 51039
[ 1018.053736] LKRG: WATCH: Removing pid 51031
[ 1018.063915] LKRG: WATCH: Updating pid 51039
[ 1018.077167] LKRG: WATCH: Updating pid 51040
[ 1018.196261] LKRG: WATCH: Inserting pid 51040
[ 1018.200957] LKRG: WATCH: Updating pid 51040
[ 1018.200979] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 1018.200986] LKRG: WATCH: Removing pid 51039
[ 1018.201520] LKRG: WATCH: Updating pid 51041
[ 1018.201523] LKRG: WATCH: Inserting pid 51041
[ 1018.201870] LKRG: WATCH: Updating pid 51041
[ 1018.205553] LKRG: WATCH: Updating pid 51040
[ 1018.215289] LKRG: WATCH: Hash of CPU metadata expected 0xdc35e1e9ad37a9c3 vs. actual 0xdc35e1e9ad37a9c3
[ 1018.216327] LKRG: WATCH: Updating pid 51040
[ 1018.220880] LKRG: WATCH: Exception table hash expected 0x77d60aac450b9323 vs. actual 0x77d60aac450b9323
[ 1018.226377] LKRG: WATCH: Updating pid 51042
[ 1018.236299] LKRG: WATCH: _stext hash expected 0x4616d9c67c146959 vs. actual 0x4616d9c67c146959
[ 1018.244517] LKRG: WATCH: Inserting pid 51042
[ 1018.253166] LKRG: WATCH: _rodata hash expected 0x9dd37fb1edec5ce vs. actual 0x9dd37fb1edec5ce
[ 1018.259063] LKRG: WATCH: Updating pid 51042
[ 1018.263500] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 1018.273057] LKRG: WATCH: Removing pid 51042
[ 1018.277231] LKRG: WATCH: Hash from 'module list' => [0x59dbf96a4c98278d]
[ 1018.286441] LKRG: WATCH: Updating pid 51043
[ 1018.290790] LKRG: WATCH: Hash from 'module kobj(s)' => [0xb620a746161a0522]
[ 1018.298872] LKRG: WATCH: Inserting pid 51043
[ 1018.302311] LKRG: WATCH: Removing pid 51041
[ 1018.302435] LKRG: WATCH: Updating pid 51044
[ 1018.302436] LKRG: WATCH: Inserting pid 51044
[ 1018.302572] LKRG: WATCH: Updating pid 51044
[ 1018.303475] LKRG: ALIVE: System is clean

Notice the ALERTs this has in the middle. BTW, looks like we have a sign extension bug in those messages.

solardiz commented 1 year ago

One more example:

[ 1339.046841] LKRG: ALIVE: System is clean
[ 1339.953401] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 1340.027173] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 1341.654411] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 1341.703173] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 1341.885773] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 1344.600501] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 1344.616687] LKRG: WATCH: Hash of CPU metadata expected 0xdc35e1e9ad37a9c3 vs. actual 0xdc35e1e9ad37a9c3
[ 1344.626426] LKRG: WATCH: Exception table hash expected 0x77d60aac450b9323 vs. actual 0x77d60aac450b9323
[ 1344.641208] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[ 1344.655721] LKRG: WATCH: copy[0xf] vs now[0xffffffe9] offset[9029634 | 0x89c802] symbol[dev_queue_xmit_nit]
[ 1344.665811] LKRG: WATCH: copy[0x1f] vs now[0xffffff81] offset[9029635 | 0x89c803] symbol[dev_queue_xmit_nit]
[ 1344.675985] LKRG: WATCH: copy[0x44] vs now[0x0] offset[9029636 | 0x89c804] symbol[dev_queue_xmit_nit]
[ 1344.685563] LKRG: WATCH: copy[0x66] vs now[0xffffffeb] offset[9039821 | 0x89efcd] symbol[netif_rx_internal]
[ 1344.695651] LKRG: WATCH: copy[0xffffff90] vs now[0x39] offset[9039822 | 0x89efce] symbol[netif_rx_internal]
[ 1344.705750] LKRG: WATCH: copy[0xf] vs now[0xffffffe9] offset[9054862 | 0x8a2a8e] symbol[__netif_receive_skb_core.constprop.0]
[ 1344.717394] LKRG: WATCH: copy[0x1f] vs now[0xffffffde] offset[9054863 | 0x8a2a8f] symbol[__netif_receive_skb_core.constprop.0]
[ 1344.729466] LKRG: WATCH: copy[0x44] vs now[0x2] offset[9054864 | 0x8a2a90] symbol[__netif_receive_skb_core.constprop.0]
[ 1344.740611] LKRG: WATCH: copy[0x66] vs now[0xffffffeb] offset[9059748 | 0x8a3da4] symbol[netif_receive_skb]
[ 1344.750715] LKRG: WATCH: copy[0xffffff90] vs now[0x38] offset[9059749 | 0x8a3da5] symbol[netif_receive_skb]
[ 1344.760816] LKRG: WATCH: copy[0xf] vs now[0xffffffe9] offset[9060663 | 0x8a4137] symbol[netif_receive_skb_list_internal]
[ 1344.772054] LKRG: WATCH: copy[0x1f] vs now[0xffffffaf] offset[9060664 | 0x8a4138] symbol[netif_receive_skb_list_internal]
[ 1344.783372] LKRG: WATCH: copy[0x44] vs now[0x1] offset[9060665 | 0x8a4139] symbol[netif_receive_skb_list_internal]
[ 1344.798691] LKRG: WATCH: _stext hash expected 0x4616d9c67c146959 vs. actual 0xf6343c558d7b8e90
[ 1344.810655] LKRG: WATCH: _rodata hash expected 0x9dd37fb1edec5ce vs. actual 0x9dd37fb1edec5ce
[ 1344.819548] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 1344.827574] LKRG: WATCH: Hash from 'module list' => [0x375627ad13ce085b]
[ 1344.834647] LKRG: WATCH: Hash from 'module kobj(s)' => [0x6aa6d1d4ee4d016d]
[ 1344.841977] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[ 1348.900123] LKRG: WATCH: [JUMP_LABEL <batch mode>] New modifications => 5
[ 1359.960556] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 1359.971248] LKRG: WATCH: Hash of CPU metadata expected 0xdc35e1e9ad37a9c3 vs. actual 0xdc35e1e9ad37a9c3
[ 1359.981100] LKRG: WATCH: Exception table hash expected 0x77d60aac450b9323 vs. actual 0x77d60aac450b9323
[ 1359.995646] LKRG: WATCH: _stext hash expected 0x4616d9c67c146959 vs. actual 0x4616d9c67c146959
[ 1360.007741] LKRG: WATCH: _rodata hash expected 0x9dd37fb1edec5ce vs. actual 0x9dd37fb1edec5ce
[ 1360.016636] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 1360.024662] LKRG: WATCH: Hash from 'module list' => [0x375627ad13ce085b]
[ 1360.031747] LKRG: WATCH: Hash from 'module kobj(s)' => [0x6aa6d1d4ee4d016d]
[ 1360.039111] LKRG: ALIVE: System is clean
[ 1375.320659] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 1375.331388] LKRG: WATCH: Hash of CPU metadata expected 0xdc35e1e9ad37a9c3 vs. actual 0xdc35e1e9ad37a9c3
[ 1375.341244] LKRG: WATCH: Exception table hash expected 0x77d60aac450b9323 vs. actual 0x77d60aac450b9323
[ 1375.355698] LKRG: WATCH: _stext hash expected 0x4616d9c67c146959 vs. actual 0x4616d9c67c146959
[ 1375.367704] LKRG: WATCH: _rodata hash expected 0x9dd37fb1edec5ce vs. actual 0x9dd37fb1edec5ce
[ 1375.376604] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 1375.384711] LKRG: WATCH: Hash from 'module list' => [0x375627ad13ce085b]
[ 1375.391839] LKRG: WATCH: Hash from 'module kobj(s)' => [0x6aa6d1d4ee4d016d]
[ 1375.399204] LKRG: ALIVE: System is clean
[ 1390.680806] LKRG: WATCH: lkrg_dummy returned 3 vs. expected 3
[ 1390.691469] LKRG: WATCH: Hash of CPU metadata expected 0xdc35e1e9ad37a9c3 vs. actual 0xdc35e1e9ad37a9c3
[ 1390.701329] LKRG: WATCH: Exception table hash expected 0x77d60aac450b9323 vs. actual 0x77d60aac450b9323
[ 1390.715968] LKRG: WATCH: _stext hash expected 0x4616d9c67c146959 vs. actual 0x4616d9c67c146959
[ 1390.728044] LKRG: WATCH: _rodata hash expected 0x9dd37fb1edec5ce vs. actual 0x9dd37fb1edec5ce
[ 1390.736968] LKRG: WATCH: IOMMU table hash expected 0xffffffff vs. actual 0xffffffff
[ 1390.745026] LKRG: WATCH: Hash from 'module list' => [0x375627ad13ce085b]
[ 1390.752138] LKRG: WATCH: Hash from 'module kobj(s)' => [0x6aa6d1d4ee4d016d]
[ 1390.759523] LKRG: ALIVE: System is clean
solardiz commented 1 year ago

Also, on this system with these settings and after these events, unloading LKRG took half a minute, during which the system was unusable:

[ 1525.737551] LKRG: DYING: Unloading LKRG
[ 1525.761358] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 1525.770226] OOM killer disabled.
[ 1525.783593] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1525.790647] LKRG: WATCH: [FTRACE] Updating module's core .text section hash module[binfmt_misc : 0xffffffffc051d3c0]!
[ 1525.801396] LKRG: WATCH: [FTRACE] Updating module's core .text section hash module[lkrg : 0xffffffffc0892000]!
[ 1525.811568] LKRG: WATCH: [FTRACE] Updating module's core .text section hash module[intel_rapl_msr : 0xffffffffc0863400]!
[ 1525.822582] LKRG: WATCH: [FTRACE] Updating module's core .text section hash module[intel_rapl_common : 0xffffffffc0857580]!
[ 1525.833860] LKRG: WATCH: [FTRACE] Updating module's core .text section hash module[x86_pkg_temp_thermal : 0xffffffffc08f0180]!
[...~3500 lines skipped...]
[ 1563.531622] LKRG: WATCH: [FTRACE] Updating module's core .text section hash module[dm_region_hash : 0xffffffffc00e1040]!
[ 1563.542790] LKRG: WATCH: [FTRACE] Updating module's core .text section hash module[dm_log : 0xffffffffc00d7180]!
[ 1563.553224] LKRG: WATCH: [FTRACE] Updating module's core .text section hash module[dm_mod : 0xffffffffc00bae00]!
[ 1563.569494] LKRG: WATCH: Removing [kretprobe] <arch_jump_label_transform_apply> at 0xffffffffa262a260 nmissed[0]
[ 1563.586475] LKRG: WATCH: Removing [kretprobe] <ftrace_modify_all_code> at 0xffffffffa2805b00 nmissed[0]
[ 1563.610616] LKRG: WATCH: Removing [kretprobe] <ftrace_enable_sysctl> at 0xffffffffa280a5e0 nmissed[0]
[ 1563.632618] LKRG: WATCH: Removing [kretprobe] <arch_static_call_transform> at 0xffffffffa262ff00 nmissed[0]
[ 1563.660718] LKRG: WATCH: Removing [kretprobe] <lkrg_dummy> at 0xffffffffc0869480 nmissed[0]
[ 1563.678116] OOM killer enabled.
[ 1563.681510] Restarting tasks ... done.
[ 1563.694061] LKRG: DYING: LKRG unloaded

Looks like much of this time was spent on updating the .text section hash multiple times (I wonder why):

[ 1525.783593] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1526.695510] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1527.611617] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1528.527589] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1529.443513] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1530.358618] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1531.270899] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1532.189762] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1533.105657] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1534.019579] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1534.934715] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1535.849646] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1536.764584] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1537.684074] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1538.604979] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1539.520589] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1540.444981] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1541.360581] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1542.275619] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1543.190635] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1544.105575] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1545.020746] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1545.935670] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1546.851702] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1547.767537] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1548.683543] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1549.598762] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1550.515518] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1551.430554] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1552.346494] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1553.262579] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1554.178496] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1555.103870] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1556.026768] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1556.941770] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1557.856663] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1558.777807] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1559.692761] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1560.611657] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1561.761533] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!
[ 1562.673507] LKRG: WATCH: [FTRACE] Updating kernel core .text section hash!

Repeating this with almost no delay between the commands:

[root@c3-small-x86-01 lkrg]# insmod lkrg.ko kint_enforce=1 log_level=4
[root@c3-small-x86-01 lkrg]# rmmod lkrg

resulted in a similar system freeze again - roughly same duration and exactly 41 .text section hash updates again. When I drop log_level=4, no such lengthy freeze is seen:

[ 2134.857549] LKRG: DYING: Unloading LKRG
[ 2134.887824] Freezing user space processes ... (elapsed 0.000 seconds) done.
[ 2134.896685] OOM killer disabled.
[ 2135.598265] OOM killer enabled.
[ 2135.601707] Restarting tasks ... done.
[ 2135.606997] LKRG: DYING: LKRG unloaded
solardiz commented 1 year ago

BTW, looks like we have a sign extension bug in those messages.

We should get this fix in (tested on this same system, works):

+++ b/src/modules/integrity_timer/p_integrity_timer.c
@@ -280,8 +280,8 @@ void p_check_integrity(struct work_struct *p_work) {

    if (p_db.kernel_stext.p_hash != p_tmp_hash) {
 #if defined(P_LKRG_JUMP_LABEL_STEXT_DEBUG)
-      char *p_str1 = (unsigned char *)p_db.kernel_stext.p_addr;
-      char *p_str2 = (unsigned char *)p_db.kernel_stext_copy;
+      unsigned char *p_str1 = (unsigned char *)p_db.kernel_stext.p_addr;
+      unsigned char *p_str2 = (unsigned char *)p_db.kernel_stext_copy;
       char p_eh_buf[0x100];
 #endif
       /* We detected core kernel .text corruption - we are hacked and can't recover */
solardiz commented 1 year ago

BTW, looks like we have a sign extension bug in those messages.

We should get this fix in (tested on this same system, works)

I've just pushed this fix right into main as it's trivial and only for non-default debugging builds.

Adam-pi3 commented 1 year ago

I've just pushed the fix. Essentially, RHEL 9.1 (starting from kernels 5.14.0-162.12.1) but not RHEL9.0 (kernels 5.14.0-70.13.1) switch to use a new/modern text_poke_loc structure (so the layout is different). This was the reason of the problem and https://github.com/lkrg-org/lkrg/commit/0e46fc41a5b385ab89e8f444e26a8d5cf49d03e8 fixes the issue. I've verified RHEL 8.6, 8.7, 9.0 and 9.1 and the current logic seems to work fine.

solardiz commented 1 year ago

RHEL 9.1 (starting from kernels 5.14.0-162.12.1) but not RHEL9.0 (kernels 5.14.0-70.13.1) switch to use a new/modern text_poke_loc structure

Thank you for figuring this out and making the fix, @Adam-pi3.

I'm a bit concerned that the fix might break non-up-to-date RHEL 9.1 installs, in case the -162 kernel isn't the oldest one that has the RHEL version set to 9.1. There's little we can do about this, but we might want to know this in order to word the CHANGES entry appropriately.

Adam-pi3 commented 1 year ago

From my tests, when I installed RHEL9.0 (kernels 5.14.0-70.13.1) and try to update the kernel it will switch to 5.14.0-162.12.1 and automatically bump version to RHEL9.1. In other words, I don't think it is possible to have RHEL kernel version 9.1 with kernel lower than 5.14.0-162.12.1