Net: Implement deferred panic

solardiz commented 4 months ago

Nov 10, 2022

When LKRG decides to panic the kernel and we have networking enabled, the panic should be deferred until after we've at least tried sending the message out.

However, in such a configuration we should probably apply the previous level of enforcement (where applicable) right away, so that switching from that level to panic obviously does not weaken security. For example, kill the task right away (where applicable), then send the message that we're about to panic the kernel, then actually panic.
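To make the proposed ordering concrete, here is a rough userspace sketch in C. It is not LKRG code: the `deferred_panic_ops` structure, its hook names, and the demo stubs are all hypothetical, and a real implementation would call into the kernel rather than append letters to a trace. The one property it tries to show is that the panic step is only delayed by the send attempt and is never skipped, even when sending fails.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hooks; none of these names come from LKRG itself. */
struct deferred_panic_ops {
    void (*enforce_previous_level)(void *ctx); /* e.g. kill the offending task */
    bool (*try_send_message)(void *ctx);       /* true if the message went out */
    void (*do_panic)(void *ctx);               /* final step, always reached */
};

/*
 * Run the three steps in the order described above:
 * enforce first, then try to send, then panic.
 * Returns whether the message was reported as sent.
 */
static bool deferred_panic(const struct deferred_panic_ops *ops, void *ctx)
{
    bool sent = false;

    if (ops->enforce_previous_level)
        ops->enforce_previous_level(ctx);
    if (ops->try_send_message)
        sent = ops->try_send_message(ctx);
    ops->do_panic(ctx);
    return sent;
}

/* Demo stubs: append one letter per step to a short trace string. */
struct trace {
    char steps[8];
    size_t n;
    bool send_ok;
};

static void trace_add(struct trace *t, char c)
{
    if (t->n + 1 < sizeof(t->steps)) {
        t->steps[t->n++] = c;
        t->steps[t->n] = '\0';
    }
}

static void demo_enforce(void *ctx) { trace_add(ctx, 'K'); }

static bool demo_send(void *ctx)
{
    struct trace *t = ctx;

    trace_add(t, 'S');
    return t->send_ok;
}

static void demo_panic(void *ctx) { trace_add(ctx, 'P'); }
```

With the demo stubs, the trace reads K, S, P whether or not the send succeeds, which is the point: enforcement is never weaker than at the previous level, and the panic still happens.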

solardiz commented 4 months ago

Nov 23, 2022 (which was before we implemented kprobe self-test via 26f36ed495ffeb8befe73f5c5f152603ab311076)

With deferred panic not yet implemented, and testing with echo 0 > /sys/kernel/debug/kprobes/enabled, only the following gets transferred to the remote (I tested this twice, once with kernel.panic=-1 and once with kernel.panic=0, with the same result):

1669221616777553,1669221579179197,404779645,6,317,404779215,-;Kprobes globally disabled
1669221626697537,1669221589102050,414702497,2,318,414702399,-;LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly

whereas the full messages would be:

[  404.779215] Kprobes globally disabled
[  414.702399] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[  414.707913] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[  414.707949] LKRG: ALERT: BLOCK: Kernel: 1 checksums changed unexpectedly
[  414.707988] Kernel panic - not syncing: Kernel: 1 checksums changed unexpectedly

followed by a backtrace.
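For comparing the two listings, a small parser may help. The last four comma-separated fields of each transferred record match the /dev/kmsg layout (priority, sequence number, timestamp in microseconds, flags), and the timestamp field agrees with the dmesg timestamps above. The three leading fields are not explained in this thread, so this sketch keeps them as opaque numbers. The struct and function names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

struct remote_record {
    unsigned long long extra[3]; /* leading fields; meaning not stated here */
    unsigned int prio;           /* syslog priority, e.g. 6 = info, 2 = crit */
    unsigned long long seq;      /* kernel log sequence number */
    unsigned long long ts_usec;  /* kernel timestamp in microseconds */
    char flags[8];               /* "-" in the records shown */
    const char *msg;             /* points into the input line */
};

/* Parse one record; returns 0 on success, -1 on malformed input. */
static int parse_remote_record(const char *line, struct remote_record *r)
{
    int used = 0;
    const char *semi;

    if (sscanf(line, "%llu,%llu,%llu,%u,%llu,%llu,%7[^;,]%n",
               &r->extra[0], &r->extra[1], &r->extra[2],
               &r->prio, &r->seq, &r->ts_usec, r->flags, &used) != 7)
        return -1;

    /* Skip any further comma-separated fields up to the ';' separator. */
    semi = strchr(line + used, ';');
    if (!semi)
        return -1;
    r->msg = semi + 1;
    return 0;
}

/* Render the record the way dmesg shows it: "[seconds.micros] message". */
static int format_dmesg(const struct remote_record *r, char *buf, size_t len)
{
    return snprintf(buf, len, "[%5llu.%06llu] %s",
                    r->ts_usec / 1000000ULL, r->ts_usec % 1000000ULL, r->msg);
}
```

Applied to the two records above, this yields the first two lines of the full log, which makes it plain that the three later messages (the two "1 checksums changed" lines and the panic line itself) never reached the remote.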

As to deferred panic, I am wondering whether we should limit that to LKRG-induced panics or maybe hook into the kernel's panic code (or something it calls) and similarly defer non-LKRG panics (only the final stopping/rebooting, but not the messages). For example, we could have a wait-until-sent-or-timeout loop in a callback we'd register with kmsg_dump_register (an exported symbol across our supported kernels).
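The wait-until-sent-or-timeout idea can be sketched independently of where it would be hooked in. The following is plain C, not kernel code: `wait_ops`, `wait_sent_or_timeout`, and the fake clock are hypothetical names, and the time source and "is it sent yet" check are passed in as callbacks because the thread does not say what LKRG would use for either.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callbacks; a real caller would supply its own. */
struct wait_ops {
    bool (*is_sent)(void *ctx);    /* has the pending message gone out? */
    uint64_t (*now_ms)(void *ctx); /* monotonic time in milliseconds */
    void (*relax)(void *ctx);      /* brief pause between polls; may be null */
};

/*
 * Poll until the message is reported sent or timeout_ms has elapsed.
 * Returns true if sent, false on timeout. Either way the caller
 * proceeds afterwards, so the final stop/reboot is only delayed.
 */
static bool wait_sent_or_timeout(const struct wait_ops *ops, void *ctx,
                                 uint64_t timeout_ms)
{
    uint64_t start = ops->now_ms(ctx);

    for (;;) {
        if (ops->is_sent(ctx))
            return true;
        if (ops->now_ms(ctx) - start >= timeout_ms)
            return false;
        if (ops->relax)
            ops->relax(ctx);
    }
}

/* Demo: a fake clock that advances 10 ms per poll. */
struct fake_net {
    uint64_t clock_ms;   /* current fake time */
    uint64_t done_at_ms; /* time at which the send completes */
};

static bool fake_is_sent(void *ctx)
{
    const struct fake_net *n = ctx;

    return n->clock_ms >= n->done_at_ms;
}

static uint64_t fake_now_ms(void *ctx)
{
    const struct fake_net *n = ctx;

    return n->clock_ms;
}

static void fake_relax(void *ctx)
{
    struct fake_net *n = ctx;

    n->clock_ms += 10;
}

static const struct wait_ops fake_ops = { fake_is_sent, fake_now_ms, fake_relax };
```

Checking `is_sent` before the deadline means a message that is already out costs no delay, and the timeout bounds how long a broken network path can hold up the panic.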

solardiz commented 4 months ago

Nov 28, 2022

> we could have a wait-until-sent-or-timeout loop in a callback we'd register with kmsg_dump_register

I've just experimented with this. First, per code review, those callbacks are invoked too late for us: after SMP has been shut down, whereas we'd want our network sending code to run on another CPU because the panicking one is in an unsuitable state (it was already in an unknown state, and is further modified by the panic in progress). Second, in my testing the callback is somehow not called at all, which I haven't figured out yet.

lkrg-org / lkrg

Net: Implement deferred panic #314