lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org

Intermittent NULL pointer dereference after module insert on Ubuntu 5.15.0-18-generic #156

Open solardiz opened 2 years ago

solardiz commented 2 years ago

https://github.com/lkrg-org/lkrg/runs/5029333618?check_suite_focus=true

[    5.630964] systemd[1]: Started Journal Service.
[  OK  ] Started Journal Service.
         Starting Create Volatile Files and Directories...
[  OK  ] Finished Create Volatile Files and Directories.
[  OK  ] Finished dracut cmdline hook.
         Starting dracut pre-udev hook...
[    6.578890] p_lkrg: loading out-of-tree module taints kernel.
[    6.580457] p_lkrg: module verification failed: signature and/or required key missing - tainting kernel
[    6.677442] [p_lkrg] Loading LKRG...
[    6.678082] [p_lkrg] System does NOT support SMEP. LKRG can't enforce SMEP validation :(
[    6.680660] [p_lkrg] System does NOT support SMAP. LKRG can't enforce SMAP validation :(
[    6.698104] Freezing user space processes ... (elapsed 0.001 seconds) done.
[    6.701595] OOM killer disabled.
[    7.691436] [p_lkrg] [kretprobe] register_kretprobe() for <ovl_create_or_link> failed! [err=-2]
[    7.692863] [p_lkrg] Can't hook 'ovl_create_or_link' function. This is expected if you are not using OverlayFS.
[    7.816497] BUG: kernel NULL pointer dereference, address: 0000000000000000
[    7.819667] #PF: supervisor instruction fetch in kernel mode
[    7.819667] #PF: error_code(0x0010) - not-present page
[    7.819667] PGD 0 P4D 0 
[    7.819667] Oops: 0010 [#1] SMP NOPTI
ABORT
[    7.819667] CPU: 0 PID: 12 Comm: rcu_sched Tainted: G           OE     5.15.0-18-generic #18-Ubuntu
[    7.819667] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[    7.819667] RIP: 0010:0x0
[    7.819667] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[    7.819667] RSP: 0018:ffffba53c006fe08 EFLAGS: 00000086
[    7.819667] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    7.819667] RDX: 0000000000000000 RSI: 0000000000000286 RDI: ffff9d8f7ec20c00
[    7.819667] RBP: ffffba53c006fe60 R08: 0000000000000000 R09: 0000000000000000
[    7.819667] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000fffee23f
[    7.819667] R13: ffffba53c006fe18 R14: ffff9d8f41205d00 R15: ffffffffb8f7f240
[    7.819667] FS:  0000000000000000(0000) GS:ffff9d8f7ec00000(0000) knlGS:0000000000000000
[    7.819667] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.819667] CR2: ffffffffffffffd6 CR3: 00000000025ce000 CR4: 00000000000006f0
[    7.819667] Call Trace:
[    7.819667]  <TASK>
[    7.819667]  elfcorehdr_read+0x40/0x40
[    7.819667]  ? __bpf_trace_tick_stop+0x10/0x10
[    7.819667]  rcu_gp_fqs_loop+0xe5/0x320
[    7.819667]  rcu_gp_kthread+0xa7/0x130
[    7.819667]  kthread+0x11e/0x140
[    7.819667]  ? rcu_gp_init+0x5f0/0x5f0
[    7.819667]  ? set_kthread_struct+0x50/0x50
[    7.819667]  ret_from_fork+0x22/0x30
[    7.819667]  </TASK>
[    7.819667] Modules linked in: p_lkrg(OE+) dm_mirror dm_region_hash dm_log virtio_rng autofs4
[    7.819667] CR2: 0000000000000000
[    7.819667] ---[ end trace 8b4415ca7fe1f2d2 ]---
[    7.819667] RIP: 0010:0x0
[    7.819667] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[    7.819667] RSP: 0018:ffffba53c006fe08 EFLAGS: 00000086
[    7.819667] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    7.819667] RDX: 0000000000000000 RSI: 0000000000000286 RDI: ffff9d8f7ec20c00
[    7.819667] RBP: ffffba53c006fe60 R08: 0000000000000000 R09: 0000000000000000
[    7.819667] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000fffee23f
[    7.819667] R13: ffffba53c006fe18 R14: ffff9d8f41205d00 R15: ffffffffb8f7f240
[    7.819667] FS:  0000000000000000(0000) GS:ffff9d8f7ec00000(0000) knlGS:0000000000000000
[    7.819667] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.819667] CR2: ffffffffffffffd6 CR3: 00000000025ce000 CR4: 00000000000006f0
[    7.819667] Kernel panic - not syncing: Fatal exception
[    7.819667] Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Adam-pi3 commented 2 years ago

Can we try to load LKRG with log_level=3? From this log we can't tell which stage of loading is problematic.

solardiz commented 2 years ago

I'm a bit concerned that if we use non-default/debugging settings for LKRG in many of the CI runs, we might miss an issue that only affects real/default/non-debugging uses. Anyway, yes, I suppose we can update the CI setup to enable log_level=3 for mkosi-boot (jammy). No idea whether the problem will occur again, and if it does, whether it will be with this same distro or some other. I re-ran this failed job, and it succeeded. I think we'll soon need to set up some stress-testing using settings similar to what we use for CI, but on our own or rented hardware, so that we can run those jobs in tight loops as necessary to reproduce issues promptly.
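
The tight-loop idea above could be sketched roughly as follows. This is only an illustration, not part of the LKRG CI: the function names, the run limit, and the timeout are assumptions, and the failure markers are simply copied from the console log in this issue. The command to run is whatever boots the test image and prints its console output.

```python
import re
import subprocess

# Failure markers copied from the console log in this issue.
FAILURE_PATTERNS = [
    re.compile(r"BUG: kernel NULL pointer dereference"),
    re.compile(r"Oops: [0-9a-f]+ \[#\d+\]"),
    re.compile(r"Kernel panic - not syncing"),
]


def find_kernel_failure(log_text):
    """Return the first log line that matches a failure marker, or None."""
    for line in log_text.splitlines():
        if any(p.search(line) for p in FAILURE_PATTERNS):
            return line.strip()
    return None


def run_until_failure(cmd, max_runs=100, timeout=600):
    """Run cmd repeatedly (hypothetical helper).

    Returns (run_number, matching_line) for the first run whose output
    contains a failure marker, or None if all runs look clean.
    """
    for run in range(1, max_runs + 1):
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                timeout=timeout,
            )
            output = proc.stdout or ""
        except subprocess.TimeoutExpired as exc:
            # A hung boot may still have printed an oops before the timeout.
            output = exc.stdout or ""
            if isinstance(output, bytes):
                output = output.decode(errors="replace")
        hit = find_kernel_failure(output)
        if hit is not None:
            return run, hit
    return None
```

A call such as `run_until_failure(["./boot-test.sh"], max_runs=500)` (the script name is a placeholder) would stop at the first run that reproduces the crash and report which iteration it was, which gives a rough idea of how rare the failure is.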

Adam-pi3 commented 2 years ago

I think this issue is another case of https://github.com/lkrg-org/lkrg/issues/87, which is why I think log_level=3 may give us more information. It doesn't look related to LKRG itself.