Open jfernandez opened 1 month ago
Could you please share:
The output of running the following in drgn:
task = prog.crashed_thread().object
print("pid =", task.pid)
cpu = task_cpu(task)
print("cpu =", cpu)
print("on_cpu =", task.on_cpu)
print("curr =", cpu_curr(prog, cpu).pid)
print(prog.stack_trace(task.pid))
The output of eu-readelf -n $your_vmcore. You may need to install elfutils.
Ultimately, what I'm trying to do is understand why the kernel could not read ffffffffffffff20.
scripts/decodecode in the kernel tree disassembles the crashing instruction to:
4d 8b b4 d6 20 ff ff mov -0xe0(%r14,%rdx,8),%r14
And -0xe0 is 0xffffffffffffff20. Based on r14 and rdx in the register dump, I'm going to guess there's something like container_of(foo, ...)->bar in the code and foo is NULL.
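As a quick sanity check of that arithmetic, here is a minimal sketch of how the operand -0xe0(%r14,%rdx,8) resolves to an address. The register values are an assumption (both taken to be 0, since the register dump itself is not reproduced here), and effective_address is a helper name made up for this illustration.

```python
MASK64 = (1 << 64) - 1  # x86-64 address arithmetic wraps at 64 bits


def effective_address(base, index, scale, disp):
    """Resolve an AT&T-syntax operand disp(base, index, scale)."""
    return (base + index * scale + disp) & MASK64


# Assumption: r14 (base) and rdx (index) were both 0 when the fault happened.
addr = effective_address(base=0, index=0, scale=8, disp=-0xe0)
print(hex(addr))  # 0xffffffffffffff20
```

With a zero base and index, the negative displacement alone wraps around to the top of the address space, which matches the address the kernel reported it could not read.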
Oh, a couple more things to check in drgn:
prog.symbol("mt792x_mac_link_bss_remove")
prog["mt792x_mac_link_bss_remove"]
I'm wondering if drgn couldn't find the debugging symbols for the driver module.
Full command:
sudo drgn -c /var/crash/kdumpst/crash/vmcore.202408031440 -s /usr/src/debug/linux-upstream/vmlinux
Full output. FYI, originally my modules were compressed, and I disabled compression.
❯ sudo drgn -c /var/crash/kdumpst/crash/vmcore.202408031440 -s /usr/src/debug/linux-upstream/vmlinux
drgn 0.0.27 (using Python 3.12.4, elfutils 0.191, with libkdumpfile)
warning: missing some debugging symbols (see https://drgn.readthedocs.io/en/latest/getting_debugging_symbols.html):
/lib/modules/6.11.0-rc1-00001-gb30fffe019cc-dirty/kernel/fs/fat/vfat.ko (could not get section addresses: could not read memory from kdump: Excluded page: 0xffff9a2e84c5be70)
/lib/modules/6.11.0-rc1-00001-gb30fffe019cc-dirty/kernel/drivers/input/joydev.ko (could not get section addresses: could not read memory from kdump: Excluded page: 0xffff9a2e8433e750)
/lib/modules/6.11.0-rc1-00001-gb30fffe019cc-dirty/kernel/fs/fat/fat.ko (could not get section addresses: could not read memory from kdump: Excluded page: 0xffff9a2e97df8080)
/lib/modules/6.11.0-rc1-00001-gb30fffe019cc-dirty/kernel/drivers/input/mousedev.ko (could not get section addresses: could not read memory from kdump: Excluded page: 0xffff9a2e816263c0)
/lib/modules/6.11.0-rc1-00001-gb30fffe019cc-dirty/kernel/sound/soc/amd/ps/snd-soc-ps-mach.ko (could not get section addresses: could not read memory from kdump: Excluded page: 0xffff9a2e82140f50)
... 148 more
For help, type help(drgn).
>>> import drgn
>>> from drgn import FaultError, NULL, Object, cast, container_of, execscript, offsetof, reinterpret, sizeof, stack_trace
>>> from drgn.helpers.common import *
>>> from drgn.helpers.linux import *
>>>
The output of running the following in drgn
>>> task = prog.crashed_thread().object
>>> print("pid =", task.pid)
pid = (pid_t)905
>>> cpu = task_cpu(task)
>>> print("cpu =", cpu)
cpu = 15
>>> print("on_cpu =", task.on_cpu)
on_cpu = (int)1
>>> print("curr =", cpu_curr(prog, cpu).pid)
curr = (pid_t)905
>>> print(prog.stack_trace(task.pid))
#0 0xffffffffc10ff5f3
#1 0x0
#2 0x0
#3 0x0
#4 0x0
#5 0x203ff000f0000
The output of eu-readelf -n $your_vmcore. You may need to install elfutils.
❯ ls
vmcore.202408031440 vmcore.202408031605
❯ sudo eu-readelf -n vmcore.202408031440
eu-readelf: failed reading 'vmcore.202408031440': not a valid ELF file
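The eu-readelf failure is consistent with the vmcore being in makedumpfile's kdump-compressed format rather than ELF; as far as I know, makedumpfile only writes ELF when given -E. A rough way to tell the two apart is to look at the first bytes of the file. The sketch below assumes the kdump-compressed header starts with the 8-byte signature "KDUMP   " (with three trailing spaces); the function names are invented for this example.

```python
ELF_MAGIC = b"\x7fELF"
# Assumption: kdump-compressed dumps begin with this 8-byte signature.
KDUMP_SIGNATURE = b"KDUMP   "


def classify_dump_header(header: bytes) -> str:
    """Guess the dump format from the first few bytes of the file."""
    if header.startswith(ELF_MAGIC):
        return "elf"
    if header.startswith(KDUMP_SIGNATURE):
        return "kdump-compressed"
    return "unknown"


def classify_dump_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return classify_dump_header(f.read(8))
```

Calling classify_dump_file on the vmcore path would be expected to report "kdump-compressed" here, in which case ELF tools cannot read its notes but drgn built with libkdumpfile still can.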
I'm wondering if drgn couldn't find the debugging symbols for the driver module.
>>> prog.symbol("mt792x_mac_link_bss_remove")
Traceback (most recent call last):
  File "<console>", line 1, in <module>
LookupError: could not find symbol with name 'mt792x_mac_link_bss_remove'
>>> prog["mt792x_mac_link_bss_remove"]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
KeyError: 'mt792x_mac_link_bss_remove'
And yes, you are absolutely right: mconf->vif is NULL below. I tracked it down using print statements, but I would love to be able to diagnose this with drgn:
struct ieee80211_vif *vif = container_of((void *)mconf->vif,
struct ieee80211_vif, drv_priv);
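To illustrate why a NULL argument to container_of produces an address near the top of the address space, here is a small sketch using ctypes. The FakeVif layout is entirely hypothetical: it is not the real struct ieee80211_vif, and its padding was chosen only so that drv_priv lands at offset 0xe0 and the arithmetic lines up with the faulting address.

```python
import ctypes

MASK64 = (1 << 64) - 1


class FakeVif(ctypes.Structure):
    # Hypothetical layout for illustration only, NOT struct ieee80211_vif.
    _fields_ = [
        ("link_conf", ctypes.c_uint64 * 4),   # stand-in for the member being read
        ("padding", ctypes.c_uint8 * 0xC0),   # filler so drv_priv sits at 0xe0
        ("drv_priv", ctypes.c_uint8 * 16),
    ]


def container_of(ptr, struct_type, member):
    """Subtract the member's offset from ptr, wrapping like 64-bit pointer math."""
    return (ptr - getattr(struct_type, member).offset) & MASK64


vif = container_of(0, FakeVif, "drv_priv")  # what the C macro yields for NULL
print(hex(vif))  # 0xffffffffffffff20
```

Dereferencing any member through such a pointer faults, because the computed struct address is just the negated member offset rather than a real object.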
Ok these lines are the key:
/lib/modules/6.11.0-rc1-00001-gb30fffe019cc-dirty/kernel/fs/fat/vfat.ko (could not get section addresses: could not read memory from kdump: Excluded page: 0xffff9a2e84c5be70)
Either makedumpfile/kdumpst excluded some memory that drgn needs to find kernel modules, or drgn has a bug. I can dig in next week. If there's any chance you could share the core dump and your /lib/modules/6.11.0-rc1-00001-gb30fffe019cc-dirty/ directory, that would help, but I totally understand if you're not able to. Feel free to reach me via email if so.
These are my kdump settings for kdumpst:
# Kdump controlling settings
# Currently we only do local storage log collection (no network/iscsi dumps).
# If FULL_COREDUMP is !=0, we collect a full compressed vmcore, which might
# require a lot of disk space. The MAKEDUMPFILE_*_CMD settings refer to
# tunings on makedumpfile - we rely on zstd compression and maximum page
# exclusion for the full vmcore, mimic'ing Debian/Ubuntu kdump. We also
# base on Debian/Ubuntu for the KDUMP_CMDLINE_APPEND option - this contains
# the kernel parameters we append in the /proc/cmdline for the kdump kernel;
# the most important parameters are nr_cpus=1 (to save RAM memory usage and
# avoid some potential issues with SMP) and reset_devices (some drivers
# rely on that for proper kdump).
FULL_COREDUMP=1
MAKEDUMPFILE_COREDUMP_CMD="-z -d 31"
MAKEDUMPFILE_DMESG_CMD="--dump-dmesg"
KDUMP_APPEND_CMDLINE="panic=-1 oops=panic fsck.mode=force fsck.repair=yes nr_cpus=1 reset_devices"
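For context on the -d 31 setting above: the makedumpfile dump level is a bitmask of page types to leave out of the dump. The sketch below decodes it using the bit meanings as I understand them from the makedumpfile man page; the function name and the exact label strings are my own.

```python
# Dump-level bits as described in the makedumpfile man page (labels paraphrased).
DUMP_LEVEL_BITS = {
    1: "zero pages",
    2: "non-private cache pages",
    4: "private cache pages",
    8: "user process data pages",
    16: "free pages",
}


def excluded_page_types(dump_level: int) -> list:
    """Return the page types that a given -d value asks makedumpfile to exclude."""
    if not 0 <= dump_level <= 31:
        raise ValueError("dump level must be between 0 and 31")
    return [name for bit, name in DUMP_LEVEL_BITS.items() if dump_level & bit]


for name in excluded_page_types(31):
    print("excluded:", name)
```

A level of 31 sets all five bits, so it is the most aggressive exclusion setting, which is why filtering is a plausible suspect when drgn reports excluded pages.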
I will email you the core dump and the modules dir. Thank you!
Since this is a 6.11 based kernel, you may want to check your makedumpfile version. The following commit is required in order to include slab pages: https://github.com/makedumpfile/makedumpfile/commit/bad2a7c4fa75d37a41578441468584963028bdda
There is no makedumpfile release containing this commit yet, so you'd need to build it yourself.
@brenns10 thanks for surfacing that. I’ll build makedumpfile from HEAD and report back.
The CachyOS folks published a new version of makedumpfile with the patch. Unfortunately, this didn't seem to fix the issue: I still got the warnings about the kernel modules and no stack trace.
One thing I didn't catch before was this warning about the kernel not being supported, and the dump being possibly incomplete. It's unclear to me if this is kdump or kdumpst.
It looks like that is coming from makedumpfile: https://github.com/makedumpfile/makedumpfile/blob/900190de6b67b2de410cfc8023c1b198a416ceb3/makedumpfile.c#L1185
I went ahead and used virtme-ng to build v6.11-rc5, and I also compiled makedumpfile from HEAD. I ran with:
vng --rw --user root -a crashkernel=256M
Within the VM I loaded a panic kernel and then triggered a panic:
kexec -p arch/x86/boot/bzImage --append="$(cat /proc/cmdline)"
echo c >/proc/sysrq-trigger
The second kernel booted and then I used my build of makedumpfile to create a vmcore:
bash-5.1# /home/stepbren/repos/makedumpfile/makedumpfile -c -d 31 /proc/vmcore vmcore.img
The kernel version is not supported.
The makedumpfile operation may be incomplete.
Checking for memory holes : [100.0 %]
Excluding unnecessary pages : [100.0 %]
Copying data : [100.0 %] eta: 0s
The dumpfile is saved to vmcore.img.
makedumpfile Completed.
With drgn 0.0.27 I had no trouble loading that vmcore and reading the crashed thread:
$ drgn -c vmcore.img -s vmlinux
drgn 0.0.27 (using Python 3.9.18, elfutils 0.190, with libkdumpfile)
For help, type help(drgn).
>>> import drgn
>>> from drgn import FaultError, NULL, Object, cast, container_of, execscript, offsetof, reinterpret, sizeof, stack_trace
>>> from drgn.helpers.common import *
>>> from drgn.helpers.linux import *
>>> prog.crashed_thread()
<_drgn.Thread object at 0x7f4235e8ccb0>
>>> prog.crashed_thread().stack_trace()
#0 crash_setup_regs (./arch/x86/include/asm/kexec.h:114:15)
#1 __crash_kexec (kernel/crash_core.c:119:4)
#2 panic (kernel/panic.c:373:3)
#3 sysrq_handle_crash (drivers/tty/sysrq.c:154:2)
#4 __handle_sysrq (drivers/tty/sysrq.c:612:4)
#5 write_sysrq_trigger (drivers/tty/sysrq.c:1181:4)
#6 pde_write (fs/proc/inode.c:334:10)
#7 proc_reg_write (fs/proc/inode.c:346:8)
#8 vfs_write (fs/read_write.c:588:9)
#9 ksys_write (fs/read_write.c:643:9)
#10 do_syscall_x64 (arch/x86/entry/common.c:52:14)
#11 do_syscall_64 (arch/x86/entry/common.c:83:7)
#12 entry_SYSCALL_64+0xb0/0x14d (arch/x86/entry/entry_64.S:121)
#13 0x7f86f150cad7
>>>
Obviously that's not to say that you're not still having an issue, but I don't think the issue is anywhere in general makedumpfile or drgn support for kernel 6.11. If you're still having an issue and you're comfortable sharing the core dump privately, I could probably take a look and see if I notice anything odd about it.
The "kernel version is not supported" message is definitely just a preemptive message that makedumpfile prints for kernels whose version is greater than the max they have tested. It doesn't indicate that anything has actually gone wrong, just that you're in uncharted territory.
I'm debugging a crash at boot time with a wifi driver. I captured the kdump and dmesg output with kdumpst, see below. The call trace shows that the failed task was 905.
I used drgn to load this vmcore with the corresponding vmlinux, and I can't see the same stack trace as the dmesg log:
Ultimately, what I'm trying to do is understand why the kernel could not read ffffffffffffff20. I've also tried to read that memory directly:
dmesg output for the crash: