Closed mcbridematt closed 2 years ago
I can almost consistently trigger this by running 2 x iperf3 flows through the machine simultaneously (Two ports "client", two ports "server")
I can't get a exact backtrace from the minidump (the stack disappears due to the fault), but I believe the problem is the call to fq->consume
on line 2565 in 0e0f350b0d5346a0761c0e485893ea36bce0b742.
This is inside dpaa2_ni_consume_frames
which appears to have been inlined inside dpaa2_ni_poll_task
.
I can upload the vmcore / minidump if you would like.
fq->consume
points to one of those three functions:
https://github.com/mcusim/freebsd-src/blob/0e0f350b0d5346a0761c0e485893ea36bce0b742/sys/dev/dpaa2/dpaa2_ni.c#L1098-L1106
https://github.com/mcusim/freebsd-src/blob/0e0f350b0d5346a0761c0e485893ea36bce0b742/sys/dev/dpaa2/dpaa2_ni.c#L1109-L1127
https://github.com/mcusim/freebsd-src/blob/0e0f350b0d5346a0761c0e485893ea36bce0b742/sys/dev/dpaa2/dpaa2_ni.c#L1129-L1138
Could you add a debug output to understand which one was called at the moment when the kernel panicked?
I have remote (serial) debug working with kgdb now, so I can inspect the variables in dpaa2_ni_consume_frames
when the panic occurs.
fq is 0 (NULL) which is why this results in an exception/abort. In fact, most values of dq->fdr (where fq) is are:
print *dq
$2 = {{common = {verb = 96 '`', _reserved = "\003", '\000' <repeats 61 times>}, fdr = {desc = {verb = 96 '`', stat = 3 '\003', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 0 '\000', fqid = 0, _reserved1 = 0, fq_byte_cnt = 0,
fq_frm_cnt = 0, fqd_ctx = 0}, fd = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 3 '\003', state = 0 '\000', _reserved = 0 '\000',
rid_tok = 0, ctx = 0}}}
(kgdb) bt
#0 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/freebsd-src/sys/kern/subr_kdb.c:506
#1 0xffff000000468ce8 in vpanic (fmt=0xffff00000091e329 "vm_fault failed: %lx error %d", ap=...) at /usr/src/freebsd-src/sys/kern/kern_shutdown.c:963
#2 0xffff000000468a58 in panic (fmt=0x12 <error: Cannot access memory at address 0x12>) at /usr/src/freebsd-src/sys/kern/kern_shutdown.c:899
#3 0xffff000000776150 in data_abort (td=0xffffa0000c807000, frame=0xffff0000f9fd66a0, esr=2516582404, far=<optimized out>, lower=0) at /usr/src/freebsd-src/sys/arm64/arm64/trap.c:364
#4 0xffff0000007750e0 in do_el1h_sync (td=0xffffa0000c807000, frame=0xffff0000f9fd66a0) at /usr/src/freebsd-src/sys/arm64/arm64/trap.c:462
#5 <signal handler called>
#6 0xffff0000007d1578 in dpaa2_ni_consume_frames (chan=0xffff000101d8e000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2565
#7 dpaa2_ni_poll_task (arg=0xffff000101d8e000, count=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2423
#8 0xffff0000004c9718 in taskqueue_run_locked (queue=queue@entry=0xffffa0000a85c300) at /usr/src/freebsd-src/sys/kern/subr_taskqueue.c:477
#9 0xffff0000004ca6c8 in taskqueue_thread_loop (arg=arg@entry=0xffff000101d86798) at /usr/src/freebsd-src/sys/kern/subr_taskqueue.c:794
#10 0xffff00000041fdf8 in fork_exit (callout=0xffff0000004ca5fc <taskqueue_thread_loop>, arg=0xffff000101d86798, frame=0xffff0000f9fd6990) at /usr/src/freebsd-src/sys/kern/kern_fork.c:1102
#11 <signal handler called>
(kgdb) frame 6
#6 0xffff0000007d1578 in dpaa2_ni_consume_frames (chan=0xffff000101d8e000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2565
2565 KASSERT((
(kgdb) info locals
retries = 1
fq = 0x0
rc = 37
dq = 0xffff0001020d8000
fd = 0xffff0001020d8020
frames = <optimized out>
(kgdb) print *dq
$8 = {{common = {verb = 96 '`', _reserved = "\003", '\000' <repeats 61 times>}, fdr = {desc = {verb = 96 '`', stat = 3 '\003', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 0 '\000', fqid = 0, _reserved1 = 0, fq_byte_cnt = 0,
fq_frm_cnt = 0, fqd_ctx = 0}, fd = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 3 '\003', state = 0 '\000', _reserved = 0 '\000',
rid_tok = 0, ctx = 0}}}
(kgdb) print *fd
$9 = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}
I added a KASSERT((fq != 0),...)
just before fq->consume to confirm this, and the assert trigged as expected:
panic: dpaa2_ni_consume_frames: fq is 0
cpuid = 7
time = 1652643273
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
kdb_backtrace() at kdb_backtrace+0x38
vpanic() at vpanic+0x17c
panic() at panic+0x44
dpaa2_ni_poll_task() at dpaa2_ni_poll_task+0x2c0
taskqueue_run_locked() at taskqueue_run_locked+0xac
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
(kgdb) frame 3
#3 0xffff0000007d167c in dpaa2_ni_consume_frames (chan=0xffff0000fca59000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2565
2565 KASSERT((fq != 0), ("%s: fq is 0", __func__));
(kgdb) print fq
$1 = <optimized out>
(kgdb) info locals
retries = 0
fq = <optimized out>
rc = 37
dq = 0xffff0000fcda1000
fd = 0xffff0000fcda1020
frames = <optimized out>
(kgdb) print *dq
$2 = {{common = {verb = 96 '`', _reserved = "\003", '\000' <repeats 61 times>}, fdr = {desc = {verb = 96 '`', stat = 3 '\003', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 0 '\000', fqid = 0, _reserved1 = 0, fq_byte_cnt = 0,
fq_frm_cnt = 0, fqd_ctx = 0}, fd = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 3 '\003', state = 0 '\000', _reserved = 0 '\000',
rid_tok = 0, ctx = 0}}}
(kgdb) print *fd
$3 = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}
Hmm, verb = 96 (0x60)
means Frame Dequeue Response, but stat = 3
means that the FQ is empty and no frames were delivered to this Volatile Dequeue command. It's just a null response I haven't checked. Please, try 173aa2a1fe738a17dfa5996f4a59032542fb192b.
I think that fixed it! I've run the 4-port iperf stresstest for several long intervals, including a 10 hour one that just completed.
dpni0 1.5K <Link#1> 00:0a:fa:24:24:a3 0 0 0 42G 0 0 0 0
dpni1 1.5K <Link#2> 00:0a:fa:24:24:a4 0 0 0 43G 0 0 0 0
dpni2 1.5K <Link#3> 00:0a:fa:24:24:a5 0 0 0 2.4T 0 0 0 0
dpni3 1.5K <Link#4> 00:0a:fa:24:24:a6 0 0 0 2.4T 0 0 0 0
bridg 1.5K <Link#12> 58:9c:fc:10:0f:76 1.4G 0 0 85G 3.6G 0 4.9T 0
The next panic I have come across is #8 but that appears to be more sporadic.
Commit: 0e7b9be69
This happened to me just now. Same environment as #3.
I'm not sure if the textdump is helpful or not. I will try and setup my install to properly generate a minidump