mcusim / freebsd-src

sys/dev/dpaa2 drivers work-in-progress
https://www.FreeBSD.org/
Other
4 stars 3 forks source link

Panic in dpaa2_ni_poll_task #4

Closed mcbridematt closed 2 years ago

mcbridematt commented 2 years ago

Commit: 0e7b9be69

This happened to me just now. Same environment as #3.

I'm not sure if the textdump is helpful or not. I will try and setup my install to properly generate a minidump


Fatal data abort:
  x0: ffff00011cd4c000
  x1:                0
  x2: ffff00011d094020
  x3:              152
  x4:                0
  x5:                d
  x6: 9c58ea23fe5fba08
  x7:    8ebb800fc9c58
  x8: ffff00011d094000
  x9: ffff00011d094000
 x10:                3
 x11:                3
 x12:                0
 x13:                0
 x14:                0
 x15:                8
 x16: ffff00016862ad48 (__stop_set_modmetadata_set + 448)
 x17: ffff00000059264c (if_inc_counter + 0)
 x18: ffff00010e3b1840
 x19: ffff00011cd4c000
 x20: ffff00011c9f1000
 x21: ffffa00002d4ec00
 x22: ffff00000091a4b3 (console_pausestr + 366c)
 x23: ffff0000008c3829 (do_execve.fexecv_proc_title + 26782)
 x24: ffff000000bc1000 (sccs + 8)
 x25:                1
 x26:               25
 x27: ffff000000e71000 (epoch_array + 1e00)
 x28: ffff000000e1c0e0 (thread0_st + 0)
 x29: ffff00010e3b1840
  sp: ffff00010e3b1840
  lr: ffff0000007e83d8 (dpaa2_ni_poll_task + c0)
 elr: ffff0000007e84b4 (dpaa2_ni_poll_task + 19c)
spsr:         60000045
 far:                0
 esr:         96000004
panic: vm_fault failed: ffff0000007e84b4 error 1
cpuid = 3
time = 1651998794
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
data_abort() at data_abort+0x2f0
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x96000004
dpaa2_ni_poll_task() at dpaa2_ni_poll_task+0x19c
taskqueue_run_locked() at taskqueue_run_locked+0x178
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
mcbridematt commented 2 years ago

I can almost consistently trigger this by running 2 x iperf3 flows through the machine simultaneously (Two ports "client", two ports "server")

I can't get a exact backtrace from the minidump (the stack disappears due to the fault), but I believe the problem is the call to fq->consume on line 2565 in 0e0f350b0d5346a0761c0e485893ea36bce0b742. This is inside dpaa2_ni_consume_frames which appears to have been inlined inside dpaa2_ni_poll_task.

https://github.com/mcusim/freebsd-src/blob/0e0f350b0d5346a0761c0e485893ea36bce0b742/sys/dev/dpaa2/dpaa2_ni.c#L2561-L2567

I can upload the vmcore / minidump if you would like.

dsalychev commented 2 years ago

fq->consume points to one of those three functions: https://github.com/mcusim/freebsd-src/blob/0e0f350b0d5346a0761c0e485893ea36bce0b742/sys/dev/dpaa2/dpaa2_ni.c#L1098-L1106 https://github.com/mcusim/freebsd-src/blob/0e0f350b0d5346a0761c0e485893ea36bce0b742/sys/dev/dpaa2/dpaa2_ni.c#L1109-L1127 https://github.com/mcusim/freebsd-src/blob/0e0f350b0d5346a0761c0e485893ea36bce0b742/sys/dev/dpaa2/dpaa2_ni.c#L1129-L1138 Could you add a debug output to understand which one was called at the moment when the kernel panicked?

mcbridematt commented 2 years ago

I have remote (serial) debug working with kgdb now, so I can inspect the variables in dpaa2_ni_consume_frames when the panic occurs.

fq is 0 (NULL) which is why this results in an exception/abort. In fact, most values of dq->fdr (where fq) is are:

print *dq
$2 = {{common = {verb = 96 '`', _reserved = "\003", '\000' <repeats 61 times>}, fdr = {desc = {verb = 96 '`', stat = 3 '\003', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 0 '\000', fqid = 0, _reserved1 = 0, fq_byte_cnt = 0,
        fq_frm_cnt = 0, fqd_ctx = 0}, fd = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 3 '\003', state = 0 '\000', _reserved = 0 '\000',
      rid_tok = 0, ctx = 0}}}
(kgdb) bt
#0  kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/freebsd-src/sys/kern/subr_kdb.c:506
#1  0xffff000000468ce8 in vpanic (fmt=0xffff00000091e329 "vm_fault failed: %lx error %d", ap=...) at /usr/src/freebsd-src/sys/kern/kern_shutdown.c:963
#2  0xffff000000468a58 in panic (fmt=0x12 <error: Cannot access memory at address 0x12>) at /usr/src/freebsd-src/sys/kern/kern_shutdown.c:899
#3  0xffff000000776150 in data_abort (td=0xffffa0000c807000, frame=0xffff0000f9fd66a0, esr=2516582404, far=<optimized out>, lower=0) at /usr/src/freebsd-src/sys/arm64/arm64/trap.c:364
#4  0xffff0000007750e0 in do_el1h_sync (td=0xffffa0000c807000, frame=0xffff0000f9fd66a0) at /usr/src/freebsd-src/sys/arm64/arm64/trap.c:462
#5  <signal handler called>
#6  0xffff0000007d1578 in dpaa2_ni_consume_frames (chan=0xffff000101d8e000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2565
#7  dpaa2_ni_poll_task (arg=0xffff000101d8e000, count=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2423
#8  0xffff0000004c9718 in taskqueue_run_locked (queue=queue@entry=0xffffa0000a85c300) at /usr/src/freebsd-src/sys/kern/subr_taskqueue.c:477
#9  0xffff0000004ca6c8 in taskqueue_thread_loop (arg=arg@entry=0xffff000101d86798) at /usr/src/freebsd-src/sys/kern/subr_taskqueue.c:794
#10 0xffff00000041fdf8 in fork_exit (callout=0xffff0000004ca5fc <taskqueue_thread_loop>, arg=0xffff000101d86798, frame=0xffff0000f9fd6990) at /usr/src/freebsd-src/sys/kern/kern_fork.c:1102
#11 <signal handler called>
(kgdb) frame 6
#6  0xffff0000007d1578 in dpaa2_ni_consume_frames (chan=0xffff000101d8e000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2565
2565                                    KASSERT((
(kgdb) info locals
retries = 1
fq = 0x0
rc = 37
dq = 0xffff0001020d8000
fd = 0xffff0001020d8020
frames = <optimized out>
(kgdb) print *dq
$8 = {{common = {verb = 96 '`', _reserved = "\003", '\000' <repeats 61 times>}, fdr = {desc = {verb = 96 '`', stat = 3 '\003', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 0 '\000', fqid = 0, _reserved1 = 0, fq_byte_cnt = 0,
        fq_frm_cnt = 0, fqd_ctx = 0}, fd = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 3 '\003', state = 0 '\000', _reserved = 0 '\000',
      rid_tok = 0, ctx = 0}}}
(kgdb) print *fd
$9 = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}

I added a KASSERT((fq != 0),...) just before fq->consume to confirm this, and the assert trigged as expected:

panic: dpaa2_ni_consume_frames: fq is 0
cpuid = 7
time = 1652643273
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
kdb_backtrace() at kdb_backtrace+0x38
vpanic() at vpanic+0x17c
panic() at panic+0x44
dpaa2_ni_poll_task() at dpaa2_ni_poll_task+0x2c0
taskqueue_run_locked() at taskqueue_run_locked+0xac
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic

(kgdb) frame 3
#3  0xffff0000007d167c in dpaa2_ni_consume_frames (chan=0xffff0000fca59000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2565
2565                                    KASSERT((fq != 0), ("%s: fq is 0", __func__));
(kgdb) print fq
$1 = <optimized out>
(kgdb) info locals
retries = 0
fq = <optimized out>
rc = 37
dq = 0xffff0000fcda1000
fd = 0xffff0000fcda1020
frames = <optimized out>
(kgdb) print *dq
$2 = {{common = {verb = 96 '`', _reserved = "\003", '\000' <repeats 61 times>}, fdr = {desc = {verb = 96 '`', stat = 3 '\003', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 0 '\000', fqid = 0, _reserved1 = 0, fq_byte_cnt = 0,
        fq_frm_cnt = 0, fqd_ctx = 0}, fd = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 3 '\003', state = 0 '\000', _reserved = 0 '\000',
      rid_tok = 0, ctx = 0}}}
(kgdb) print *fd
$3 = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}
dsalychev commented 2 years ago

Hmm, verb = 96 (0x60) means Frame Dequeue Response, but stat = 3 means that the FQ is empty and no frames were delivered to this Volatile Dequeue command. It's just a null response I haven't checked. Please, try 173aa2a1fe738a17dfa5996f4a59032542fb192b.

mcbridematt commented 2 years ago

I think that fixed it! I've run the 4-port iperf stresstest for several long intervals, including a 10 hour one that just completed.

dpni0  1.5K <Link#1>      00:0a:fa:24:24:a3        0     0     0        42G        0     0          0     0
dpni1  1.5K <Link#2>      00:0a:fa:24:24:a4        0     0     0        43G        0     0          0     0
dpni2  1.5K <Link#3>      00:0a:fa:24:24:a5        0     0     0       2.4T        0     0          0     0
dpni3  1.5K <Link#4>      00:0a:fa:24:24:a6        0     0     0       2.4T        0     0          0     0
bridg  1.5K <Link#12>     58:9c:fc:10:0f:76     1.4G     0     0        85G     3.6G     0       4.9T     0

The next panic I have come across is #8 but that appears to be more sporadic.