mcusim / freebsd-src

sys/dev/dpaa2 drivers work-in-progress
https://www.FreeBSD.org/

kernel panics #20

Closed snail59 closed 1 year ago

snail59 commented 1 year ago

Hello,

Recently, my ten64 started crashing multiple times per day.

dpni9 and dpni8 are 10Gb NICs; in my case, dpni9 faces the internet while dpni8 serves my local network. I also use VLANs, mostly for VMware virtual machines.

Communication between the internet and my local network works fine, but each of the scenarios below ends in a panic.

I tried different scenarios:

1) dpni8 for my local network + VLANs. As soon as a VM (so in a VLAN) begins to communicate with any other network, the ten64 crashes:

 Fatal data abort:
  x0: ffffa0001f96d800
  x1:                0
  x2:                2
  x3:       80893c5000
  x4: ffff0000008d26ac (generic_bs_w_4 + 0)
  x5: ffff0000f85cb820 (_DYNAMIC + f6bba868)
  x6:                0
  x7:                0
  x8:             40c0
  x9: ffff000161bc50c0 (__stop_set_sysinit_set + 1180d20)
 x10:              5ea
 x11:              5ea
 x12:                1
 x13:             2af8
 x14:               12
 x15:             2af8
 x16:             28b2
 x17:             28b1
 x18: ffff0000f85cb6a0 (_DYNAMIC + f6bba6e8)
 x19: ffff00011474b000
 x20: ffff000112429000
 x21: ffff00011474b100
 x22: ffff000114e07020
 x23: ffffa0001f96d800
 x24:                0
 x25: ffff000112459520
 x26:                0
 x27: ffff000000d9c618 (Giant + 18)
 x28: ffffa000058d9a80
 x29: ffff0000f85cb6e0 (_DYNAMIC + f6bba728)
  sp: ffff0000f85cb6a0
  lr: ffff00000092be84 (dpaa2_ni_rx + f0)
 elr: ffff00000092beb0 (dpaa2_ni_rx + 11c)
spsr:               45
 far:               10
 esr:         96000044
panic: vm_fault failed: ffff00000092beb0 error 1
cpuid = 0
time = 1686765529
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
data_abort() at data_abort+0x308
handle_el1h_sync() at handle_el1h_sync+0x14
--- exception, esr 0x96000044
dpaa2_ni_rx() at dpaa2_ni_rx+0x11c
dpaa2_ni_poll() at dpaa2_ni_poll+0x84
dpaa2_io_intr() at dpaa2_io_intr+0x16c
ithread_loop() at ithread_loop+0x3fc
fork_exit() at fork_exit+0x88
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 12 tid 100126 ]
Stopped at      kdb_enter+0x44: str     xzr, [x19, #1152]

2) dpni8 for the local network and dpni5 for VLANs. This configuration is less unstable, but it eventually crashes anyway:

panic: dpaa2_ni_rx: unexpected physical address: fd(0xa2ec8000) != buf(0xa3564000)
cpuid = 7
time = 1686903243
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
dpaa2_ni_rx() at dpaa2_ni_rx+0x2a0
dpaa2_ni_poll() at dpaa2_ni_poll+0x84
dpaa2_io_intr() at dpaa2_io_intr+0x16c
ithread_loop() at ithread_loop+0x3fc
fork_exit() at fork_exit+0x88
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 12 tid 100119 ]
Stopped at      kdb_enter+0x44: str     xzr, [x19, #1152]

3) Using 13.2. It crashes quickly, even without activity:

Fatal data abort:
  x0:                0
  x1:                0
  x2:                0
  x3:               40
  x4:               3f
  x5: ffff0000e6cf7000
  x6:  a000cfea4bfa322
  x7: 20450008c22724fa
  x8: ffff000000f34000
  x9: ffffa00000000000
 x10: 7784f7f643fcc836
 x11: 3fc367d3c6c5532f
 x12: 4eaef46f559c1721
 x13: d038e45d25e875d2
 x14: 927044e716003300
 x15: 188034c38f645e30
 x16:  1010000f61e2170
 x17: e73c84cf86bd0a08
 x18: ffff00015d2fded0
 x19: ffffa00002270100
 x20: ffffa0002114dc00
 x21: ffffa0001942f400
 x22: ffffa00040a6a000
 x23:                4
 x24: ffff0001137e8000
 x25: ffff000113899100
 x26: ffff00011389a340
 x27: ffff000113899120
 x28:               34
 x29: ffff00015d2fded0
  sp: ffff00015d2fded0
  lr: ffff0000007cf904
 elr: ffff0000007eaeec
spsr:         80000045
 far:               30
 esr:         96000004
panic: vm_fault failed: ffff0000007eaeec
cpuid = 2
time = 1686762562
KDB: stack backtrace:
#0 0xffff0000004fd02c at kdb_backtrace+0x60
#1 0xffff0000004a8328 at vpanic+0x13c
#2 0xffff0000004a81e8 at panic+0x44
#3 0xffff0000007f42e0 at data_abort+0x200
#4 0xffff0000007d3010 at handle_el1h_sync+0x10
#5 0xffff0000007cf900 at bounce_bus_dmamap_sync+0x74
#6 0xffff0000007cf900 at bounce_bus_dmamap_sync+0x74
#7 0xffff00000081b0dc at dpaa2_ni_transmit+0x3c4
#8 0xffff0000005df3c4 at ether_output_frame+0xd4
#9 0xffff0000005df200 at ether_output+0x664
#10 0xffff00000063b258 at ip_output+0x1320
#11 0xffff000000655ab4 at tcp_output+0x1e8c
#12 0xffff00000066b788 at tcp_usr_send+0x1f4
#13 0xffff000000557c4c at sosend_generic+0x598
#14 0xffff000000558364 at sosend+0x3c
#15 0xffff00000052c978 at soo_write+0x44
#16 0xffff000000521fb0 at dofilewrite+0x7c
#17 0xffff000000521a98 at sys_write+0xb8

Tell me if you need more information or tests from me. My build machine is fast, so I do not mind compiling multiple times.
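A note on reading the dumps above: the esr value printed with each "Fatal data abort" can be split into fields defined by the Arm architecture (exception class in bits [31:26], WnR in bit 6, fault status code in bits [5:0]). The helper below is only a user-space sketch for decoding those three fields; the names esr_info, esr_decode and esr_is_kernel_data_abort are made up for illustration and are not part of the FreeBSD kernel or the dpaa2 driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: the ESR_EL1 fields relevant to a data abort. */
struct esr_info {
	unsigned ec;	/* exception class, bits [31:26] */
	bool	 write;	/* WnR, bit 6: true if the faulting access was a write */
	unsigned dfsc;	/* data fault status code, bits [5:0] */
};

/* Split a raw esr value, as printed in the panic, into its fields. */
static struct esr_info
esr_decode(uint64_t esr)
{
	struct esr_info info;

	info.ec = (unsigned)((esr >> 26) & 0x3f);
	info.write = ((esr >> 6) & 1) != 0;
	info.dfsc = (unsigned)(esr & 0x3f);
	return (info);
}

/* EC 0x25 is a data abort taken without a change in exception level. */
static bool
esr_is_kernel_data_abort(uint64_t esr)
{
	return (esr_decode(esr).ec == 0x25);
}
```

Applied to the values above, 0x96000044 (scenario 1) decodes to a kernel-mode data abort on a write with fault status 0x04 (translation fault, level 0), and 0x96000004 (scenario 3) is the same fault on a read. Together with the very small far values (0x10 and 0x30), this is consistent with an access through a NULL pointer plus a small field offset, although the dumps alone do not prove that.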

dsalychev commented 1 year ago

Thanks for reporting. It looks like this shares a root cause with #19, at least. Please try https://github.com/mcusim/freebsd-src/issues/19#issuecomment-1555888989. Panics won't go away entirely, but you shouldn't see crashes so often. In the meantime, I'm trying to solve the root cause itself.

snail59 commented 1 year ago

OK Dmitry. I had seen that bug but thought it was a different problem. Sorry.

I am going to try it now and will let you know.

snail59 commented 1 year ago

I reverted 718bdb6 and it has been working waaaay better since. So you are certainly right, the problem is the same as in the other issue.

Do you want me to close this issue?

May I humbly suggest reverting the commit in the source tree for now, as it would prevent other people from hitting this problem? This is up to you, of course.

dsalychev commented 1 year ago

> Do you want me to close this issue?

I'll close it as a duplicate of #19.

> May I humbly suggest reverting the commit in the source tree for now, as it would prevent other people from hitting this problem? This is up to you, of course.

It's important to keep the panic unmasked with https://github.com/mcusim/freebsd-src/commit/718bdb6a71ba4ed1f557f89af1482a10f7b1cb74 because it will help me verify that the upcoming patches solve the root cause.

dsalychev commented 1 year ago

@snail59 Please try https://github.com/mcusim/freebsd-src/tree/dpaa2. A GENERIC kernel worked for me for ~14 hours under high network load, until I stopped the test myself.

details: https://github.com/mcusim/freebsd-src/issues/19#issuecomment-1651444388

snail59 commented 1 year ago

@dsalychev I just saw your email. I will test and let you know.

snail59 commented 1 year ago

So, I tried it and quickly got a kernel panic:

Fatal data abort:
  x0: 0xffffa0000da24000
  x1: 0xffffa00014b58600
  x2: 0x000000000000000e
  x3: 0xffff0000f853d598 (_DYNAMIC + 0xf6b1e5e0)
  x4: 0xffff0000f853d3c6 (_DYNAMIC + 0xf6b1e40e)
  x5: 0xffffa00014b58670
  x6: 0x0a000cfea4bfa322
  x7: 0x0008c12724fa0a00
  x8: 0xffffa0000da24000
  x9: 0x0000000000000000
 x10: 0x000000000000004a
 x11: 0xffffa00014b58660
 x12: 0x0000000000000000
 x13: 0x0000000000000000
 x14: 0xffff0000f853d358 (_DYNAMIC + 0xf6b1e3a0)
 x15: 0x000000004f3c790b
 x16: 0x0000000000000008
 x17: 0xffffa0001406eae7
 x18: 0xffff0000f853d340 (_DYNAMIC + 0xf6b1e388)
 x19: 0xffffa00014b58600
 x20: 0xffffa0000da24000
 x21: 0x0000000000000000
 x22: 0xffff0000f853d578 (_DYNAMIC + 0xf6b1e5c0)
 x23: 0x000000000000000e
 x24: 0xffff0000f853d3b8 (_DYNAMIC + 0xf6b1e400)
 x25: 0x000000000000000e
 x26: 0x0000000000000008
 x27: 0x0000000000000000
 x28: 0x000000003300a8c0
 x29: 0xffff0000f853d340 (_DYNAMIC + 0xf6b1e388)
  sp: 0xffff0000f853d340
  lr: 0xffff000000929020 (dpaa2_ni_transmit + 0x38)
 elr: 0xffff0000009290a0 (dpaa2_ni_transmit + 0xb8)
spsr: 0x0000000040000045
 far: 0x0000000000002ee8
 esr: 0x0000000096000004
panic: vm_fault failed: 0xffff0000009290a0 error 1
cpuid = 5
time = 1690627307
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
data_abort() at data_abort+0x308
handle_el1h_sync() at handle_el1h_sync+0x14
--- exception, esr 0x96000004
dpaa2_ni_transmit() at dpaa2_ni_transmit+0xb8
ether_output_frame() at ether_output_frame+0xd0
ether_output() at ether_output+0x664
ip_output_send() at ip_output_send+0xe8
ip_output() at ip_output+0x1394
ip_forward() at ip_forward+0x474
ip_input() at ip_input+0x924
netisr_dispatch_src() at netisr_dispatch_src+0xf0
ether_demux() at ether_demux+0x158
ether_nh_input() at ether_nh_input+0x39c
netisr_dispatch_src() at netisr_dispatch_src+0xf0
ether_input() at ether_input+0x48
uether_rxflush() at uether_rxflush+0x98
cdce_ncm_bulk_read_callback() at cdce_ncm_bulk_read_callback+0xb0
usbd_callback_wrapper() at usbd_callback_wrapper+0x6cc
usb_command_wrapper() at usb_command_wrapper+0x84
usb_callback_proc() at usb_callback_proc+0x16c
usb_process() at usb_process+0x124
fork_exit() at fork_exit+0x88
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 15 tid 100103 ]
Stopped at      kdb_enter+0x44: str     xzr, [x19, #1152]

snail59 commented 1 year ago

I tried rebuilding everything once again and booting without the USB modem, but it still panics:

  Fatal data abort:
  x0: 0x0000000000000000
  x1: 0xffffa00014bf0600
  x2: 0x000000000000000e
  x3: 0xffff000102640828
  x4: 0xffff0001026406f6
  x5: 0xffffa00014bf0670
  x6: 0x0a000cfea4bfa322
  x7: 0x0008c12724fa0a00
  x8: 0x0000000000000000
  x9: 0x0000000000000000
 x10: 0x0000000000000036
 x11: 0xffffa00014bf0660
 x12: 0x0000000000000008
 x13: 0xffffa00014e217e0
 x14: 0x0000000700000000
 x15: 0x0000000000000039
 x16: 0xffff00010264075f
 x17: 0xffffa00014e18267
 x18: 0xffff000102640670
 x19: 0xffffa00014bf0600
 x20: 0x0000000000000000
 x21: 0xffffa0000791a800
 x22: 0xffff000102640808
 x23: 0x000000000000000e
 x24: 0xffff0001026406e8
 x25: 0x000000000000000e
 x26: 0x0000000000000008
 x27: 0x0000000000000000
 x28: 0x000000003300a8c0
 x29: 0xffff000102640670
  sp: 0xffff000102640670
  lr: 0xffff000000929020 (dpaa2_ni_transmit + 0x38)
 elr: 0xffff00000092909c (dpaa2_ni_transmit + 0xb4)
spsr: 0x0000000040000045
 far: 0x00000000000002b8
 esr: 0x0000000096000004
panic: vm_fault failed: 0xffff00000092909c error 1
cpuid = 0
time = 21
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
data_abort() at data_abort+0x308
handle_el1h_sync() at handle_el1h_sync+0x14
--- exception, esr 0x96000004
dpaa2_ni_transmit() at dpaa2_ni_transmit+0xb4
ether_output_frame() at ether_output_frame+0xd0
ether_output() at ether_output+0x664
ip_output_send() at ip_output_send+0xe8
ip_output() at ip_output+0x1394
pf_intr() at pf_intr+0x240
ithread_loop() at ithread_loop+0x3fc
fork_exit() at fork_exit+0x88
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 12 tid 100268 ]
Stopped at      kdb_enter+0x44: str     xzr, [x19, #1152]

snail59 commented 1 year ago

@dsalychev out of curiosity, does it make sense? Is there something obvious? Do you need more information?

dsalychev commented 1 year ago

It definitely does. Could you try a85d6c9ad5fe4de8cb3bc651253a1717fb28505c?

I've probably made a mistake with:

static int
dpaa2_ni_transmit(if_t ifp, struct mbuf *m)
{
...
    /* Transmit mbuf on the same interface it was received from */
    if (m->m_pkthdr.rcvif != NULL) {
        sc = if_getsoftc(m->m_pkthdr.rcvif);
    }
...
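One plausible reading of the snippet above, given the earlier backtrace in which a packet received through uether_rxflush() is forwarded out via dpaa2_ni_transmit(): when a packet arrived on a different driver's interface, m->m_pkthdr.rcvif points at that interface, so the softc taken from it is not this driver's state. The model below is a user-space sketch of that hazard only. The types model_ifnet and model_mbuf and both functions are hypothetical stand-ins, the default of ifp's softc before the check is an assumption (that part of the original is elided), and none of this is taken from the fix commit.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; NOT the real definitions. */
struct model_ifnet {
	const char *driver;	/* name of the driver that owns the interface */
	void	   *softc;	/* that driver's private state */
};

struct model_mbuf {
	struct model_ifnet *rcvif;	/* interface the packet arrived on */
};

/*
 * Pattern from the snippet: prefer the receiving interface's softc.
 * Assumption: sc otherwise defaults to the transmitting interface's softc.
 */
static void *
softc_from_rcvif(struct model_ifnet *ifp, struct model_mbuf *m)
{
	void *sc = ifp->softc;

	if (m->rcvif != NULL)
		sc = m->rcvif->softc;
	return (sc);
}

/* Alternative: always use the interface that was asked to transmit. */
static void *
softc_from_ifp(struct model_ifnet *ifp, struct model_mbuf *m)
{
	(void)m;	/* the receiving interface is deliberately ignored */
	return (ifp->softc);
}
```

In this model, a forwarded packet whose rcvif belongs to another driver makes softc_from_rcvif() return that other driver's state, while softc_from_ifp() always returns the transmitting interface's own state. In a real driver, treating foreign state as its own softc would lead to accesses at unrelated or invalid addresses.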

snail59 commented 1 year ago

Just tried it. So far so good, it did not panic on boot and has been running fine for 1 hour.

dsalychev commented 1 year ago

@snail59 sounds good :) Please keep it under load for some time and try the original scenario in which the kernel panicked.

snail59 commented 1 year ago

So, it has been running for some time, and I have had no problems at all. I could do way more than my original scenario. I did not run tests before and after, so I cannot compare performance.

Good work mate :-)

dsalychev commented 1 year ago

> I could do way more than my original scenario

That's really good :) I hope I'll be able to commit those changes before 14.0. Thanks for testing!

snail59 commented 1 year ago

I hope so too! Otherwise (unless you revert your last commit), FreeBSD 14.0 won't install on the ten64 :-/

snail59 commented 1 year ago

For your information (I know this is not your code's fault): you rebased your branch on main while main currently has a problem that breaks compilation. The message is:

ld: error: /usr/obj/traverse/sources/git/usr/src/arm64.aarch64/tmp/usr/lib/libcompiler_rt.a(absvdi2.o) is incompatible with /usr/obj/traverse/sources/git/usr/src/arm64.aarch64/tmp/usr/lib32/crti.o

I have not yet figured out which commit is at fault, nor what exactly is happening, as I am not a developer :-D

dsalychev commented 1 year ago

It looks like you have to recompile the world as well. This worked for me:

$ nice make -s -j30 tinderbox TARGETS="arm64"

snail59 commented 1 year ago

> It looks like you have to recompile the world as well. This worked for me:
>
> $ nice make -s -j30 tinderbox TARGETS="arm64"

For information (because this has nothing to do with the current issue): I dug into it, and the build fails when using ccache and succeeds without it. I suspect this has been the case since the LIB32 option was enabled on aarch64. Also, some binaries from the build machine are copied to a legacy/bin subdirectory of $MAKEOBJDIRPREFIX. They are then used at installation time, so installation breaks if the build was cross-compiled!