Open brian90013 opened 4 years ago
Last week I upgraded one of my machines to follow FreeBSD-12.2-BETA1. This morning I observed the same reboot behavior with the cxgbe driver in emulated mode on 12.2.
uname
FreeBSD tr 12.2-BETA1 FreeBSD 12.2-BETA1 r365761 GENERIC amd64
I have updated the test machine to run FreeBSD tr 12.2-STABLE FreeBSD 12.2-STABLE #2 r366977M. I am still seeing this issue with multiple drivers, but with INVARIANTS enabled I have a bit more output to report.
Output is above: fatal trap 12 with the instruction pointer set to 0. It appears the process was a worker thread for the network driver.
Same fatal trap, same fault code, same instruction pointer. This time the debugger reports thread irq308: t6nex0: 1a4 as the culprit. The first cxgbe function in the stack trace is t4_vi_intr(), which is an interrupt handler "for vectors shared between NIC and netmap rx queues". Here we are in emulated mode, so the netmap-specific code in the driver should not be running.
Same trap, fault, instruction pointer. Thread irq353: mlx5_core1 is running and the stack trace ends with mlx5e_rx_cq_comp().
So I see the same behavior with all three NICs, and the common code is the emulated netmap adapter. The problem is seen only when the receiver has multiple threads running. My guess is that while pkt-gen is shutting down, the netmap driver has already removed itself from some of the queues when data arrives on other queues, and the NIC then attempts to call a netmap handler that has been removed, causing the fault. That would explain why it only occurs when receiving with multiple threads and why the stack traces all point to NIC IRQ code, not anything in the netmap driver.
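To make the guess above concrete, here is a small user-space sketch of the suspected pattern. It is not netmap or driver code: `fake_nic`, `rx_hook`, `netmap_like_rx`, `detach_queue` and `dispatch_rx_guarded` are all hypothetical names invented for illustration. It only shows how calling through a per-queue hook that teardown has already cleared would jump to address 0 (matching an instruction pointer of 0), and how a dispatch path that checks the hook first avoids that.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_QUEUES 4

/* Hypothetical per-queue receive hook; a stand-in for whatever the
 * emulated adapter installs on each RX queue. */
typedef int (*rx_hook_t)(int queue);

struct fake_nic {
    rx_hook_t rx_hook[NUM_QUEUES];
};

/* Hypothetical handler: returns a recognizable value per queue. */
static int netmap_like_rx(int queue)
{
    return queue + 100;
}

/* Install the handler on every queue (the "attached" state). */
static void attach_all(struct fake_nic *nic)
{
    for (int q = 0; q < NUM_QUEUES; q++)
        nic->rx_hook[q] = netmap_like_rx;
}

/* Teardown of a single queue: the hook is cleared. If an interrupt for
 * this queue then did nic->rx_hook[q](q) without checking, the call
 * would branch to address 0. */
static void detach_queue(struct fake_nic *nic, int q)
{
    assert(q >= 0 && q < NUM_QUEUES);
    nic->rx_hook[q] = NULL;
}

/* Guarded dispatch: read the hook once, and skip the call if the queue
 * has already been detached. Returns -1 in that case. */
static int dispatch_rx_guarded(const struct fake_nic *nic, int q)
{
    rx_hook_t hook = nic->rx_hook[q];

    if (hook == NULL)
        return -1;
    return hook(q);
}
```

In a real kernel a NULL check alone would presumably not be enough, since the interrupt thread and the teardown path run concurrently. Some form of synchronization or draining of in-flight interrupts before the hook is cleared would likely be needed. The sketch is only meant to illustrate why a partially detached set of queues could produce this kind of fault.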
Problem
I have seen the following issue on two different FreeBSD servers using three different NICs. I run two copies of pkt-gen on the same server; one as the transmitter and one as the receiver with the two ports connected by a loopback cable. Netmap is used in generic adapter mode (the only option for mlx5en, used for comparison testing with cxgbe and igb) and the receiver is configured for more than one queue.
When I stop the receiving pkt-gen using CTRL-C, I will frequently observe the machine immediately reboot. I have used sysctl debug.kdb.panic=1 (BEWARE, this will panic your system immediately) to verify my machines handle panics and write out a core image. I do not see the normal panic behavior in this case. Instead the machine immediately reboots. One of my machines does print to /var/log/messages and the messages about the fault are below. I see the same problem using each of the three NICs listed.

I see the same problem using my own netmap application, but have used pkt-gen here for ease of reproducing the problem. Therefore I don't think it is a problem with pkt-gen but perhaps with the underlying generic adapter code. I am hoping to use multiple queues in generic adapter mode to benefit from NICs with hardware queues that don't have native netmap support. Thank you for your help!
Procedure
Process output
Transmitter output
Receiver
/var/log/messages