@dch thanks for all of the details! Btw, was it GENERIC kernel you tested on?
yes.
I've been seeing this a lot (roughly every 5-10 minutes) after 0d574d8ba8b244f40c1484123c5042f49ac642b8 with https://reviews.freebsd.org/D40094. Sometimes it happens so early that the Ten64 doesn't even complete the switch to userland. It may be a generic arm64 issue, but I'm not seeing it on other hardware pushing a lot more traffic.
@dch I'm not sure about https://github.com/mcusim/freebsd-src/commit/0d574d8ba8b244f40c1484123c5042f49ac642b8, but I've modified address translation recently. Could you try https://github.com/mcusim/freebsd-src/commit/718bdb6a71ba4ed1f557f89af1482a10f7b1cb74 and the one before it, https://github.com/mcusim/freebsd-src/commit/74192f9b2d240edbd72215b8ee770485502ce8ee?
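Not from the thread, but for reference, a minimal sketch of how each of those commits could be built and tested, assuming /usr/src is a checkout of the mcusim/freebsd-src fork:

# Sketch only: assumes /usr/src is a checkout of the mcusim/freebsd-src fork
cd /usr/src
git checkout 718bdb6a71ba4ed1f557f89af1482a10f7b1cb74
make -j"$(sysctl -n hw.ncpu)" buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now
# After testing, repeat from 74192f9b2d240edbd72215b8ee770485502ce8ee and compare.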
Sorry it took a while, but 718bdb6a71ba4ed1f557f89af1482a10f7b1cb74 is the culprit. Reverting it and we're all OK again.
The original report here is from Mar 22, but that commit is from May 11, so the time relationship seems wrong for 718bdb6 to be the only issue.
Correct, I thought that was clear from the original title & updated comment.
The "vm_fault failed" panic is still present with 718bdb6 included; panics are frequent, every 5-10 minutes.
@dch Thanks for the summary, that's how I understood the issue. Its root cause, I assume, is different channels accessing bus_dma resources concurrently. You won't see those panics with only one channel up and running. Just FYI, I'm trying to isolate channels within their own tasks and to limit access to shared resources as much as possible.
@dch I've prepared a lot of changes in the https://github.com/mcusim/freebsd-src/tree/dpaa2 branch. Could you try it? A GENERIC kernel had worked for me under high network load for ~14 hours when I stopped the test myself. Btw, I've also discovered that the kernel panics with "undefined instruction" when the Ten64's SoC heats up to 80-90C (sysctl hw.temperature). Please keep an eye on it.
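Not from the thread, just a sketch for keeping an eye on it: a small sh loop that logs the SoC temperature once per second, assuming the hw.temperature sysctl mentioned above exists on this board:

# Sketch: log the SoC temperature once per second (assumes the hw.temperature OID above)
while true; do
    date
    sysctl hw.temperature
    sleep 1
done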
It should be fixed on CURRENT with https://cgit.freebsd.org/src/commit/?id=58983e4b0253ad38a3e1ef2166fedd3133fdb552 merged in.
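As an aside (a sketch, not from the thread): one way to confirm that fix is already part of a given source checkout is to ask git whether the commit is an ancestor of HEAD:

# Sketch: check whether the fix commit is an ancestor of the current checkout
cd /usr/src
git merge-base --is-ancestor 58983e4b0253ad38a3e1ef2166fedd3133fdb552 HEAD \
    && echo "fix present" || echo "fix missing"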
So far LGTM on 15.0-CURRENT - a 3h test (albeit on 1G ifaces only) is stable. Awesome! I need to move some cabling around for 10G, but this is great progress!
thanks @dsalychev
I'm on stable/14 and am planning to switch to releng/14.0 when it's branched off, but it also seems stable.
But regarding the SFP+ ports, I'm not able to connect to them. I have an Intel X520-DA2 card:
ix0@pci0:1:0:0: class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x10fb subvendor=0x8086 subdevice=0x7a11
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet
It links up when plugged in via loopback, but not when I plug it into the Ten64. I haven't reported it yet because I still haven't verified that it works under Linux.
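For what it's worth, a hedged sketch of what one might collect on the FreeBSD side while debugging that link (ifconfig -v prints SFP+ module data where the driver exposes it):

# Sketch: gather link-state and transceiver info for the ix0 port
ifconfig ix0            # link state and negotiated media
ifconfig -v ix0         # SFP+ module/transceiver details, if the driver exposes them
dmesg | grep -i -e sfp -e ix0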
@dch, @pkubaj thanks for all of the tests. Please don't expect SFP+ to be operational at the moment. I've just started working on the design of something I call "sffbus" (similar to miibus(4)).
Using e04c4b4a369df3f1dcbebbdf726193f02af60801, this is still stable. Thanks!
Good to know :) Thanks for testing!
This only reproduces when more cross-dpaa-interface traffic than usual is present. I can trigger it reliably using iperf3. This is on stock CURRENT, not the fork.
At the moment of the crash I had the following running (in tmux over mosh):
while true; vmstat -i | grep dpaa2_io; sleep 1; end
top -SjwHPz -mcpu
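For anyone trying to reproduce: a minimal iperf3 invocation (a sketch; the address and interface pairing are placeholders) that pushes sustained cross-interface traffic:

# Sketch: drive traffic between two dpaa2 interfaces (10.0.0.1 is a placeholder)
iperf3 -s &                        # server bound to one interface's address
iperf3 -c 10.0.0.1 -P 8 -t 600     # client: 8 parallel streams for 10 minutes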