davepacheco opened this issue 2 months ago
I've tar'd up the logs from both switch zones and put them on catacomb:/staff/core/maghemite-345:
```
dap@catacomb /staff/core/maghemite-345 $ ls -l
total 442434
-rw-rw-r--+ 1 dap staff 113941385 Aug 22 23:28 g0_oxz_switch_logs.tgz
-rw-rw-r--+ 1 dap staff 114086728 Aug 22 23:29 g3_oxz_switch_logs.tgz
```
I'll also leave the environment around in this state for a few hours in case anybody wants to look. I'll have to put your ssh key into root's authorized_keys but otherwise you should be able to read stuff from the lab network.
```
g0 172.20.2.183
g1 172.20.2.160
g3 172.20.2.168
```
My intuition here is that the request from dendrite to softnpu hung. The communication between these two entities is over a virtual UART using TTY and all the loveliness that entails. While I've tried to make that channel reasonably robust, it still remains a bit hacky.
But more importantly, I think this has revealed a weakness in ddm that needs to be addressed. In `handle_underlay_update` we make a call to `sys::add_underlay_routes`, which eventually calls `add_routes_dendrite`. In `add_routes_dendrite` we call the dendrite `route_ipv6_set` client endpoint, and if an error occurs we simply log it and move on. This means we cannot recover from transient errors. I really need to go split ddm into a proper upper/lower half like mgd, so that we are iteratively driving the ASIC state to what has been determined by the upper-half protocol.
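To make the proposed split concrete, here is a minimal sketch of what a lower-half reconciler could look like. This is not maghemite's actual code; the names (`Dataplane`, `reconcile`, `run_lower_half`) and the string-based prefix type are hypothetical, and a real lower half would talk to the dendrite client rather than a trait. The point is that a failed route-set call only delays convergence instead of being logged and forgotten.

```rust
// Hypothetical sketch of a ddm "lower half": keep driving dataplane state
// toward the prefix set chosen by the upper half, rather than logging a
// failed route-set call and moving on.
use std::collections::HashSet;
use std::thread::sleep;
use std::time::Duration;

type Prefix = String; // e.g. "fd00:1122:3344:102::/64"

/// Stand-in for the switch dataplane (dendrite/softnpu in the real system).
trait Dataplane {
    fn routes(&self) -> Result<HashSet<Prefix>, String>;
    fn add_route(&self, p: &Prefix) -> Result<(), String>;
    fn remove_route(&self, p: &Prefix) -> Result<(), String>;
}

/// One reconciliation pass: diff desired vs. actual and apply the delta.
/// Errors are returned rather than swallowed, so the caller retries the pass.
fn reconcile(desired: &HashSet<Prefix>, dp: &dyn Dataplane) -> Result<(), String> {
    let actual = dp.routes()?;
    for p in desired.difference(&actual) {
        dp.add_route(p)?;
    }
    for p in actual.difference(desired) {
        dp.remove_route(p)?;
    }
    Ok(())
}

/// Lower-half loop: a transient dendrite failure only delays convergence;
/// it can no longer leave a prefix permanently missing from the switch.
fn run_lower_half(desired: HashSet<Prefix>, dp: &dyn Dataplane) -> ! {
    loop {
        if let Err(e) = reconcile(&desired, dp) {
            eprintln!("reconcile failed, will retry: {e}");
        }
        sleep(Duration::from_secs(1));
    }
}
```

In a model like this, the failed route-set call seen here would have been retried on the next pass, and the `fd00:1122:3344:102::/64` route would eventually have been programmed without a manual re-announce.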
To see if this is a persistent issue, I went to the owner of the missing prefix `fd00:1122:3344:102::/64`, sled g1, and re-announced it:

```
root@g1:~# /opt/oxide/mg-ddm/ddmadm advertise-prefixes fd00:1122:3344:102::/64
```
Now, back in the switch zone, we see the prefix:

```
root@oxz_switch:~# swadm route ls
Subnet                   Port   Link  Gateway                 Vlan
0.0.0.0/0                qsfp0  0     198.51.101.1
fd00:1122:3344:1::/64    rear0  0     fe80::aa40:25ff:fe00:1
fd00:1122:3344:2::/64    rear1  0     fe80::aa40:25ff:fe00:3
fd00:1122:3344:3::/64    rear3  0     fe80::aa40:25ff:fe00:7
fd00:1122:3344:101::/64  rear0  0     fe80::aa40:25ff:fe00:1
fd00:1122:3344:102::/64  rear1  0     fe80::aa40:25ff:fe00:3
fd00:1122:3344:103::/64  rear3  0     fe80::aa40:25ff:fe00:7
fd00:1701::d/64          rear2  0     fe80::99
fdb0:a840:2500:1::/64    rear0  0     fe80::aa40:25ff:fe00:1
fdb0:a840:2500:3::/64    rear1  0     fe80::aa40:25ff:fe00:3
fdb0:a840:2500:7::/64    rear3  0     fe80::aa40:25ff:fe00:7
```
and the ping from g3 now works:

```
root@g3:~# ping -s -n -i vioif1 fd00:1122:3344:102::1
PING fd00:1122:3344:102::1 (fd00:1122:3344:102::1): 56 data bytes
64 bytes from fd00:1122:3344:102::1: icmp_seq=0. time=1.009 ms
64 bytes from fd00:1122:3344:102::1: icmp_seq=1. time=0.440 ms
^C
----fd00:1122:3344:102::1 PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms) min/avg/max/stddev = 0.440/0.725/1.009/0.402
```
So, in short: if we had an upper/lower architecture for mg-ddm with a state-driver/reconciler execution model, this would not have resulted in a permanent bad state.
Using commit 5ba7808685dcbfa5c4ef0bc251d27d16d1671304, I launched an a4x2 setup with this environment:

`a4x2 launch` succeeded, but the system never made the handoff to Nexus. Sled Agent reports:

It's failing to connect to that Nexus instance's internal API. Over on that Nexus, the log is reporting a bunch of:
I also noticed a bunch of database errors from other Nexus zones:
I dug into CockroachDB and found that 2 of the 5 nodes are reported "dead" and the reason is that their heartbeats are routinely taking more than 5s. Both of these nodes are on the same sled, g1. And those two nodes don't seem to have connectivity to the nodes on other sleds. After a bunch of digging I boiled it down to this observation: from g1's global zone, I cannot ping the underlay address of g3's global zone, but I can ping in the reverse direction. But even the reverse direction fails if I pick a different path.
So this works:
This doesn't work:
The question is: where is the packet being lost? @rmustacc helped me map out the various L2 devices that make up the Falcon topology. For my own future reference: you start with the config files in `a4x2/.falcon/g{0,1,2,3}.toml`, find the Viona NICs there, and look at the corresponding illumos devices to see which simnet each VNIC is over, what that simnet is connected to, and which VNIC is over that. For g1, that's:

For g3, it's:
(I also learned through this that softnpu is running as an emulated device inside Propolis for the Scrimlet VMs.)
By snooping along these various points, we can figure out exactly where the packet is being dropped. We did that with `pfexec snoop -d DEVICE -r icmp6` on ivanova, the machine hosting the a4x2. From the above, there are two paths from g1's global zone to g3's global zone, and the one in use turns out to be:
It turns out the replies are going back over the other path, which goes through the switch attached to g0 rather than g3:
So the packet is being dropped in g0's softnpu. But why? Either softnpu is broken or the system has configured it wrong. Well, let's look at its routes:
That's odd. We have no route for `fd00:1122:3344:102::/64`. We do on the other switch (g3):

This seems likely to be the problem! But why do we have no routes there? g0's mgd does seem to know about the 102 prefix:
@FelixMcFelix helped dig in and mentioned that it's mg-ddm that's responsible for keeping Dendrite (and thus the switch) up to date with the prefixes found. It's at this point that I start to lose the plot about what happened.
Searching through the mg-ddm log for that prefix, we find:
and then:
It's a little clearer if we grep for `tfportrear1_0`:
At this point it seems like:
Over in Dendrite, we do have one instance of this message from Dropshot:
These are consistent with the client-side timeouts reported by mg-ddm. Note that if this happens, it shouldn't actually affect the request because Dropshot won't cancel the request handler. So even if this happened with all the "add route" requests, I think this wouldn't explain why the routes were not ultimately added.
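As a toy illustration of that last point, here is a plain tokio sketch (not Dropshot itself, just the same task-spawning pattern, and it assumes the tokio crate): the client stops waiting after its timeout, but the handler running in its own spawned task still finishes its work.

```rust
// Client-side timeout vs. server-side work: the spawned "handler" task keeps
// running and completes even though the "client" gave up waiting. This is why
// client timeouts alone shouldn't prevent routes from being programmed.
use std::time::Duration;
use tokio::{sync::oneshot, time::timeout};

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();

    // "Server handler": slow work running in its own task.
    let handler = tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(2)).await;
        println!("handler: route added");
        let _ = tx.send(()); // the response may arrive after the client gave up
    });

    // "Client": only willing to wait one second for the response.
    match timeout(Duration::from_secs(1), rx).await {
        Ok(_) => println!("client: got response"),
        Err(_) => println!("client: timed out, giving up"),
    }

    // The handler still finishes regardless of the client-side timeout.
    handler.await.unwrap();
}
```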
But why do we only have one of these on the server, while we have a bunch of timeouts on the client:
The only explanations I can come up with are:
That's about as far as we got. I should mention that I saw this note in the docs:
I believe we correctly applied that workaround and it made no difference here. And from what I can tell, the static routing config only affects what's upstream of the switches, not the rack-internal routing, so I think this is unrelated to my use of static routing here.