oxidecomputer / maghemite

A routing stack written in Rust.

ddm needs to be resilient to transient underlying platform errors #345

Open davepacheco opened 2 months ago

davepacheco commented 2 months ago

Using:

I launched an a4x2 setup with this environment:

set -o xtrace
export FALCON_DATASET=bigtank/falcon
export ROUTER_MODE=static
export THREE_NODE_RSS=true
export OMICRON=$HOME/omicron-a4x2
set +o xtrace

a4x2 launch succeeded but the system never made the handoff to Nexus. Sled Agent reports:

03:07:38.600Z INFO SledAgent (RSS): Failed to handoff to nexus: Communication Error: error sending request for url (http://[fd00:1122:3344:102::5]:12221/racks/ac3b34f6-d345-4b79-97b4-e7087d2e4fde/initialization-complete): error trying to connect: tcp connect error: Connection refused (os error 146)
    file = sled-agent/src/rack_setup/service.rs:894

It's failing to connect to that Nexus instance's internal API. Over on that Nexus, the log is reporting a bunch of:

18:51:45.162Z WARN 46bf733c-2afb-4fdb-a0c5-5daed739d85b (ServerContext): failed to identify switch slot for dendrite, will retry in 2 seconds
    file = nexus/src/app/mod.rs:1072
    reason = Communication Error: error sending request for url (http://[fd00:1122:3344:103::2]:12225/local/switch-id): operation timed out

I also noticed a bunch of database errors from other Nexus zones:

16:37:41.759Z ERRO e5cfe630-66ed-4b6b-81ec-0ba69c4e301f (ServerContext): querying for requests to collect failed! Internal Error: unexpected database error: replica unavailable: (n5,s5):3 unable to serve request to r221:/Table/33{5-8} [(n4,s4):1, (n3,s3):2, (n5,s5):3, (n2,s2):4, next=6, gen=130]: lost quorum (down: (n4,s4):1,(n3,s3):2); closed timestamp: 1724266139.041620580,0 (2024-08-21 18:48:59); raft status: {"id":"3","term":6,"vote":"1","commit":20,"lead":"0","raftState":"StatePreCandidate","applied":20,"progress":{},"leadtransferee":"0"}: operation "probe" timed out after 59.997s (given timeout 1m0s): result is ambiguous: after 60.00s of attempting command: context deadline exceeded
    error = Internal Error: unexpected database error: replica unavailable: (n5,s5):3 unable to serve request to r153:/Table/2{19-20} [(n4,s4):1, (n3,s3):2, (n5,s5):3, (n2,s2):4, next=6, gen=62]: lost quorum (down: (n4,s4):1,(n3,s3):2); closed timestamp: 1724266224.429079331,0 (2024-08-21 18:50:24); raft status: {"id":"3","term":6,"vote":"1","commit":23,"lead":"0","raftState":"StatePreCandidate","applied":23,"progress":{},"leadtransferee":"0"}: operation "probe" timed out after 59.997s (given timeout 1m0s): result is ambiguous: after 60.00s of attempting command: context deadline exceeded

I dug into CockroachDB and found that 2 of the 5 nodes are reported "dead" and the reason is that their heartbeats are routinely taking more than 5s. Both of these nodes are on the same sled, g1. And those two nodes don't seem to have connectivity to the nodes on other sleds. After a bunch of digging I boiled it down to this observation: from g1's global zone, I cannot ping the underlay address of g3's global zone, but I can ping in the reverse direction. But even the reverse direction fails if I pick a different path.

So this works:

root@g3:~# ping -s -n -i vioif2 fd00:1122:3344:102::1
PING fd00:1122:3344:102::1 (fd00:1122:3344:102::1): 56 data bytes
64 bytes from fd00:1122:3344:102::1: icmp_seq=0. time=0.370 ms
64 bytes from fd00:1122:3344:102::1: icmp_seq=1. time=0.517 ms

This doesn't work:

root@g3:~# ping -s -n -i vioif1 fd00:1122:3344:102::1
PING fd00:1122:3344:102::1 (fd00:1122:3344:102::1): 56 data bytes
^C
----fd00:1122:3344:102::1 PING Statistics----
102 packets transmitted, 0 packets received, 100% packet loss

The question is: where is the packet being lost? @rmustacc helped me map out the various L2 devices that make up the Falcon topology. For my own future reference, you start with the config files in a4x2/.falcon/g{0,1,2,3}.toml, find the Viona NICs there, and look at the corresponding illumos devices to see which simnet the VNIC is over, what simnet that is connected to, and which VNIC is over that. For g1, that's:

vioif0 -> a4x2_g1_vn_vnic0 -> a4x2_g1_vn_sim0 -> a4x2_g0_sn_sim4 -> a4x2_g0_sn_vnic4
vioif1 -> a4x2_g1_vn_vnic1 -> a4x2_g1_vn_sim1 -> a4x2_g3_sn_sim1 -> a4x2_g3_sn_vnic1
vioif2 -> a4x2_g1_vn_vnic2 -> igb0

For g3, it's:

a4x2_g3_vn_vnic3 -> a4x2_g3_vn_sim3 -> a4x2_g0_sn_sim6 -> a4x2_g0_sn_vnic6
a4x2_g3_vn_vnic4 -> a4x2_g3_vn_sim4 -> a4x2_g3_sn_sim4 -> a4x2_g3_sn_vnic4
a4x2_g3_vn_vnic6 -> igb0

(I also learned through this that softnpu is running as an emulated device inside Propolis for the Scrimlet VMs.)

By snooping along these various points, we can figure out exactly where the packet is being dropped. We did that with pfexec snoop -d DEVICE -r icmp6 on ivanova, the machine hosting the a4x2.

From the above, there are two paths from g1's global zone to g3's global zone, and the one in use turns out to be:

It turns out the replies are going back over the other path, which goes through the switch attached to g0 rather than g3:

So the packet is being dropped in g0's softnpu. But why? Either softnpu is broken or the system has configured it wrong. Well, let's look at its routes:

root@oxz_switch:~# swadm route list
Subnet                   Port   Link  Gateway                 Vlan
0.0.0.0/0                qsfp0  0     198.51.101.1
fd00:1122:3344:1::/64    rear0  0     fe80::aa40:25ff:fe00:1
fd00:1122:3344:2::/64    rear1  0     fe80::aa40:25ff:fe00:3
fd00:1122:3344:3::/64    rear3  0     fe80::aa40:25ff:fe00:7
fd00:1122:3344:101::/64  rear0  0     fe80::aa40:25ff:fe00:1
fd00:1122:3344:103::/64  rear3  0     fe80::aa40:25ff:fe00:7
fdb0:a840:2500:1::/64    rear0  0     fe80::aa40:25ff:fe00:1
fdb0:a840:2500:3::/64    rear1  0     fe80::aa40:25ff:fe00:3
fdb0:a840:2500:7::/64    rear3  0     fe80::aa40:25ff:fe00:7

That's odd. We have no route for fd00:1122:3344:102::/64. We do on the other switch (g3):

root@oxz_switch:~# swadm route list
Subnet                   Port   Link  Gateway                 Vlan
0.0.0.0/0                qsfp0  0     198.51.101.5            
fd00:1122:3344:1::/64    rear0  0     fe80::aa40:25ff:fe00:2  
fd00:1122:3344:2::/64    rear1  0     fe80::aa40:25ff:fe00:4  
fd00:1122:3344:3::/64    rear3  0     fe80::aa40:25ff:fe00:8  
fd00:1122:3344:101::/64  rear0  0     fe80::aa40:25ff:fe00:2  
fd00:1122:3344:102::/64  rear1  0     fe80::aa40:25ff:fe00:4  
fd00:1122:3344:103::/64  rear3  0     fe80::aa40:25ff:fe00:8  
fdb0:a840:2500:1::/64    rear0  0     fe80::aa40:25ff:fe00:2  
fdb0:a840:2500:3::/64    rear1  0     fe80::aa40:25ff:fe00:4  
fdb0:a840:2500:5::/64    rear2  0     fe80::aa40:25ff:fe00:6  
fdb0:a840:2500:7::/64    rear3  0     fe80::aa40:25ff:fe00:8  

This seems likely to be the problem! But why is that route missing there? g0's mg-ddm does seem to know about the 102 prefix:

root@oxz_switch:~# /opt/oxide/mg-ddm/bin/ddmadm get-prefixes | sort
Destination              Next Hop                Path
fd00:1122:3344:101::/64  fe80::aa40:25ff:fe00:1  g0
fd00:1122:3344:102::/64  fe80::aa40:25ff:fe00:3  g1
fd00:1122:3344:103::/64  fe80::aa40:25ff:fe00:7  g3
fd00:1122:3344:1::/64    fe80::aa40:25ff:fe00:1  g0
fd00:1122:3344:2::/64    fe80::aa40:25ff:fe00:3  g1
fd00:1122:3344:3::/64    fe80::aa40:25ff:fe00:7  g3
fdb0:a840:2500:1::/64    fe80::aa40:25ff:fe00:1  g0
fdb0:a840:2500:3::/64    fe80::aa40:25ff:fe00:3  g1
fdb0:a840:2500:5::/64    fe80::aa40:25ff:fe00:5  g2
fdb0:a840:2500:7::/64    fe80::aa40:25ff:fe00:7  g3

@FelixMcFelix helped dig in and mentioned that it's mg-ddm that's responsible for keeping Dendrite (and thus the switch) up to date with the prefixes found. It's at this point that I start to lose the plot about what happened.

Searching through the mg-ddm log for that prefix, we find:

18:45:02.675Z WARN slog-rs: [tfportrear1_0] neighbor g1@fe80::aa40:25ff:fe00:3 missed solicit interval
18:46:26.738Z WARN slog-rs: [tfportrear1_0] neighbor g1@fe80::aa40:25ff:fe00:3 missed solicit interval
18:46:41.558Z WARN slog-rs: [tfportrear3_0] neighbor g3@fe80::aa40:25ff:fe00:7 missed solicit interval
18:46:51.998Z WARN slog-rs: [tfportrear0_0] neighbor g0@fe80::aa40:25ff:fe00:1 expire
18:46:53.541Z WARN slog-rs: [tfportrear3_0] neighbor g3@fe80::aa40:25ff:fe00:7 expire
18:46:53.541Z WARN slog-rs: [tfportrear2_0] neighbor g2@fe80::aa40:25ff:fe00:5 expire
18:46:53.541Z WARN slog-rs: [tfportrear0_0] expiring peer fe80::aa40:25ff:fe00:1 due to discovery event
18:46:53.541Z WARN slog-rs: [tfportrear2_0] expiring peer fe80::aa40:25ff:fe00:5 due to discovery event
18:46:53.542Z WARN slog-rs: [tfportrear3_0] expiring peer fe80::aa40:25ff:fe00:7 due to discovery event
18:46:53.542Z INFO slog-rs: [tfportrear0_0] nbr is fe80::aa40:25ff:fe00:1@g0 server
18:46:53.542Z INFO slog-rs: [tfportrear3_0] nbr is fe80::aa40:25ff:fe00:7@g3 server
18:46:53.542Z INFO slog-rs: [tfportrear2_0] nbr is fe80::aa40:25ff:fe00:5@g2 server
18:46:53.542Z WARN slog-rs: [tfportrear1_0] neighbor g1@fe80::aa40:25ff:fe00:3 expire
18:46:53.542Z INFO slog-rs: [tfportrear0_0] removing routes 3 from dendrite
18:46:53.542Z INFO slog-rs: [tfportrear1_0] nbr is fe80::aa40:25ff:fe00:3@g1 server
18:46:53.542Z WARN slog-rs: [tfportrear1_0] expiring peer fe80::aa40:25ff:fe00:3 due to discovery event

and then:

18:47:40.161Z INFO slog-rs: [tfportrear1_0] adding route fd00:1122:3344:102:: -> fe80::aa40:25ff:fe00:3 on port Rear(RearPort(1))/0
18:47:40.162Z WARN slog-rs: [tfportrear0_0] exchange pull: timeout error: deadline has elapsed
18:47:40.719Z WARN slog-rs: [tfportrear0_0] exchange pull: timeout error: deadline has elapsed
18:47:40.772Z INFO slog-rs: [tfportrear0_0] sending 3 routes to dendrite
18:47:40.773Z DEBG slog-rs: [tfportrear0_0] sending to dpd host=localhost port=12224
18:47:41.344Z INFO slog-rs: [tfportrear0_0] adding route fdb0:a840:2500:1:: -> fe80::aa40:25ff:fe00:1 on port Rear(RearPort(0))/0
18:47:49.987Z ERRO slog-rs: [tfportrear3_0] dpd route delete: Communication Error: error sending request for url (http://localhost:12224/route/ipv6/fd00:1122:3344:3::%2F64): operation timed out
18:47:53.148Z ERRO slog-rs: [tfportrear1_0] dpd route create: Communication Error: error sending request for url (http://localhost:12224/route/ipv6): operation timed out
18:47:54.015Z ERRO slog-rs: [tfportrear2_0] dpd route create: Communication Error: error sending request for url (http://localhost:12224/route/ipv6): operation timed out
18:47:54.015Z INFO slog-rs: [tfportrear1_0] adding route fdb0:a840:2500:3:: -> fe80::aa40:25ff:fe00:3 on port Rear(RearPort(1))/0

It's a little clearer if we grep for tfportrear1_0:

18:45:02.675Z WARN slog-rs: [tfportrear1_0] neighbor g1@fe80::aa40:25ff:fe00:3 missed solicit interval
18:46:26.738Z WARN slog-rs: [tfportrear1_0] neighbor g1@fe80::aa40:25ff:fe00:3 missed solicit interval
18:46:53.542Z WARN slog-rs: [tfportrear1_0] neighbor g1@fe80::aa40:25ff:fe00:3 expire
18:46:53.542Z INFO slog-rs: [tfportrear1_0] nbr is fe80::aa40:25ff:fe00:3@g1 server
18:46:53.542Z WARN slog-rs: [tfportrear1_0] expiring peer fe80::aa40:25ff:fe00:3 due to discovery event
18:46:54.093Z INFO slog-rs: [tfportrear1_0] removing routes 3 from dendrite
18:47:25.397Z ERRO slog-rs: [tfportrear1_0] failed to remove tunnel routes: {} OPTE driver is not attached
18:47:25.397Z DEBG slog-rs: [tfportrear1_0] redistributing expire to 3 peers
18:47:25.397Z DEBG slog-rs: [tfportrear1_0] transition solicit -> exchange
18:47:25.398Z INFO slog-rs: [tfportrear1_0] exchange: listening on [fe80::aa40:25ff:fe0a:46ca]:56797
18:47:25.398Z INFO slog-rs: [tfportrear1_0] waiting for exchange server to start
18:47:25.398Z INFO slog-rs: [tfportrear1_0] push from fe80::aa40:25ff:fe00:3: Update {
...
18:47:27.004Z WARN slog-rs: [tfportrear1_0] exchange pull: timeout error: deadline has elapsed
18:47:27.919Z INFO slog-rs: [tfportrear1_0] push from fe80::aa40:25ff:fe00:3: Update {
...
18:47:29.339Z WARN slog-rs: [tfportrear1_0] exchange pull: timeout error: deadline has elapsed
18:47:29.844Z INFO slog-rs: [tfportrear1_0] sending 3 routes to dendrite
18:47:29.844Z DEBG slog-rs: [tfportrear1_0] sending to dpd host=localhost port=12224
18:47:34.924Z INFO slog-rs: [tfportrear1_0] push from fe80::aa40:25ff:fe00:3: Update {
        underlay: None,
        tunnel: None,
    }
18:47:40.161Z INFO slog-rs: [tfportrear1_0] adding route fd00:1122:3344:102:: -> fe80::aa40:25ff:fe00:3 on port Rear(RearPort(1))/0
18:47:53.148Z ERRO slog-rs: [tfportrear1_0] dpd route create: Communication Error: error sending request for url (http://localhost:12224/route/ipv6): operation timed out
18:47:54.015Z INFO slog-rs: [tfportrear1_0] adding route fdb0:a840:2500:3:: -> fe80::aa40:25ff:fe00:3 on port Rear(RearPort(1))/0
18:48:00.662Z INFO slog-rs: [tfportrear1_0] push from fe80::aa40:25ff:fe00:3: Update {

...
18:48:03.342Z INFO slog-rs: [tfportrear1_0] adding route fd00:1122:3344:2:: -> fe80::aa40:25ff:fe00:3 on port Rear(RearPort(1))/0
18:48:05.139Z INFO slog-rs: [tfportrear1_0] removing routes 0 from dendrite
18:48:05.918Z INFO slog-rs: [tfportrear1_0] push from fe80::aa40:25ff:fe00:3: Update {
...
18:48:05.918Z DEBG slog-rs: [tfportrear1_0] redistributing update to 3 peers
18:48:06.516Z INFO slog-rs: [tfportrear1_0] push from fe80::aa40:25ff:fe00:3: Update {
...
18:48:15.980Z WARN slog-rs: [tfportrear1_0] neighbor g1@fe80::aa40:25ff:fe00:3 missed solicit interval
18:48:23.175Z INFO slog-rs: [tfportrear1_0] push from fe80::aa40:25ff:fe00:3: Update {
...
18:48:36.140Z WARN slog-rs: [tfportrear1_0] neighbor g1@fe80::aa40:25ff:fe00:3 missed solicit interval
20:19:17.982Z WARN slog-rs: [tfportrear1_0] neighbor g1@fe80::aa40:25ff:fe00:3 missed solicit interval
...

At this point it seems like:

Over in Dendrite, we do have one instance of this message from Dropshot:

18:48:02.304Z WARN dpd: request handling cancelled (client disconnected)
    latency_us = 451330
    local_addr = [::1]:12224
    method = PUT
    remote_addr = [::1]:52819
    req_id = e6b0a18b-d80d-42ab-813d-e2fd0cc8b90c
    server_id = 1
    unit = api-server
    uri = /route/ipv6
...
18:48:03.386Z WARN dpd: request completed after handler was already cancelled
    local_addr = [::1]:12224
    method = PUT
    remote_addr = [::1]:52819
    req_id = e6b0a18b-d80d-42ab-813d-e2fd0cc8b90c
    response_code = 204
    server_id = 1
    unit = api-server
    uri = /route/ipv6

These are consistent with the client-side timeouts reported by mg-ddm. Note that if this happens, it shouldn't actually affect the request because Dropshot won't cancel the request handler. So even if this happened with all the "add route" requests, I think this wouldn't explain why the routes were not ultimately added.
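To make that concrete, here is a minimal sketch in plain tokio (not Dropshot itself, and not the actual mg-ddm or dpd code) of the general point: a client that stops waiting does not stop the server-side work, so a client-observed timeout on an add-route request does not, by itself, mean the route was never installed.

// Sketch only: a timed-out client does not undo or stop the server-side work.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let route_installed = Arc::new(AtomicBool::new(false));
    let flag = route_installed.clone();

    // Stands in for the server-side handler: slow, but runs to completion.
    let handler = tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(2)).await;
        flag.store(true, Ordering::SeqCst);
    });

    // Stands in for the client: gives up long before the handler finishes.
    let client = tokio::time::timeout(
        Duration::from_millis(500),
        tokio::time::sleep(Duration::from_secs(2)),
    );
    assert!(client.await.is_err()); // the client sees a timeout...

    handler.await.unwrap();
    assert!(route_installed.load(Ordering::SeqCst)); // ...but the work still landed
}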

But why do we only have one of these on the server, while we have a bunch of timeouts on the client:

root@oxz_switch:~# grep -h 'timed out' /var/svc/log/*mg-ddm* | grep -v '/metrics' | looker
18:47:34.924Z ERRO slog-rs: [tfportrear3_0] dpd route delete: Communication Error: error sending request for url (http://localhost:12224/route/ipv6/fd00:1122:3344:103::%2F64): operation timed out
18:47:49.987Z ERRO slog-rs: [tfportrear3_0] dpd route delete: Communication Error: error sending request for url (http://localhost:12224/route/ipv6/fd00:1122:3344:3::%2F64): operation timed out
18:47:53.148Z ERRO slog-rs: [tfportrear1_0] dpd route create: Communication Error: error sending request for url (http://localhost:12224/route/ipv6): operation timed out
18:47:54.015Z ERRO slog-rs: [tfportrear2_0] dpd route create: Communication Error: error sending request for url (http://localhost:12224/route/ipv6): operation timed out
18:48:01.916Z ERRO slog-rs: [tfportrear0_0] dpd route create: Communication Error: error sending request for url (http://localhost:12224/route/ipv6): operation timed out
Aug 21 18:40:38.676 WARN failed to lookup Nexus IP, will retry, error: "request timed out", delay: 137.299197ms, component: producer-registration-task, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:391

The only explanations I can come up with are:


That's about as far as we got. I should mention that I saw this note in the docs:

Note that due to oxidecomputer/sidecar-lite#25 and #26, when ROUTER_MODE=static, after running pfexec ./a4x2 launch, you must ping the gateways from both switches (g0 and g3). You can see what the gateway switches are from the switch zone by running swadm route ls.

I believe we correctly applied that workaround and it made no difference here. And from what I can tell, the static routing config only affects what's upstream of the switches, not the rack-internal routing, so I think this is unrelated to my use of static routing here.

davepacheco commented 2 months ago

I've tar'd up the logs from both switch zones and put them on catacomb:/staff/core/maghemite-345:

dap@catacomb /staff/core/maghemite-345 $ ls -l
total 442434
-rw-rw-r--+  1 dap      staff    113941385 Aug 22 23:28 g0_oxz_switch_logs.tgz
-rw-rw-r--+  1 dap      staff    114086728 Aug 22 23:29 g3_oxz_switch_logs.tgz

I'll also leave the environment around in this state for a few hours in case anybody wants to look. I'll have to put your ssh key into root's authorized_keys but otherwise you should be able to read stuff from the lab network.

g0 172.20.2.183
g1 172.20.2.160
g3 172.20.2.168
rcgoodfellow commented 2 months ago

My intuition here is that the request from dendrite to softnpu hung. The communication between these two entities is over a virtual UART using TTY and all the loveliness that entails. While I've tried to make that channel reasonably robust, it still remains a bit hacky.

But more importantly, I think this has revealed a weakness in ddm that needs to be addressed. In handle_underlay_update we make a call to sys::add_underlay_routes

https://github.com/oxidecomputer/maghemite/blob/c92d6ff85db8992066f49da176cf686acfd8fe0f/ddm/src/exchange.rs#L955-L981

which eventually calls add_routes_dendrite

https://github.com/oxidecomputer/maghemite/blob/c92d6ff85db8992066f49da176cf686acfd8fe0f/ddm/src/sys.rs#L99-L120

In add_routes_dendrite we call the dendrite route_ipv6_set client endpoint, and if an error occurs we simply log it and move on.

https://github.com/oxidecomputer/maghemite/blob/c92d6ff85db8992066f49da176cf686acfd8fe0f/ddm/src/sys.rs#L230-L234

This means we cannot recover from transient errors.
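A narrow, shorter-term mitigation would be to retry transient dpd failures at this call site with bounded backoff instead of logging once and moving on. A rough sketch, where the closure stands in for the route_ipv6_set call and none of the names are the actual dpd-client API:

// Sketch only: retry a fallible, idempotent dataplane-programming operation
// with bounded exponential backoff rather than giving up on the first error.
use std::time::Duration;

async fn with_retries<F, Fut, T, E>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    let mut delay = Duration::from_millis(250);
    let mut attempt = 1;
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_attempts => {
                eprintln!("dpd request failed (attempt {attempt}): {e}; retrying");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(8));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}

// Hypothetical call site, roughly: with_retries(|| client.route_ipv6_set(&params), 5).await

Even that only papers over the problem, though: if dendrite stays unreachable for longer than the retry budget, or ddmd restarts, the route is still lost, which is what motivates the restructuring described next.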

I really need to go split ddm into a proper upper/lower half like mgd so we are iteratively driving the ASIC state to what has been determined by the upper-half protocol.

rcgoodfellow commented 2 months ago

To see if this is a persistent issue, I went to the owner of the missing prefix fd00:1122:3344:102::/64 (sled g1) and re-announced it:

root@g1:~# /opt/oxide/mg-ddm/ddmadm advertise-prefixes fd00:1122:3344:102::/64

Now, back in the switch zone, we see the prefix:

root@oxz_switch:~# swadm route ls
Subnet                   Port   Link  Gateway                 Vlan
0.0.0.0/0                qsfp0  0     198.51.101.1
fd00:1122:3344:1::/64    rear0  0     fe80::aa40:25ff:fe00:1
fd00:1122:3344:2::/64    rear1  0     fe80::aa40:25ff:fe00:3
fd00:1122:3344:3::/64    rear3  0     fe80::aa40:25ff:fe00:7
fd00:1122:3344:101::/64  rear0  0     fe80::aa40:25ff:fe00:1
fd00:1122:3344:102::/64  rear1  0     fe80::aa40:25ff:fe00:3
fd00:1122:3344:103::/64  rear3  0     fe80::aa40:25ff:fe00:7
fd00:1701::d/64          rear2  0     fe80::99
fdb0:a840:2500:1::/64    rear0  0     fe80::aa40:25ff:fe00:1
fdb0:a840:2500:3::/64    rear1  0     fe80::aa40:25ff:fe00:3
fdb0:a840:2500:7::/64    rear3  0     fe80::aa40:25ff:fe00:7

and the ping from g3 now works:

root@g3:~# ping -s -n -i vioif1 fd00:1122:3344:102::1
PING fd00:1122:3344:102::1 (fd00:1122:3344:102::1): 56 data bytes
64 bytes from fd00:1122:3344:102::1: icmp_seq=0. time=1.009 ms
64 bytes from fd00:1122:3344:102::1: icmp_seq=1. time=0.440 ms
^C
----fd00:1122:3344:102::1 PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 0.440/0.725/1.009/0.402

So, in short: if we had an upper/lower architecture for mg-ddm with a state-driver/reconciler execution model, this would not have resulted in a permanent bad state.
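Roughly, the lower half could be a reconciler like the sketch below. All of the types and names here are placeholders rather than the real mg-ddm, mgd, or dpd-client interfaces; the point is the shape: the upper half (DDM peering/exchange) only edits the desired route set, and a background task keeps converging the ASIC toward it, so a failed or timed-out dpd request is simply retried on the next pass.

// Sketch of a lower-half reconciler. Desired state lives outside any single
// RPC, so transient dpd/softnpu failures only delay convergence instead of
// permanently losing routes.
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::time::Duration;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct Route {
    dest: String,    // e.g. "fd00:1122:3344:102::/64"
    nexthop: String, // e.g. "fe80::aa40:25ff:fe00:3"
    port: String,    // e.g. "rear1/0"
}

// Placeholder for the dataplane (dendrite/softnpu) programming interface.
trait Dataplane {
    fn get_routes(&self) -> Result<HashSet<Route>, String>;
    fn add_route(&self, r: &Route) -> Result<(), String>;
    fn del_route(&self, r: &Route) -> Result<(), String>;
}

// The upper half only mutates this; it never talks to the ASIC directly.
type Desired = Arc<Mutex<HashSet<Route>>>;

async fn reconcile_forever(desired: Desired, dp: impl Dataplane) {
    loop {
        let want = desired.lock().unwrap().clone();
        match dp.get_routes() {
            Ok(have) => {
                for r in want.difference(&have) {
                    if let Err(e) = dp.add_route(r) {
                        eprintln!("add {:?} failed: {e}; will retry next pass", r);
                    }
                }
                for r in have.difference(&want) {
                    if let Err(e) = dp.del_route(r) {
                        eprintln!("del {:?} failed: {e}; will retry next pass", r);
                    }
                }
            }
            Err(e) => eprintln!("reading dataplane state failed: {e}; will retry"),
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}

Under that model, the timed-out PUTs to /route/ipv6 above would have left a lingering difference between desired and actual state that the next pass closes automatically, instead of requiring a manual ddmadm advertise-prefixes to nudge the system back into shape.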