askfongjojo opened 8 months ago
Looks like a side effect of #5022. The complete nexus log can be found on catacomb at /staff/dock/rack2/omicron-5026/nexus.log.
On the CRDB in sled 9 (which corresponds to the endpoint [fd00:1122:3344:105::3]:32221 in the nexus error message), I saw these log lines in cockroach.log about 10 seconds before nexus logged the database error:
root@oxz_cockroachdb_4c3ef132:/data/logs# less cockroach.oxzcockroachdb4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.root.2024-02-08T05_44_53Z.012360.log
...
W240208 05:51:23.117813 1624 server/node.go:894 ⋮ [n2,summaries] 1067 health alerts detected: {Alerts:[{StoreID:0 Category:‹METRICS› Description:‹round-trip-latency-p90› Value:1.342177279e+09}]}
W240208 05:51:23.365682 19747 kv/kvserver/raft_transport.go:674 ⋮ [n2] 1068 while processing outgoing Raft queue to node 5: ‹EOF›:
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 ‹[core]› ‹[Channel #282 SubChannel #283] grpc: addrConn.createTransport failed to connect to {›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹ "Addr": "[fd00:1122:3344:108::3]:32221",›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹ "ServerName": "[fd00:1122:3344:108::3]:32221",›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹ "Attributes": null,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹ "BalancerAttributes": null,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹ "Type": 0,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹ "Metadata": null›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection"›
I240208 05:51:24.105617 754 1@gossip/gossip.go:1547 ⋮ [n2] 1070 node has connected to cluster via gossip
W240208 05:51:24.227311 22840 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 1071 closed timestamps side-transport connection dropped from node: 3
W240208 05:51:24.341680 23133 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 1072 closed timestamps side-transport connection dropped from node: 1
W240208 05:51:25.085137 19055 2@rpc/clock_offset.go:216 ⋮ [n2,rnode=5,raddr=‹[fd00:1122:3344:108::3]:32221›,class=system,heartbeat] 1073 latency jump (prev avg 669.83ms, current 5950.24ms)
The complete CRDB log can be found on catacomb at /staff/dock/rack2/omicron-5026/cockroachdb.log.
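For anyone lining the two logs up the same way, a minimal sketch of the scan, assuming copies of both logs saved under the catacomb paths above (the nexus-side error text is an assumption and may need loosening):

# Nexus side: find when the database connection errors start
# ("connection reset" is a guess at the exact message text).
grep -in 'connection reset' nexus.log | head

# CRDB side: everything logged in the minute leading up to them
# (05:50-05:51 UTC on 2024-02-08 in this instance).
grep -E '^[WEI]240208 05:5[01]:' cockroachdb.log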
I checked the 4 most recent nexus log files to see what precedes the database connection reset errors. The preceding activity is not consistent: region deletes, background NAT RPW tasks, and blueprint planner tasks all show up. At this point it is unclear whether disk create or #5022 is causing the issue; the failed disk creates and their unwinds may be symptoms rather than causes. @mkeeter and @lefttwo saw multiple timeouts between the upstairs and downstairs, so we may be dealing with a rack-level networking issue. More debugging is needed.
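(A context scan along these lines is enough to see what each instance was doing just before the resets; the matched string is again an assumption about the exact nexus error text.)

# Show the log lines preceding each connection reset, to see which
# saga or background task was active at the time.
grep -B 20 -i 'connection reset' nexus.log | less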
Given that Nexus lost its connection to CockroachDB, the crash is #2416. The question is why that connection was lost.
I checked the nexus core files that we collected from dogfood and confirmed that they each panicked at the point linked above.
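(For anyone else poking at the cores: on an illumos host a quick way to confirm the panic stack is mdb; the core path below is illustrative.)

# Load a core and print the panicking thread's status and stack.
mdb /path/to/nexus.core
> ::status
> $C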
Notes:
The network failure is being worked as dendrite issue: https://github.com/oxidecomputer/dendrite/issues/846
This started happening after rack2 was updated to omicron commit e88e06bf8e5eed16b42ac78ca536b06fdd0dc183. I was using terraform to create a bunch of disks; many started failing and attempted to unwind. I'll include the link to the complete nexus logs in a bit, and I'm also tracking down what the first disk_create failure was complaining about.
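(A sketch of that search, with the caveat that the surrounding log format and error wording are assumptions:)

# First disk_create-related failure in the nexus log, plus some context.
grep -n -i 'disk_create' nexus.log | grep -iE 'error|fail' | head -1
grep -A 10 -m 1 -i 'disk_create' nexus.log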