oxidecomputer / omicron

Omicron: Oxide control plane

Nexus crashed upon hitting CRDB `connection reset by peer [exhausted]` error #5026

Open askfongjojo opened 8 months ago

askfongjojo commented 8 months ago
05:51:33.436Z ERRO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): delete_crucible_region: region_get saw Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "7a3a9fb8-10c9-47d6-b878-edb41d8cf9d2", "content-length": "84", "date": "Thu, 08 Feb 2024 05:51:33 GMT"}; value: Error { error_code: None, message: "Not Found", request_id: "7a3a9fb8-10c9-47d6-b878-edb41d8cf9d2" }
    file = nexus/src/app/sagas/common_storage.rs:162
    region_id = 1936ad0b-9d9a-4682-97c2-c64f1594b216
    saga_type = recovery
thread 'tokio-runtime-worker' panicked at nexus/db-queries/src/db/sec_store.rs:65:60:
called `Result::unwrap()` on an `Err` value: InternalError { internal_message: "database error (kind = Unknown): result is ambiguous: error=rpc error: code = Unavailable desc = error reading from server: read tcp [fd00:1122:3344:109::3]:49862->[fd00:1122:3344:105::3]:32221: read: connection reset by peer [exhausted]\n" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Feb  8 05:52:04 Stopping because all processes in service exited. ]
[ Feb  8 05:52:04 Executing stop method (:kill). ]

This started happening after rack2 was updated to omicron commit e88e06bf8e5eed16b42ac78ca536b06fdd0dc183. I was using terraform to create a bunch of disks; many started failing and attempted to unwind. I'll include the link to the complete nexus logs in a bit, and I'm also tracking down what the first disk_create failure was complaining about.
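For context on the panic line itself: sec_store.rs is the saga execution coordinator (SEC) store in Nexus, which persists saga log events to CockroachDB. Below is a minimal sketch of the failing pattern, with hypothetical types and names rather than the actual omicron code: an unwrap() on the result of the database write means any transient connection error (like the connection reset above) takes down the whole Nexus process.

// Minimal sketch of the panicking pattern; types and names are hypothetical.
// The real panic site is nexus/db-queries/src/db/sec_store.rs:65.
#[derive(Debug)]
struct InternalError {
    internal_message: String,
}

struct SagaEvent; // placeholder for a saga log entry being persisted

struct Datastore;

impl Datastore {
    // Stand-in for the CockroachDB write of a saga log event.
    async fn write_saga_event(&self, _event: &SagaEvent) -> Result<(), InternalError> {
        Err(InternalError {
            internal_message: "result is ambiguous: connection reset by peer".to_string(),
        })
    }
}

struct SecStore {
    datastore: Datastore,
}

impl SecStore {
    async fn record_event(&self, event: SagaEvent) {
        // A transient CRDB error surfaces as Err here, and the unwrap()
        // turns it into the process-wide panic seen in the log above.
        self.datastore.write_saga_event(&event).await.unwrap();
    }
}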

askfongjojo commented 8 months ago

Looks like a side effect of #5022. The complete nexus log can be found on catacomb - /staff/dock/rack2/omicron-5026/nexus.log.

askfongjojo commented 8 months ago

On the CRDB in sled 9 (which corresponds to the endpoint [fd00:1122:3344:105::3]:32221 in the nexus error message), I saw these log lines in cockroach.log about 10 seconds before nexus logged the database error:

root@oxz_cockroachdb_4c3ef132:/data/logs# less cockroach.oxzcockroachdb4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.root.2024-02-08T05_44_53Z.012360.log
...
W240208 05:51:23.117813 1624 server/node.go:894 ⋮ [n2,summaries] 1067  health alerts detected: {Alerts:[{StoreID:0 Category:‹METRICS› Description:‹round-trip-latency-p90› Value:1.342177279e+09}]}
W240208 05:51:23.365682 19747 kv/kvserver/raft_transport.go:674 ⋮ [n2] 1068  while processing outgoing Raft queue to node 5: ‹EOF›:
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069  ‹[core]›‹[Channel #282 SubChannel #283] grpc: addrConn.createTransport failed to connect to {›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Addr": "[fd00:1122:3344:108::3]:32221",›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "ServerName": "[fd00:1122:3344:108::3]:32221",›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Attributes": null,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "BalancerAttributes": null,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Type": 0,›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹  "Metadata": null›
W240208 05:51:23.366456 23418 vendor/google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1069 +‹}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection"›
I240208 05:51:24.105617 754 1@gossip/gossip.go:1547 ⋮ [n2] 1070  node has connected to cluster via gossip
W240208 05:51:24.227311 22840 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 1071  closed timestamps side-transport connection dropped from node: 3
W240208 05:51:24.341680 23133 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 1072  closed timestamps side-transport connection dropped from node: 1
W240208 05:51:25.085137 19055 2@rpc/clock_offset.go:216 ⋮ [n2,rnode=5,raddr=‹[fd00:1122:3344:108::3]:32221›,class=system,heartbeat] 1073  latency jump (prev avg 669.83ms, current 5950.24ms)

The complete CRDB log can be found in catacomb, /staff/dock/rack2/omicron-5026/cockroachdb.log.

askfongjojo commented 8 months ago

I checked the 4 most recent nexus log files to see what precedes the database connection reset errors. The preceding activity is not consistent: there are region deletes, background NAT RPW tasks, and blueprint planner tasks. At this point it is unclear whether disk create or #5022 is causing the issue; the failed disk creates or their unwinds may be symptoms rather than causes. @mkeeter and @lefttwo saw multiple timeouts between the upstairs and downstairs, so we may be dealing with a rack-level networking issue. More debugging is needed.

davepacheco commented 8 months ago

Given that Nexus lost its connection to CockroachDB, the crash itself is #2416. The question is why that connection was lost.
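For what it's worth, here is a sketch of what more tolerant behavior could look like; this is only an illustration under the assumption that the saga-log write is idempotent, not the actual fix tracked by #2416, and the function and parameter names are hypothetical. The idea is that the SEC store could retry the write with capped backoff instead of unwrapping, so a transient CRDB connection loss stalls saga progress rather than killing Nexus.

// Hedged sketch only; names here are hypothetical, not omicron's API.
use std::time::Duration;

async fn write_with_retry<F, Fut, E>(mut write_event: F)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), E>>,
    E: std::fmt::Debug,
{
    // Retry the (assumed idempotent) saga-log write with capped exponential
    // backoff instead of panicking on the first error.
    let mut delay = Duration::from_millis(100);
    loop {
        match write_event().await {
            Ok(()) => return,
            Err(e) => {
                eprintln!("saga log write failed, will retry: {e:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(10));
            }
        }
    }
}

Saga execution would then pause until CockroachDB is reachable again rather than taking the whole Nexus process down with it.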

augustuswm commented 8 months ago

Checked the nexus core files that we collected from dogfood and confirmed that they each panicked at the point linked above.

morlandi7 commented 7 months ago

Notes:

Nieuwejaar commented 7 months ago

The network failure is being tracked as a dendrite issue: https://github.com/oxidecomputer/dendrite/issues/846