Open jgallagher opened 11 months ago
I wanted to look at the CockroachDB SMF log files, but they're all empty. I compared with dogfood and they're empty there, too. There are no rotated files in /var/svc/log. I found #3713 and found some useful log files on dogfood (sled 9):
# find /pool/ext/*/crypt/debug/oxz_cockroachdb* -type f -name '*oxide-cockroach*' -not -size 0
/pool/ext/53460a32-da24-4921-9f57-1d35203741ba/crypt/debug/oxz_cockroachdb_dccfea99-7030-4730-b8f9-f701e9124e7c/oxide-cockroachdb:default.log.1690918415
/pool/ext/790dc7c3-7fcf-41ba-b5ac-83adead857cd/crypt/debug/oxz_cockroachdb_dccfea99-7030-4730-b8f9-f701e9124e7c/oxide-cockroachdb:default.log.1691451373
/pool/ext/f040f5e2-146a-48ef-8abe-610af6dba74f/crypt/debug/oxz_cockroachdb_dccfea99-7030-4730-b8f9-f701e9124e7c/oxide-cockroachdb:default.log.1691799001
(is there a better way to find archived log files?)
These do have the stuff I expected to be there (svc.startd messages, xtrace output from the start method, etc.). We don't have these files from the failed job, unfortunately. It seems like we should add those to the pattern of logs that we have buildomat save on failure.
All three nexus logs have many warning that seem to indicate the database wasn't set up correctly, I think?
WARN nexus: Cannot look up rack: Object (of type ById(24ed7902-3649-4e0e-8635-cd082ae9b0c0)) not found: rack file = nexus/src/app/rack.rs:555
I would expect that the "rack" objects are inserted during rack initialization at Nexus (i.e., handoff to Nexus), not when the database is initialized. It's possible that the database was initialized but handoff never happened for some reason. n the Sled Agent log, I do see a lot of failures to handoff to Nexus, including this last one:
15:54:16.561Z INFO SledAgent (RSS): Failed to handoff to nexus: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0", "content-length": "124", "date": "Wed, 16 Aug 2023 15:54:16 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0" }
file = sled-agent/src/rack_setup/service.rs:625
and I see no indication after that of a successful handoff. So I think that explains the warnings and the ultimate failure after the 10-minute timeout. Next will be to look at the Nexus logs to see why these requests are failing.
It looks like of the initialization-complete requests went to the Nexus oxz_nexus_2c05b297-cc92-4a0a-8378-946af887ec91
. Here's a summary of all of that Nexus's requests:
4286e358-034e-46fa-8cba-b5f3595dbaa2 PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/f4b4dc87-ab46-49fb-a4b4-d361ae214c03 404 not found: physical-disk with id "vendor = "synthetic-vendor", serial = "synthetic-serial-f4b4dc87-ab46-49fb-a4b4-d361ae214c03", model = "synthetic-model""
65196cac-d1de-4e42-9e3b-ca60fc699abe PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/24b4dc87-ab46-49fb-a4b4-d361ae214c03 200
f8d78c80-0b74-4886-a9cd-a900ec0abb90 PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/e4b4dc87-ab46-49fb-a4b4-d361ae214c03 404 not found: physical-disk with id "vendor = "synthetic-vendor", serial = "synthetic-serial-e4b4dc87-ab46-49fb-a4b4-d361ae214c03", model = "synthetic-model""
17ffc61a-dbc6-4542-879f-1fe94c0b02bd PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/d462a7f7-b628-40fe-80ff-4e4189e2d62b 404 not found: physical-disk with id "vendor = "synthetic-vendor", serial = "synthetic-serial-d462a7f7-b628-40fe-80ff-4e4189e2d62b", model = "synthetic-model""
cb1121fa-d100-4cfd-91ae-941a87e4064d PUT /physical-disk 200
e8360586-d6dc-479e-aee3-c7658489af10 PUT /physical-disk 200
133bf25a-5946-4042-b46d-d8cbd2362e8b PUT /physical-disk 200
ede72848-d90e-4ea3-b45e-5bd642164579 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
f1e6b7cb-8637-41b0-9e34-40d34465781a PUT /physical-disk 200
78d16e4b-9ec2-442b-9be5-e2c8b2c12d7f PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/a462a7f7-b628-40fe-80ff-4e4189e2d62b 200
d1e3f827-eb08-449e-957f-ee6d903c00a1 PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/b462a7f7-b628-40fe-80ff-4e4189e2d62b 200
93f31b07-9730-4ef2-a72b-b3c7c923901e PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/14b4dc87-ab46-49fb-a4b4-d361ae214c03 200
90a5e166-21dc-4c90-a4b1-9b3391bfba0f PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/f4b4dc87-ab46-49fb-a4b4-d361ae214c03 200
3abb72f6-5538-42da-ab4f-35cfc47cfb85 PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/e4b4dc87-ab46-49fb-a4b4-d361ae214c03 200
2c2481cf-670c-4dcb-9872-36736483502f PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/d462a7f7-b628-40fe-80ff-4e4189e2d62b 200
c5d66dda-ce35-44bb-b952-58d99f9b3a2d PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
3b671939-d2c9-4c52-84d3-7b1e2d3c2edb PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
b70ed9ef-6644-4a1b-813a-f93e0a8c6b0b PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
68eff423-65e6-4338-adaa-b474553377c9 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
258e5501-b9a1-4489-8910-91f2a9dd4bba PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
d7794f35-efa9-4678-ab90-d44558e8ba77 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
ebb29939-3499-4f31-8f07-4ee7fba58a49 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
7e8d8543-fde6-46e1-8b4f-c566310742d3 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
81f11842-46d3-465f-aef7-ac50cde78cfc PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
9105ccab-cfb5-4e81-b5fb-9ccc500d8038 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
85cd68b3-d8bb-43c0-9452-b760d088a4ae PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
a42ac22f-5e98-4696-9d9e-1696ee8a4a7d PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
fbb18155-5ed3-436f-bd32-21380e404198 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
33d16fbc-206a-456f-8453-875400285df5 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
and here's the details of one of those 500s on initialization-complete
:
15:54:16.560Z INFO 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): request completed
error_message_external = Internal Server Error
error_message_internal = unexpected database error: Record not found
file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/8ef2bd2/dropshot/src/server.rs:841
latency_us = 26103
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
response_code = 500
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
There's not much to go on here. This looks like an error path during rack initialization that's not annotating the error object it got back from the database with anything useful about what it was doing. The string "Record not found" does not appear in Omicron. I guessed this was coming from use of Diesel's first()
, which does appear to be the case:
https://github.com/diesel-rs/diesel/blob/v2.1.0/diesel/src/result.rs#L310
I started looking around for calls to first()
to inspect those, but that wasn't very fruitful. There were a bunch and it was hard to tell if they were called from the rack initialization code path.
So I started with the rack initialization code path, which opens here: https://github.com/oxidecomputer/omicron/blob/496b854c059b85d51931ea8c3f209a768708dcc4/nexus/db-queries/src/db/datastore/rack.rs#L398-L416
We can infer something about where the error was generated based on the log entries that we do have for one of the failed requests. I grabbed the request id from the last 500 error and filtered all the messages in the Nexus log for that req_id:
$ grep 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0 nexus.log | looker
15:54:16.535Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.535Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = Query
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Database
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.537Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.539Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = Modify
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Fleet
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.539Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.539Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = Query
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Database
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.540Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.542Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = CreateChild
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Fleet
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.542Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.542Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = Query
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Database
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.547Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.547Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = Query
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Database
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.548Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.550Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = CreateChild
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Fleet
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.550Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.550Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = Query
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Database
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.552Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.552Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = Query
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Database
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.553Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.555Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = CreateChild
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Fleet
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.555Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.555Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = Query
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Database
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.557Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.559Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = ListChildren
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = IpPoolList
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.559Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
roles = RoleSet { roles: {} }
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.559Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
action = Query
actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
resource = Database
result = Ok(())
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.560Z INFO 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): request completed
error_message_external = Internal Server Error
error_message_internal = unexpected database error: Record not found
file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/8ef2bd2/dropshot/src/server.rs:841
latency_us = 26103
local_addr = [fd00:1122:3344:101::b]:12221
method = PUT
remote_addr = [fd00:1122:3344:101::1]:44358
req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
response_code = 500
uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
We can summarize the actions and results:
$ grep 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0 nexus.log | grep '"authorize result"' | json -ga resource action result
Database Query Ok(())
Fleet Modify Ok(())
Database Query Ok(())
Fleet CreateChild Ok(())
Database Query Ok(())
Database Query Ok(())
Fleet CreateChild Ok(())
Database Query Ok(())
Database Query Ok(())
Fleet CreateChild Ok(())
Database Query Ok(())
IpPoolList ListChildren Ok(())
Database Query Ok(())
That ListChildren
on IpPoolList
is interesting. That is checked here (possibly other places too, I didn't look):
https://github.com/oxidecomputer/omicron/blob/496b854c059b85d51931ea8c3f209a768708dcc4/nexus/db-queries/src/db/datastore/ip_pool.rs#L108-L151
which is called early in rack_set_initialized
. And right after the authz check, we grab a database connection (explaining the last Database Query
entry above) and then issue a query that uses get_result()
, which, like first()
, produces this "Record not found" error if there were no rows returned by the query. I think it's a good bet this is where we failed. This code is assuming that an internal services IP pool will be present, but maybe it's not?
What's responsible for setting that up? I didn't find anything in nexus/db-queries/src/db/fixed_data
, but I did find:
https://github.com/oxidecomputer/omicron/blob/496b854c059b85d51931ea8c3f209a768708dcc4/nexus/db-queries/src/db/datastore/rack.rs#L550-L588. This is being invoked by the "populate" mini-framework in nexus/src/populate.rs. Did that run successfully? I'm not clear. There are no log messages in this function, unfortunately. The populate framework does not log each populator, either, or when "populate" finishes. However, when creating Nexus
we do spawn a background task to wait for this and that does log a message when it finishes or fails:
https://github.com/oxidecomputer/omicron/blob/496b854c059b85d51931ea8c3f209a768708dcc4/nexus/src/app/mod.rs#L360-L375
I don't see that log message in this log. It seems like the "populate" step maybe got stuck? More investigation is needed here.
I just noticed that the Nexus log that I've been looking at from buildomat doesn't start at the beginning of Nexus's lifetime. Like the CockroachDB log, it must have been rotated and archived already and we don't have that earlier file. That makes it hard to say anything about which populate steps may have run, failed, finished, etc.
There are several bugs/improvements that would help us debug this kind of problem the next time it happens:
Error::internal_context()
in more places, especially uses of first()
, first_async()
, get_result()
and get_result_async()
to add context about why it expected something to be thereWe should definitely fix the buildomat job. I'm not sure if it's worth debugging this further until we do that. We might be able to find more information but it'll probably be a lot clearer with the logs in hand.
(is there a better way to find archived log files?)
These were archived out to the U.2s as part of #3713, I believe. They're still named by zone, so filtering by that would help. I think the dataset they end up in is unpredictable, though the archiver continues to use the same location until it runs out of space.
It unfortunately does not help post-mortem, but if you're debugging a live system, the zone-bundle
tool will fish those log files out for you as well.
$ zone-bundle --host <SLED_AGENT_IP> get --create oxz_nexus_2c05b297-cc92-4a0a-8378-946af887ec91
That command will create a tarball with all the log files, either still in /var/svc/log
or out on those U.2s.
https://github.com/oxidecomputer/omicron/pull/3887/checks?check_run_id=15946171931 failed; the immediate symptom in the deploy job a timeout after trying to log in for 10 minutes:
All three nexus logs have many warning that seem to indicate the database wasn't set up correctly, I think?
but the sled-agent logs do not report any issues with initial CRDB setup: