oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
227 stars 35 forks source link

Test flake: silent failure to initialize cockroachdb? #3889

Open jgallagher opened 11 months ago

jgallagher commented 11 months ago

https://github.com/oxidecomputer/omicron/pull/3887/checks?check_run_id=15946171931 failed; the immediate symptom in the deploy job a timeout after trying to log in for 10 minutes:

673 2023-08-16T15:51:53.331Z    2023-08-16 15:51:52.586682700 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
674 2023-08-16T15:51:54.332Z    2023-08-16 15:51:53.587485600 UTC: attempting to log into API
675 2023-08-16T15:52:09.366Z    2023-08-16 15:52:08.621092329 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
676 2023-08-16T15:52:10.368Z    2023-08-16 15:52:09.622786274 UTC: attempting to log into API
677 2023-08-16T15:52:25.402Z    2023-08-16 15:52:24.656701751 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
678 2023-08-16T15:52:26.403Z    2023-08-16 15:52:25.657578343 UTC: attempting to log into API
679 2023-08-16T15:52:41.436Z    2023-08-16 15:52:40.691037367 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
680 2023-08-16T15:52:42.439Z    Error: logging in
681 2023-08-16T15:52:42.464Z    
682 2023-08-16T15:52:42.489Z    Caused by:
683 2023-08-16T15:52:43.040Z        timed out after 609.325828065s

All three nexus logs have many warning that seem to indicate the database wasn't set up correctly, I think?

WARN    nexus: Cannot look up rack: Object (of type ById(24ed7902-3649-4e0e-8635-cd082ae9b0c0)) not found: rack
    file = nexus/src/app/rack.rs:555

but the sled-agent logs do not report any issues with initial CRDB setup:

328 2023-08-16T15:43:19.640Z    INFO    SledAgent (ServiceManager): Formatting CRDB
    file = sled-agent/src/services.rs:2248
...
330 2023-08-16T15:43:22.432Z    INFO    SledAgent (ServiceManager): Formatting CRDB - Completed
    file = sled-agent/src/services.rs:2269
davepacheco commented 11 months ago

I wanted to look at the CockroachDB SMF log files, but they're all empty. I compared with dogfood and they're empty there, too. There are no rotated files in /var/svc/log. I found #3713 and found some useful log files on dogfood (sled 9):

# find /pool/ext/*/crypt/debug/oxz_cockroachdb* -type f -name '*oxide-cockroach*' -not -size 0
/pool/ext/53460a32-da24-4921-9f57-1d35203741ba/crypt/debug/oxz_cockroachdb_dccfea99-7030-4730-b8f9-f701e9124e7c/oxide-cockroachdb:default.log.1690918415
/pool/ext/790dc7c3-7fcf-41ba-b5ac-83adead857cd/crypt/debug/oxz_cockroachdb_dccfea99-7030-4730-b8f9-f701e9124e7c/oxide-cockroachdb:default.log.1691451373
/pool/ext/f040f5e2-146a-48ef-8abe-610af6dba74f/crypt/debug/oxz_cockroachdb_dccfea99-7030-4730-b8f9-f701e9124e7c/oxide-cockroachdb:default.log.1691799001

(is there a better way to find archived log files?)

These do have the stuff I expected to be there (svc.startd messages, xtrace output from the start method, etc.). We don't have these files from the failed job, unfortunately. It seems like we should add those to the pattern of logs that we have buildomat save on failure.

davepacheco commented 11 months ago

All three nexus logs have many warning that seem to indicate the database wasn't set up correctly, I think?

WARN  nexus: Cannot look up rack: Object (of type ById(24ed7902-3649-4e0e-8635-cd082ae9b0c0)) not found: rack
    file = nexus/src/app/rack.rs:555

I would expect that the "rack" objects are inserted during rack initialization at Nexus (i.e., handoff to Nexus), not when the database is initialized. It's possible that the database was initialized but handoff never happened for some reason. n the Sled Agent log, I do see a lot of failures to handoff to Nexus, including this last one:

15:54:16.561Z INFO SledAgent (RSS): Failed to handoff to nexus: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0", "content-length": "124", "date": "Wed, 16 Aug 2023 15:54:16 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0" }
    file = sled-agent/src/rack_setup/service.rs:625

and I see no indication after that of a successful handoff. So I think that explains the warnings and the ultimate failure after the 10-minute timeout. Next will be to look at the Nexus logs to see why these requests are failing.

davepacheco commented 11 months ago

It looks like of the initialization-complete requests went to the Nexus oxz_nexus_2c05b297-cc92-4a0a-8378-946af887ec91. Here's a summary of all of that Nexus's requests:

4286e358-034e-46fa-8cba-b5f3595dbaa2 PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/f4b4dc87-ab46-49fb-a4b4-d361ae214c03 404 not found: physical-disk with id "vendor = "synthetic-vendor", serial = "synthetic-serial-f4b4dc87-ab46-49fb-a4b4-d361ae214c03", model = "synthetic-model""
65196cac-d1de-4e42-9e3b-ca60fc699abe PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/24b4dc87-ab46-49fb-a4b4-d361ae214c03 200 
f8d78c80-0b74-4886-a9cd-a900ec0abb90 PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/e4b4dc87-ab46-49fb-a4b4-d361ae214c03 404 not found: physical-disk with id "vendor = "synthetic-vendor", serial = "synthetic-serial-e4b4dc87-ab46-49fb-a4b4-d361ae214c03", model = "synthetic-model""
17ffc61a-dbc6-4542-879f-1fe94c0b02bd PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/d462a7f7-b628-40fe-80ff-4e4189e2d62b 404 not found: physical-disk with id "vendor = "synthetic-vendor", serial = "synthetic-serial-d462a7f7-b628-40fe-80ff-4e4189e2d62b", model = "synthetic-model""
cb1121fa-d100-4cfd-91ae-941a87e4064d PUT /physical-disk 200 
e8360586-d6dc-479e-aee3-c7658489af10 PUT /physical-disk 200 
133bf25a-5946-4042-b46d-d8cbd2362e8b PUT /physical-disk 200 
ede72848-d90e-4ea3-b45e-5bd642164579 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
f1e6b7cb-8637-41b0-9e34-40d34465781a PUT /physical-disk 200 
78d16e4b-9ec2-442b-9be5-e2c8b2c12d7f PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/a462a7f7-b628-40fe-80ff-4e4189e2d62b 200 
d1e3f827-eb08-449e-957f-ee6d903c00a1 PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/b462a7f7-b628-40fe-80ff-4e4189e2d62b 200 
93f31b07-9730-4ef2-a72b-b3c7c923901e PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/14b4dc87-ab46-49fb-a4b4-d361ae214c03 200 
90a5e166-21dc-4c90-a4b1-9b3391bfba0f PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/f4b4dc87-ab46-49fb-a4b4-d361ae214c03 200 
3abb72f6-5538-42da-ab4f-35cfc47cfb85 PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/e4b4dc87-ab46-49fb-a4b4-d361ae214c03 200 
2c2481cf-670c-4dcb-9872-36736483502f PUT /sled-agents/56509f70-ee84-4467-a2fd-2019974332e8/zpools/d462a7f7-b628-40fe-80ff-4e4189e2d62b 200 
c5d66dda-ce35-44bb-b952-58d99f9b3a2d PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
3b671939-d2c9-4c52-84d3-7b1e2d3c2edb PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
b70ed9ef-6644-4a1b-813a-f93e0a8c6b0b PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
68eff423-65e6-4338-adaa-b474553377c9 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
258e5501-b9a1-4489-8910-91f2a9dd4bba PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
d7794f35-efa9-4678-ab90-d44558e8ba77 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
ebb29939-3499-4f31-8f07-4ee7fba58a49 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
7e8d8543-fde6-46e1-8b4f-c566310742d3 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
81f11842-46d3-465f-aef7-ac50cde78cfc PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
9105ccab-cfb5-4e81-b5fb-9ccc500d8038 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
85cd68b3-d8bb-43c0-9452-b760d088a4ae PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
a42ac22f-5e98-4696-9d9e-1696ee8a4a7d PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
fbb18155-5ed3-436f-bd32-21380e404198 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
33d16fbc-206a-456f-8453-875400285df5 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found
6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0 PUT /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete 500 unexpected database error: Record not found

and here's the details of one of those 500s on initialization-complete:

15:54:16.560Z INFO 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = unexpected database error: Record not found
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/8ef2bd2/dropshot/src/server.rs:841
    latency_us = 26103
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    response_code = 500
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete

There's not much to go on here. This looks like an error path during rack initialization that's not annotating the error object it got back from the database with anything useful about what it was doing. The string "Record not found" does not appear in Omicron. I guessed this was coming from use of Diesel's first(), which does appear to be the case: https://github.com/diesel-rs/diesel/blob/v2.1.0/diesel/src/result.rs#L310

davepacheco commented 11 months ago

I started looking around for calls to first() to inspect those, but that wasn't very fruitful. There were a bunch and it was hard to tell if they were called from the rack initialization code path.

So I started with the rack initialization code path, which opens here: https://github.com/oxidecomputer/omicron/blob/496b854c059b85d51931ea8c3f209a768708dcc4/nexus/db-queries/src/db/datastore/rack.rs#L398-L416

We can infer something about where the error was generated based on the log entries that we do have for one of the failed requests. I grabbed the request id from the last 500 error and filtered all the messages in the Nexus log for that req_id:

$ grep 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0 nexus.log | looker
15:54:16.535Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.535Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Database
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.537Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.539Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = Modify
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Fleet
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.539Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.539Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Database
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.540Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.542Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = CreateChild
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Fleet
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.542Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.542Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Database
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.547Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.547Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Database
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.548Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.550Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = CreateChild
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Fleet
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.550Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.550Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Database
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.552Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.552Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Database
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.553Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.555Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = CreateChild
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Fleet
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.555Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.555Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Database
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.557Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin")} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.559Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = ListChildren
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = IpPoolList
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.559Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): roles
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    roles = RoleSet { roles: {} }
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.559Z DEBG 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000002, .. })
    actor_id = 001de000-05e4-4000-8000-000000000002
    authenticated = true
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    resource = Database
    result = Ok(())
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete
15:54:16.560Z INFO 2c05b297-cc92-4a0a-8378-946af887ec91 (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = unexpected database error: Record not found
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/8ef2bd2/dropshot/src/server.rs:841
    latency_us = 26103
    local_addr = [fd00:1122:3344:101::b]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:101::1]:44358
    req_id = 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0
    response_code = 500
    uri = /racks/24ed7902-3649-4e0e-8635-cd082ae9b0c0/initialization-complete

We can summarize the actions and results:

$ grep 6b0ecee8-b7f4-4bbd-b0f4-313e56a5cbd0 nexus.log | grep '"authorize result"' | json -ga resource action result
Database Query Ok(())
Fleet Modify Ok(())
Database Query Ok(())
Fleet CreateChild Ok(())
Database Query Ok(())
Database Query Ok(())
Fleet CreateChild Ok(())
Database Query Ok(())
Database Query Ok(())
Fleet CreateChild Ok(())
Database Query Ok(())
IpPoolList ListChildren Ok(())
Database Query Ok(())

That ListChildren on IpPoolList is interesting. That is checked here (possibly other places too, I didn't look): https://github.com/oxidecomputer/omicron/blob/496b854c059b85d51931ea8c3f209a768708dcc4/nexus/db-queries/src/db/datastore/ip_pool.rs#L108-L151

which is called early in rack_set_initialized. And right after the authz check, we grab a database connection (explaining the last Database Query entry above) and then issue a query that uses get_result(), which, like first(), produces this "Record not found" error if there were no rows returned by the query. I think it's a good bet this is where we failed. This code is assuming that an internal services IP pool will be present, but maybe it's not?

What's responsible for setting that up? I didn't find anything in nexus/db-queries/src/db/fixed_data, but I did find: https://github.com/oxidecomputer/omicron/blob/496b854c059b85d51931ea8c3f209a768708dcc4/nexus/db-queries/src/db/datastore/rack.rs#L550-L588. This is being invoked by the "populate" mini-framework in nexus/src/populate.rs. Did that run successfully? I'm not clear. There are no log messages in this function, unfortunately. The populate framework does not log each populator, either, or when "populate" finishes. However, when creating Nexus we do spawn a background task to wait for this and that does log a message when it finishes or fails: https://github.com/oxidecomputer/omicron/blob/496b854c059b85d51931ea8c3f209a768708dcc4/nexus/src/app/mod.rs#L360-L375

I don't see that log message in this log. It seems like the "populate" step maybe got stuck? More investigation is needed here.

davepacheco commented 11 months ago

I just noticed that the Nexus log that I've been looking at from buildomat doesn't start at the beginning of Nexus's lifetime. Like the CockroachDB log, it must have been rotated and archived already and we don't have that earlier file. That makes it hard to say anything about which populate steps may have run, failed, finished, etc.

There are several bugs/improvements that would help us debug this kind of problem the next time it happens:

We should definitely fix the buildomat job. I'm not sure if it's worth debugging this further until we do that. We might be able to find more information but it'll probably be a lot clearer with the logs in hand.

bnaecker commented 11 months ago

(is there a better way to find archived log files?)

These were archived out to the U.2s as part of #3713, I believe. They're still named by zone, so filtering by that would help. I think the dataset they end up in is unpredictable, though the archiver continues to use the same location until it runs out of space.

It unfortunately does not help post-mortem, but if you're debugging a live system, the zone-bundle tool will fish those log files out for you as well.

$ zone-bundle --host <SLED_AGENT_IP> get --create oxz_nexus_2c05b297-cc92-4a0a-8378-946af887ec91

That command will create a tarball with all the log files, either still in /var/svc/log or out on those U.2s.