oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
239 stars 34 forks source link

image upload failed during saga step creating snapshot #3144

Open leftwo opened 1 year ago

leftwo commented 1 year ago

On dogfood, an image upload failed during the snapshot create step:

Here is a piece of the nexus log:

23:53:01.820Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): dataset Dataset { identity: DatasetIdentity { id: b14699c0-ce53-42d4-83e0-18609dfb210d, time_created: 2023-05-17T21:22:44.340490Z, time_modified: 2023-05-17T21:22:44.340490Z }, time_deleted: None, rcgen: Generation(Generation(1)), pool_id: 3e06a6f9-b7f3-498a-a9bb-0a56a48fdb4e, ip: fd00:1122:3344:103::c, port: SqlU16(32345), kind: Crucible, size_used: Some(1073741824) } region Region { identity: RegionIdentity { id: 9c1a43cd-3e4a-47f7-8742-2b721abd7ce1, time_created: 2023-05-17T23:52:59.725622Z, time_modified: 2023-05-17T23:52:59.725622Z }, dataset_id: b14699c0-ce53-42d4-83e0-18609dfb210d, volume_id: f3d4cf5f-8d22-4e80-a9a6-71f56be443a6, block_size: ByteCount(ByteCount(512)), blocks_per_extent: 131072, extent_count: 16 } url http://[fd00:1122:3344:103::c]:32345
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
    saga_name = finalize-disk
23:53:01.820Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): crucible region Region { block_size: 512, cert_pem: None, encrypted: true, extent_count: 16, extent_size: 131072, id: RegionId("9c1a43cd-3e4a-47f7-8742-2b721abd7ce1"), key_pem: None, port_number: 19000, root_pem: None, state: Created }
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
    saga_name = finalize-disk
23:53:01.851Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): crucible snapshot Snapshot { created: 2023-05-17T23:53:01Z, name: "30054b49-866e-4ed3-8cf7-8cb2d2d85971" }
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
    saga_name = finalize-disk
23:53:01.880Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): crucible running snapshot RunningSnapshot { id: RegionId("9c1a43cd-3e4a-47f7-8742-2b721abd7ce1"), name: "30054b49-866e-4ed3-8cf7-8cb2d2d85971", port_number: 19001, state: Requested }
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
    saga_name = finalize-disk
23:53:01.880Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): map [fd00:1122:3344:103::c]:19000 to [fd00:1122:3344:103::c]:19001
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
    saga_name = finalize-disk
23:53:01.881Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): dataset Dataset { identity: DatasetIdentity { id: b21bdeb3-e4e4-4c6e-b433-c75f21fe9b97, time_created: 2023-05-17T21:22:44.340486Z, time_modified: 2023-05-17T21:22:44.340486Z }, time_deleted: None, rcgen: Generation(Generation(1)), pool_id: 1c2e8947-be30-4726-8534-87ab89c764af, ip: fd00:1122:3344:106::b, port: SqlU16(32345), kind: Crucible, size_used: Some(1073741824) } region Region { identity: RegionIdentity { id: 6c1efbba-0382-4d3c-b9ce-bd66fd523348, time_created: 2023-05-17T23:52:59.725622Z, time_modified: 2023-05-17T23:52:59.725622Z }, dataset_id: b21bdeb3-e4e4-4c6e-b433-c75f21fe9b97, volume_id: f3d4cf5f-8d22-4e80-a9a6-71f56be443a6, block_size: ByteCount(ByteCount(512)), blocks_per_extent: 131072, extent_count: 16 } url http://[fd00:1122:3344:106::b]:32345
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
    saga_name = finalize-disk
23:53:01.882Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): crucible region Region { block_size: 512, cert_pem: None, encrypted: true, extent_count: 16, extent_size: 131072, id: RegionId("6c1efbba-0382-4d3c-b9ce-bd66fd523348"), key_pem: None, port_number: 19000, root_pem: None, state: Created }
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
    saga_name = finalize-disk
23:53:01.882Z DEBG 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): saga log event
    new_state = N018 failed
    sec_id = 8d4690c2-85c5-414b-b462-c42193c6324f
23:53:01.882Z DEBG 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: String("Error Response: status: 404 Not Found; headers: {\\"content-type\\": \\"application/json\\", \\"x-request-id\\": \\"688a4deb-6872-4be8-87c3-85b5332d97b2\\", \\"content-length\\": \\"84\\", \\"date\\": \\"Wed, 17 May 2023 23:53:01 GMT\\"}; value: Error { error_code: None, message: \\"Not Found\\", request_id: \\"688a4deb-6872-4be8-87c3-85b5332d97b2\\" }") })
    node_id = 18
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
23:53:01.884Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): update for saga cached state
    new_state = Unwinding
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
    sec_id = 8d4690c2-85c5-414b-b462-c42193c6324f
23:53:01.884Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): updating state
    new_state = unwinding
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
leftwo commented 1 year ago

Full nexus log and abbreviated log here: /net/catacomb/data/staff/core/rack2/omicron-3144

leftwo commented 1 year ago

Pantry log at the same location: /net/catacomb/data/staff/core/rack2/omicron-3144

leftwo commented 1 year ago

The downstairs from the log message just before the reported failure:

23:53:01.882Z INFO 8d4690c2-85c5-414b-b462-c42193c6324f (ServerContext): crucible region Region { block_size: 512, cert_pem: None, encrypted: true, extent_count: 16, extent_size: 131072, id: RegionId("6c1efbba-0382-4d3c-b9ce-bd66fd523348"), key_pem: None, port_number: 19000, root_pem: None, state: Created }
    saga_id = fb1cc192-95a2-4645-89eb-51af104e0851
    saga_name = finalize-disk

Does not show any indication of a problem. (This was on sled 11)

root@oxz_crucible_oxp_1c2e8947-be30-4726-8534-87ab89c764af:~# cat $(svcs -L downstairs:downstairs-6c1efbba-0382-4d3c-b9ce-bd66fd523348) 
[ May 17 16:52:59 Disabled. ]
[ May 17 16:52:59 Enabled. ]
[ May 17 16:52:59 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/downstairs.sh"). ]
May 17 23:52:59.991 INFO Opened existing region file "/data/regions/6c1efbba-0382-4d3c-b9ce-bd66fd523348/region.json"
May 17 23:53:00.000 INFO UUID: 6c1efbba-0382-4d3c-b9ce-bd66fd523348
May 17 23:53:00.000 INFO Blocks per extent:131072 Total Extents: 16
May 17 23:53:00.000 INFO Crucible Version: Crucible Version: 0.0.1
Commit SHA: 9f69deab230079a5bae4a10a099d7fc1f3d29df7
Commit timestamp: 2023-05-10T15:02:44.000000000Z  branch: main
rustc: 1.66.1 stable x86_64-unknown-illumos
Cargo: x86_64-unknown-illumos  Debug: false Opt level: 3, task: main
May 17 23:53:00.000 INFO Upstairs <-> Downstairs Message Version: 2, task: main
May 17 23:53:00.000 INFO Using address: [fd00:1122:3344:106::b]:19000, task: main
May 17 23:53:00.000 INFO Repair listens on [fd00:1122:3344:106::b]:23000, task: repair
May 17 23:53:00.000 INFO listening, local_addr: [fd00:1122:3344:106::b]:23000, task: repair
May 17 23:53:00.001 INFO Using repair address: [fd00:1122:3344:106::b]:23000, task: main
May 17 23:53:00.001 INFO No SSL acceptor configured, task: main
May 17 23:53:00.001 INFO listening on [fd00:1122:3344:106::b]:19000, task: main
May 17 23:53:00.186 INFO accepted connection from [fd00:1122:3344:104::11]:34520, task: main
May 17 23:53:00.187 INFO Connection request from ec4728c6-f624-4595-88d7-48a8711846d3 with version 2, task: proc
May 17 23:53:00.187 INFO upstairs UpstairsConnection { upstairs_id: ec4728c6-f624-4595-88d7-48a8711846d3, session_id: e6617bc3-6daa-4236-88d2-76f32d731d9b, gen: 1 } connected, version 2, task: proc
May 17 23:53:00.187 INFO UpstairsConnection { upstairs_id: ec4728c6-f624-4595-88d7-48a8711846d3, session_id: e6617bc3-6daa-4236-88d2-76f32d731d9b, gen: 1 } is now active (read-write)
Current flush_numbers [0..12]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
May 17 23:53:00.188 INFO Downstairs has completed Negotiation, task: proc
May 17 23:53:02.327 INFO Flush and snap request received

And, the snapshot is present:

root@oxz_crucible_oxp_1c2e8947-be30-4726-8534-87ab89c764af:~# zfs list -t snapshot
NAME                                                                                                                                  USED  AVAIL     REFER  MOUNTPOINT
oxp_1c2e8947-be30-4726-8534-87ab89c764af/crucible/regions/6c1efbba-0382-4d3c-b9ce-bd66fd523348@30054b49-866e-4ed3-8cf7-8cb2d2d85971   526K      -     1.36M  -
root@oxz_crucible_oxp_1c2e8947
leftwo commented 1 year ago

From the pantry log, I see the pantry connect to the three downstairs for this request:

May 17 23:53:00.187 INFO [0] client is_active_req TRUE, promote! session e6617bc3-6daa-4236-88d2-76f32d731d9b
May 17 23:53:00.187 INFO [2] downstairs client at [fd00:1122:3344:108::6]:19000 has UUID 641e8183-7ca6-42de-90d1-c78214db6db4
May 17 23:53:00.187 INFO [2] Got region def RegionDefinition { block_size: 512, extent_size: Block { value: 131072, shift: 9 }, extent_count: 16, uuid: 641e8183-7ca6-42de-90d1-c78214db6db4, encrypted: true }
May 17 23:53:00.187 INFO ec4728c6-f624-4595-88d7-48a8711846d3 WaitActive WaitActive WaitActive
May 17 23:53:00.187 INFO [1] downstairs client at [fd00:1122:3344:106::b]:19000 has UUID 6c1efbba-0382-4d3c-b9ce-bd66fd523348
May 17 23:53:00.187 INFO [1] Got region def RegionDefinition { block_size: 512, extent_size: Block { value: 131072, shift: 9 }, extent_count: 16, uuid: 6c1efbba-0382-4d3c-b9ce-bd66fd523348, encrypted: true }
May 17 23:53:00.187 INFO ec4728c6-f624-4595-88d7-48a8711846d3 WaitActive WaitActive WaitActive
May 17 23:53:00.187 INFO [0] downstairs client at [fd00:1122:3344:103::c]:19000 has UUID 9c1a43cd-3e4a-47f7-8742-2b721abd7ce1
May 17 23:53:00.187 INFO [0] Got region def RegionDefinition { block_size: 512, extent_size: Block { value: 131072, shift: 9 }, extent_count: 16, uuid: 9c1a43cd-3e4a-47f7-8742-2b721abd7ce1, encrypted: true }
May 17 23:53:00.187 INFO ec4728c6-f624-4595-88d7-48a8711846d3 WaitActive WaitActive WaitActive

That session finishes without incident:

May 17 23:53:00.189 INFO ec4728c6-f624-4595-88d7-48a8711846d3 is now active with session: e6617bc3-6daa-4236-88d2-76f32d731d9b
May 17 23:53:00.189 INFO ec4728c6-f624-4595-88d7-48a8711846d3 Set Active after no repair
May 17 23:53:00.189 INFO Notify all downstairs, region set compare is done.
May 17 23:53:00.189 INFO [2] received reconcile message
May 17 23:53:00.189 INFO [2] All repairs completed, exit
May 17 23:53:00.189 INFO [2] Starts cmd_loop
May 17 23:53:00.189 INFO [0] received reconcile message
May 17 23:53:00.189 INFO [1] received reconcile message
May 17 23:53:00.189 INFO [0] All repairs completed, exit
May 17 23:53:00.189 INFO [1] All repairs completed, exit
May 17 23:53:00.189 INFO [1] Starts cmd_loop
May 17 23:53:00.189 INFO [0] Starts cmd_loop
May 17 23:53:00.549 INFO accepted connection, remote_addr: [fd00:1122:3344:104::4]:33112, local_addr: [fd00:1122:3344:104::11]:17000, component: dropshot
May 17 23:53:00.613 INFO request completed, response_code: 204, uri: /crucible/pantry/0/volume/ec4728c6-f624-4595-88d7-48a8711846d3/bulk_write, method: POST, req_id: b1856f50-2aed-480a-a76d-94330889acc5, remote_addr: [fd00:1122:3344:104::4]:33112, local_addr: [fd00:1122:3344:104::11]:17000, component: dropshot
May 17 23:53:01.741 INFO accepted connection, remote_addr: [fd00:1122:3344:104::4]:62457, local_addr: [fd00:1122:3344:104::11]:17000, component: dropshot
May 17 23:53:01.741 INFO flush with snap requested
May 17 23:53:01.796 INFO request completed, response_code: 204, uri: /crucible/pantry/0/volume/ec4728c6-f624-4595-88d7-48a8711846d3/snapshot, method: POST, req_id: 2f1a6cf0-c0e8-45f3-af0a-174bae4f56f7, remote_addr: [fd00:1122:3344:104::4]:62457, local_addr: [fd00:1122:3344:104::11]:17000, component: dropshot