oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Can't create a disk larger than 1023 GiB #3129

Open leftwo opened 1 year ago

leftwo commented 1 year ago

In testing on the dogfood rack, I found I could create a disk of up to 1023 GiB, but anything larger fails because the crucible downstairs cannot create the region.

leftwo commented 1 year ago

Nexus log here: /net/catacomb/data/staff/core/rack2/omicron-3129/nexus.log

The Nexus log ends with this:

19:36:01.707Z INFO d24831fc-5e19-45e7-8f05-b5fc2a9f0af4 (ServerContext): request completed
    error_message_external = Internal Server Error
    error_message_internal = saga error at node "regions_ensure": Failed to create region, unexpected state: Failed
    local_addr = 172.30.1.5:443
    method = POST
    remote_addr = 172.20.16.118:61488
    req_id = 97f6ccf1-ccb7-400c-8fe0-db1bce85f7ea
    response_code = 500
    uri = https://venus.oxide-preview.com/v1/disks?project=myproj

So, to figure out what happened, we have to find the agent log for the region that failed to create.

leftwo commented 1 year ago

As part of region creation, the dataset is printed in the log:

        Dataset {
            identity: DatasetIdentity {
                id: d61021d3-266b-4d63-bc8a-c4aa9cb95772,
                time_created: 2023-05-14T16:51:20.417101Z,
                time_modified: 2023-05-14T16:51:20.417101Z,
            },  
            time_deleted: None,
            rcgen: Generation(
                Generation(
                    1,
                ),  
            ),  
            pool_id: eae1e475-faa3-4b57-bb32-f4da8ee0fe20,
            ip: fd00:1122:3344:105::b,
            port: SqlU16(
                32345,
            ),
            kind: Crucible,
            size_used: Some(
                1503238553600,
            ),
        },

That has a pool ID: eae1e475-faa3-4b57-bb32-f4da8ee0fe20
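For longer logs, the pool ID can be pulled out with a small script instead of by eye. This is only a sketch: the function name `pool_ids` is made up for illustration, and it assumes the log contains Debug-formatted `Dataset` dumps shaped like the one above. The `oxp_` prefix is taken from the dataset names that `zfs list` reports for this pool.

```python
import re

# Matches the UUID that follows "pool_id:" in a Debug-formatted Dataset dump.
POOL_ID = re.compile(
    r"pool_id:\s*([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def pool_ids(log_text):
    """Return every pool_id found in the text, in order of appearance."""
    return POOL_ID.findall(log_text)


# Excerpt of the dump from the Nexus log.
SAMPLE = """
        Dataset {
            identity: DatasetIdentity {
                id: d61021d3-266b-4d63-bc8a-c4aa9cb95772,
            },
            pool_id: eae1e475-faa3-4b57-bb32-f4da8ee0fe20,
            kind: Crucible,
        },
"""

ids = pool_ids(SAMPLE)
# Name of the pool as it shows up in `zfs list` on the sled.
zpool_names = ["oxp_" + pool_id for pool_id in ids]
print(zpool_names)
```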

From the switch zone on the dogfood rack, we can search all hosts for one that has a ZFS filesystem with that pool ID:

BRM42220051-switch # pilot host exec -c "zfs list | grep eae1e475-faa3-4b57-bb32-f4da8ee0f || true" 8 10 11 12 13 16 17 18 19 21 23 24 26
 8  BRM44220011        ok: 
10  BRM42220009        ok: 
11  BRM42220057        ok: 
12  BRM42220018        ok: 
13  BRM44220005        ok: 
16  BRM42220086        ok: 
17  BRM42220006        ok: 
18  BRM44220004        ok: 
19  BRM42220017        ok: 
21  BRM42220026        ok: 
23  BRM42220031        ok: 
24  BRM42220014        ok: oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20                                                             1.12G  2.81T       24K  /oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20
oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20/crucible                                                     885M  2.81T     26.5K  /data
oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20/crucible/regions                                             885M  2.81T       27K  /data/regions
oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20/crucible/regions/04d7254c-081e-4366-9258-35d953bc0418        227M  2.81T      227M  /data/regions/04d7254c-081e-4366-9258-35d953bc0418
oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20/crucible/regions/b18bcc64-5984-4daf-a590-f569903874ae        329M  2.81T      329M  /data/regions/b18bcc64-5984-4daf-a590-f569903874ae
oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20/crucible/regions/d9949eca-192b-488f-b623-fbd4a96ac8b0        329M  2.81T      329M  /data/regions/d9949eca-192b-488f-b623-fbd4a96ac8b0
oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20/zone                                                         233M  2.81T       34K  /pool/ext/eae1e475-faa3-4b57-bb32-f4da8ee0fe20/zone
oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20/zone/oxz_crucible_oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20   233M  2.81T      233M  /pool/ext/eae1e475-faa3-4b57-bb32-f4da8ee0fe20/zone/oxz_crucible_oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20
26  BRM42220016        ok: 

This tells us that cubby 24 (BRM42220014) has our pool. We can also tell that the crucible zone is oxz_crucible_oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20.
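The same lookup can be done mechanically on saved `pilot host exec` output. The sketch below assumes the output format seen above (a `<cubby> <serial> ok:` prefix per host, with any additional output lines unprefixed); that format is inferred from this one sample only, and hosts reporting a status other than `ok:` are not handled. The helper names are hypothetical.

```python
import re

# "<cubby>  <serial>  ok: <first line of output, possibly empty>"
HOST_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+ok:\s*(.*)$")
# A dataset whose last path component is a crucible zone.
ZONE_NAME = re.compile(r"/zone/(oxz_crucible_\S+)$")


def hosts_with_output(text):
    """Map (cubby, serial) to its output lines, keeping only hosts that printed something."""
    results = {}
    current = None
    for line in text.splitlines():
        m = HOST_LINE.match(line)
        if m:
            current = (int(m.group(1)), m.group(2))
            results[current] = []
            if m.group(3).strip():
                results[current].append(m.group(3).strip())
        elif current is not None and line.strip():
            # Unprefixed line: more output from the most recent host.
            results[current].append(line.strip())
    return {host: lines for host, lines in results.items() if lines}


def crucible_zones(zfs_lines):
    """Extract crucible zone names from `zfs list` rows (dataset name is column one)."""
    zones = set()
    for line in zfs_lines:
        m = ZONE_NAME.search(line.split()[0])
        if m:
            zones.add(m.group(1))
    return zones


# Abbreviated copy of the output above.
SAMPLE = """\
23  BRM42220031        ok:
24  BRM42220014        ok: oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20  1.12G  2.81T  24K  /oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20
oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20/zone/oxz_crucible_oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20  233M  2.81T  233M  /pool/ext/eae1e475-faa3-4b57-bb32-f4da8ee0fe20/zone/oxz_crucible_oxp_eae1e475-faa3-4b57-bb32-f4da8ee0fe20
26  BRM42220016        ok:
"""

hits = hosts_with_output(SAMPLE)
for (cubby, serial), lines in hits.items():
    print(cubby, serial, sorted(crucible_zones(lines)))
```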

Going to that gimlet, then logging into the zone, I can access the agent logs, and from there, find our error:

May 15 19:35:26.110 INFO accepted connection, remote_addr: [fd00:1122:3344:10a::4]:58792, local_addr: [fd00:1122:3344:105::b]:32345, component: dropshot
May 15 19:35:26.111 INFO region b18bcc64-5984-4daf-a590-f569903874ae state: Requested, component: datafile
May 15 19:35:26.111 INFO request completed, response_code: 200, uri: /crucible/0/regions, method: POST, req_id: b90d8ddc-1ef2-411b-8725-b91f9b101f19, remote_addr: [fd00:1122:3344:10a::4]:58792, local_addr: [fd00:1122:3344:105::b]:32345, component: dropshot
May 15 19:35:26.229 INFO creating region Region { id: RegionId("b18bcc64-5984-4daf-a590-f569903874ae"), state: Requested, block_size: 4096, extent_size: 16384, extent_count: 22400, encrypted: true, port_number: 19002, cert_pem: None, key_pem: None, root_pem: None } at "/data/regions/b18bcc64-5984-4daf-a590-f569903874ae", region: b18bcc64-5984-4daf-a590-f569903874ae, component: worker
May 15 19:35:26.281 INFO request completed, response_code: 200, uri: /crucible/0/regions, method: POST, req_id: 5d83a4ff-3add-4900-96cb-d78949386374, remote_addr: [fd00:1122:3344:10a::4]:58792, local_addr: [fd00:1122:3344:105::b]:32345, component: dropshot
May 15 19:35:26.536 INFO request completed, response_code: 200, uri: /crucible/0/regions, method: POST, req_id: 6ac8d15a-a55d-4905-a81a-a386cbef60c8, remote_addr: [fd00:1122:3344:10a::4]:58792, local_addr: [fd00:1122:3344:105::b]:32345, component: dropshot
May 15 19:35:27.214 INFO request completed, response_code: 200, uri: /crucible/0/regions, method: POST, req_id: 83491342-7a6c-47e8-8f20-98931333f2dc, remote_addr: [fd00:1122:3344:10a::4]:58792, local_addr: [fd00:1122:3344:105::b]:32345, component: dropshot
May 15 19:35:28.696 INFO request completed, response_code: 200, uri: /crucible/0/regions, method: POST, req_id: b1ec34e6-9681-4727-bc72-a7f82eba764a, remote_addr: [fd00:1122:3344:10a::4]:58792, local_addr: [fd00:1122:3344:105::b]:32345, component: dropshot
May 15 19:35:33.704 INFO request completed, response_code: 200, uri: /crucible/0/regions, method: POST, req_id: cb688c4c-f9c2-4eb1-a2d7-84fc7ac0e4eb, remote_addr: [fd00:1122:3344:10a::4]:58792, local_addr: [fd00:1122:3344:105::b]:32345, component: dropshot
May 15 19:35:42.158 INFO request completed, response_code: 200, uri: /crucible/0/regions, method: POST, req_id: 9fac29b1-8051-4c34-abdb-c7640ac7cb52, remote_addr: [fd00:1122:3344:10a::4]:58792, local_addr: [fd00:1122:3344:105::b]:32345, component: dropshot
May 15 19:35:47.743 ERRO downstairs create failed: out "" err "May 15 19:35:26.258 INFO Created new region file \"/data/regions/b18bcc64-5984-4daf-a590-f569903874ae/region.json\"\nError: Too many open files (os error 24)\n", region: b18bcc64-5984-4daf-a590-f569903874ae, component: worker
May 15 19:35:47.743 ERRO region "b18bcc64-5984-4daf-a590-f569903874ae" create failed: region files create failure, component: worker
May 15 19:35:47.743 INFO region b18bcc64-5984-4daf-a590-f569903874ae state: Requested -> Failed, component: datafile
May 15 19:35:47.743 INFO applying SMF actions post create..., component: worker
May 15 19:35:47.760 INFO SMF ok!, component: worker
May 15 19:35:57.250 INFO request completed, response_code: 200, uri: /crucible/0/regions, method: POST, req_id: 1e4fd33c-4184-4c5a-ae32-ead8dd8c820f, remote_addr: [fd00:1122:3344:10a::4]:58792, local_addr: [fd00:1122:3344:105::b]:32345, component: dropshot
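Most of that log is routine `request completed` noise. A small filter can surface just the errors and the region state transitions. This is a sketch that assumes the `<Mon DD HH:MM:SS.mmm> <LEVEL> <message>` layout seen above; the function name is hypothetical.

```python
import re

# "May 15 19:35:47.743 ERRO message..."
LOG_LINE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}\.\d{3}) (\w+) (.*)$")
# "region <id> state: Requested -> Failed"
TRANSITION = re.compile(r"region (\S+) state: (\w+) -> (\w+)")


def errors_and_transitions(lines):
    """Return (errors, transitions) found in agent log lines."""
    errors, transitions = [], []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        timestamp, level, message = m.groups()
        if level == "ERRO":
            errors.append((timestamp, message))
        t = TRANSITION.search(message)
        if t:
            transitions.append((timestamp,) + t.groups())
    return errors, transitions


# Three lines copied from the agent log above.
SAMPLE = [
    'May 15 19:35:26.111 INFO region b18bcc64-5984-4daf-a590-f569903874ae state: Requested, component: datafile',
    'May 15 19:35:47.743 ERRO region "b18bcc64-5984-4daf-a590-f569903874ae" create failed: region files create failure, component: worker',
    'May 15 19:35:47.743 INFO region b18bcc64-5984-4daf-a590-f569903874ae state: Requested -> Failed, component: datafile',
]

errors, transitions = errors_and_transitions(SAMPLE)
for timestamp, message in errors:
    print("ERROR", timestamp, message)
for timestamp, region, old, new in transitions:
    print("STATE", timestamp, region, old, "->", new)
```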

The old "Too many open files" problem: the downstairs hit its open file limit (os error 24) while creating the region.
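To see why disk size matters here, the region parameters from the "creating region" log line can be turned into an extent count. The sketch below assumes `extent_size` is measured in blocks; under that reading the computed region size equals the `size_used` value in the Dataset dump exactly. It further assumes the downstairs keeps roughly one file open per extent while creating the region, which is a guess from the error rather than something the logs confirm, and the actual descriptor limit in the zone is not shown anywhere above.

```python
GIB = 1024 ** 3

# Values copied from the "creating region" agent log line.
BLOCK_SIZE = 4096      # bytes per block
EXTENT_SIZE = 16384    # blocks per extent (assumed unit)
EXTENT_COUNT = 22400   # extents in the failing region


def extent_bytes(block_size, extent_size):
    """Bytes covered by one extent."""
    return block_size * extent_size


def region_bytes(block_size, extent_size, extent_count):
    """Total bytes covered by a region."""
    return extent_bytes(block_size, extent_size) * extent_count


def extents_for(size_bytes, block_size, extent_size):
    """Extents needed to cover size_bytes, rounded up."""
    per_extent = extent_bytes(block_size, extent_size)
    return -(-size_bytes // per_extent)


size = region_bytes(BLOCK_SIZE, EXTENT_SIZE, EXTENT_COUNT)
print(size, size // GIB)                                # 1503238553600 1400
print(extents_for(1023 * GIB, BLOCK_SIZE, EXTENT_SIZE))  # 16368
print(extents_for(1024 * GIB, BLOCK_SIZE, EXTENT_SIZE))  # 16384
```

With 64 MiB extents, every additional GiB of disk adds 16 extents, so a per-process open file limit would translate directly into a maximum disk size if the one-file-per-extent assumption holds.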