oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Instance stuck in `starting` later stuck in `stopping` #3451

Open benjaminleonard opened 1 year ago

benjaminleonard commented 1 year ago

On dogfood, I created the instance through Terraform and it got stuck in `starting`. Later, after trying to stop the instance through the CLI, it switched to being stuck in `stopping`.

I also found that, whilst the instance cannot be deleted in these states, I did manage to delete its network interfaces. Should this be prevented, since the network interface cannot be edited on a stopped instance?

You can find the log at: /net/catacomb/data/staff/dogfood/jun-29/38b7d88a-02a8-402a-9088-8d6dd6896a65-stuck-starting.log

benjaminleonard commented 1 year ago

The propolis log can also be found at: /net/catacomb/data/staff/dogfood/jun-29/system-illumos-propolis-server:vm-eedb768f-be35-465f-ae87-5e9e9c477a1c.log

leftwo commented 1 year ago

Looking at the propolis log, I see a bunch of messages about connecting to a specific downstairs:

{"msg":"[1] connecting to [fd00:1122:3344:107::c]:19003","v":0,"name":"crucible","level":30,"time":"2023-06-29T14:18:51.869801593Z","hostname":"oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c","pid":2774,"looper":"1"}

So, it's having trouble connecting to one of the downstairs. Let's find the UUIDs for all the downstairs we did connect to:

grep "downstairs client at" /net/catacomb/data/staff/dogfood/jun-29/system-illumos-propolis-server:vm-eedb768f-be35-465f-ae87-5e9e9c477a1c.log
v{":"msg"0,"name:"":[2] downstairs client at [fd00:1122:3344:104::a]:19013 has UUID 71442910-fa52-4752-89d5-c28e47b8a4da""propolis-server","v":,"0,"name":"crucible",level":30"level":30,"time":"2023-06-29T09:00:40.580986256Z","hostname":,"time":"oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c""2023-06-29T09:00:40.580986707Z",,""hostname":"pid"oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c","pid:2774":2774,"}
,{"name"msg"":"propolis-server":","[1] downstairs client at [fd00:1122:3344:106::9]:19013 has UUID db7ac8e4-f9a8-408f-8724-afba3985555e"level":30,"v":0,"name":"crucible","level":30,"time":"2023-06-29T09:00:40.581072709Z",,""time":hostname":""2023-06-29T09:00:40.581074062Z"oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c","hostname":"oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c","pid","pid":2774}
v{":0,"name":"propolis-server","msg":""level":30[0] downstairs client at [fd00:1122:3344:109::5]:19013 has UUID af2b5b90-4e4f-424b-9203-e898cade6226","v":0,"name":"crucible","level":30,"time":"2023-06-29T09:00:40.581142831Z","hostname":","time":oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c","pid":2774"2023-06-29T09:00:40.581144274Z","hostname":"oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c","pid":2774}
{{"msg":"[1] Starts cmd_loop","v":0,"name":"crucible","level":30"msg":"[2] downstairs client at [fd00:1122:3344:109::c]:19003 has UUID f5d3abb3-6fcf-454d-ae9d-5dc35f6ac6b8","v":0,","time":"name"2023-06-29T09:00:40.60475183Z":"crucible","hostname":","level":30oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c","pid":2774}
09:00:40.604Z INFO crucible: [0] downstairs client at [fd00:1122:3344:10a::6]:19003 has UUID c3ad7373-b8f2-428b-af6a-b60952809701

Pulling the matches out of that blob, we really only have five:

[0] downstairs client at [fd00:1122:3344:10a::6]:19003 has UUID c3ad7373-b8f2-428b-af6a-b60952809701
???
[2] downstairs client at [fd00:1122:3344:109::c]:19003 has UUID f5d3abb3-6fcf-454d-ae9d-5dc35f6ac6b8

[0] downstairs client at [fd00:1122:3344:109::5]:19013 has UUID af2b5b90-4e4f-424b-9203-e898cade6226
[1] downstairs client at [fd00:1122:3344:106::9]:19013 has UUID db7ac8e4-f9a8-408f-8724-afba3985555e
[2] downstairs client at [fd00:1122:3344:104::a]:19013 has UUID 71442910-fa52-4752-89d5-c28e47b8a4da

So, it looks like one of the downstairs never connected! That would mean the upstairs never activated, which would prevent the instance from leaving the `starting` state (thanks to @jordanhendricks for confirming that).
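
As a side note on why this blocks startup: the upstairs won't consider itself active until every downstairs in a target set has connected and negotiated. A minimal sketch of that gating behavior, illustrative only and not Crucible's actual code (the function name and hard-coded addresses below are stand-ins):

use std::net::SocketAddr;
use std::time::Duration;

// Illustrative only: activation requires every downstairs target to be
// reachable, so a single absent downstairs keeps this loop (and the guest
// disk, and the instance start) waiting forever.
fn wait_for_activation(targets: &[SocketAddr]) {
    for addr in targets {
        loop {
            match std::net::TcpStream::connect_timeout(addr, Duration::from_secs(10)) {
                Ok(_) => {
                    println!("downstairs at {addr} reachable");
                    break;
                }
                // Mirrors the repeated "connecting to [fd00:...]:19003" lines
                // in the propolis log: just keep retrying.
                Err(e) => eprintln!("connect timeout to {addr}: {e}"),
            }
        }
    }
    println!("all downstairs connected; upstairs can activate");
}

fn main() {
    let targets: Vec<SocketAddr> = [
        "[fd00:1122:3344:10a::6]:19003",
        "[fd00:1122:3344:107::c]:19003", // the downstairs that never shows up
        "[fd00:1122:3344:109::c]:19003",
    ]
    .iter()
    .map(|s| s.parse().unwrap())
    .collect();
    wait_for_activation(&targets);
}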

To get a little more info about the "missing" downstairs, we need to go back a bit further. We can find the Volume propolis got from Nexus:

grep Volume /net/catacomb/data/staff/dogfood/jun-29/system-illumos-propolis-server:vm-eedb768f-be35-465f-ae87-5e9e9c477a1c.log
{"msg":"Creating Crucible disk from request Volume { id: 0b339155-1dbc-4b0c-8e8b-07abd470d30b, block_size: 512, sub_volumes: [Region { block_size: 512, blocks_per_extent: 131072, extent_count: 160, opts: CrucibleOpts { id: 0b339155-1dbc-4b0c-8e8b-07abd470d30b, target: [[fd00:1122:3344:109::5]:19013, [fd00:1122:3344:106::9]:19013, [fd00:1122:3344:104::a]:19013], lossy: false, flush_timeout: None, key: Some(\"TCoVSmtaSmg+fJxlPvpK8JCWek95ZWIjEx0FmxCixaU=\"), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: false }, gen: 1 }], read_only_parent: Some(Volume { id: 87683b11-ae3e-41c9-9357-9d91ec582492, block_size: 512, sub_volumes: [Region { block_size: 512, blocks_per_extent: 131072, extent_count: 48, opts: CrucibleOpts { id: 364fbfd9-023d-4a04-9cfd-a9eba671b252, target: [[fd00:1122:3344:10a::6]:19003, [fd00:1122:3344:107::c]:19003, [fd00:1122:3344:109::c]:19003], lossy: false, flush_timeout: None, key: Some(\"3rf43pls4ddWh1LQiBbeVe227tmpdjWI7330N8Dn/KA=\"), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: true }, gen: 18 }], read_only_parent: None }) }","v":0,"name":"propolis-server","level":30,"time":"2023-06-29T09:00:40.341916071Z","hostname":"oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c","pid":2774}

Teasing this apart gives us the missing downstairs, though we already knew it from the log message we started with (see the cross-check sketch after the struct below).

Volume {
    id: 0b339155-1dbc-4b0c-8e8b-07abd470d30b,
    block_size: 512,
    sub_volumes: [
        Region {
            block_size: 512,
            blocks_per_extent: 131072,
            extent_count: 160,
            opts: CrucibleOpts {
                id: 0b339155-1dbc-4b0c-8e8b-07abd470d30b,
                target: [
                    [fd00:1122:3344:109::5]:19013,
                    [fd00:1122:3344:106::9]:19013,
                    [fd00:1122:3344:104::a]:19013
                ],
                lossy: false,
                flush_timeout: None,
                 key: Some(\"TCoVSmtaSmg+fJxlPvpK8JCWek95ZWIjEx0FmxCixaU=\"),
                 cert_pem: None,
                 key_pem: None,
                 root_cert_pem: None,
                 control: None,
                 read_only: false
            },   
            gen: 1
        }        
    ],           
    read_only_parent: Some(
        Volume { 
            id: 87683b11-ae3e-41c9-9357-9d91ec582492,
            block_size: 512,
            sub_volumes: [
                Region {
                    block_size: 512,
                    blocks_per_extent: 131072,
                    extent_count: 48,
                    opts: CrucibleOpts {
                        id: 364fbfd9-023d-4a04-9cfd-a9eba671b252,
                        target: [
                            [fd00:1122:3344:10a::6]:19003,
                            [fd00:1122:3344:107::c]:19003,
                            [fd00:1122:3344:109::c]:19003
                        ],
                        lossy: false,
                        flush_timeout: None,
                        key: Some(\"3rf43pls4ddWh1LQiBbeVe227tmpdjWI7330N8Dn/KA=\"),
                        cert_pem: None,
                        key_pem: None,
                        root_cert_pem: None,
                        control: None,
                        read_only: true },
                    gen: 18
                }       
            ],          
            read_only_parent: None
        }   
    )   
}
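
As a cross-check, you can diff the six targets in that Volume against the five addresses that logged a "downstairs client at ... has UUID" message; whatever is left over is the missing downstairs. A throwaway sketch (addresses hard-coded from the logs above):

use std::collections::BTreeSet;

fn main() {
    // The six targets Nexus handed propolis: sub_volumes[0] plus the
    // read_only_parent, copied from the Volume above.
    let expected: BTreeSet<&str> = [
        "[fd00:1122:3344:109::5]:19013",
        "[fd00:1122:3344:106::9]:19013",
        "[fd00:1122:3344:104::a]:19013",
        "[fd00:1122:3344:10a::6]:19003",
        "[fd00:1122:3344:107::c]:19003",
        "[fd00:1122:3344:109::c]:19003",
    ]
    .into_iter()
    .collect();

    // The five addresses that actually announced a UUID in the propolis log.
    let connected: BTreeSet<&str> = [
        "[fd00:1122:3344:10a::6]:19003",
        "[fd00:1122:3344:109::c]:19003",
        "[fd00:1122:3344:109::5]:19013",
        "[fd00:1122:3344:106::9]:19013",
        "[fd00:1122:3344:104::a]:19013",
    ]
    .into_iter()
    .collect();

    // Whatever was expected but never connected is our missing downstairs.
    for missing in expected.difference(&connected) {
        println!("never connected: {missing}");
        // Prints: never connected: [fd00:1122:3344:107::c]:19003
    }
}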

So, the next step is to try and find where [fd00:1122:3344:107::c]:19003 is (or should be). This would have been one of the downstairs for the read-only parent part of the volume.

jordanhendricks commented 1 year ago

When propolis-server receives a request to start the instance, it does a blocking attempt to start all of the VM's entities (ref), which include the VM's devices, such as crucible disks. If one of those entities gets stuck trying to start up, then the instance will be stuck in the Starting state.
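
Schematically, the failure mode is just a sequential, un-timed wait. The trait and function below are illustrative stand-ins, not propolis-server's real types:

// Hypothetical sketch of the blocking start path, not the real propolis code.
trait Entity {
    // Blocks until the entity is ready; for a Crucible disk this includes
    // waiting for the upstairs to activate, which may never happen.
    fn start(&self) -> Result<(), String>;
}

fn start_instance(entities: &[Box<dyn Entity>]) -> Result<(), String> {
    for entity in entities {
        // No deadline here: if one entity's start() hangs, the instance never
        // leaves the Starting state.
        entity.start()?;
    }
    Ok(())
}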

Looking at the propolis server logs, we see that the crucible upstairs got stuck waiting for its downstairs to reach quorum, as one of the downstairs seemed to go away and never return.

For example, we see a lot of pings fail to downstairs 1:

[2023-06-29T09:00:41.609061257Z]  INFO: crucible/2774 on oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c: Checking if live repair is needed
[2023-06-29T09:00:45.608902734Z]  INFO: crucible/2774 on oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c: [0] Received Imok
[2023-06-29T09:00:45.609388039Z]  INFO: crucible/2774 on oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c: [2] Received Imok
[2023-06-29T09:00:50.541888181Z]  INFO: crucible/2774 on oxz_propolis-server_eedb768f-be35-465f-ae87-5e9e9c477a1c: connect timeout (looper=1)

@leftwo is further investigating what happened to that downstairs. But from a propolis-server and crucible upstairs perspective, I think this all makes sense.

This also doesn't explain why the instance got stuck in Stopping, which I am still looking into.

leftwo commented 1 year ago

To find our missing downstairs that should be at [fd00:1122:3344:107::c]:19003, let's look for the sled that has that "107" network:

root@oxz_switch:~# pilot host exec -c 'ipadm | grep fd00:1122:3344:107 || true' 8 9 10 11 12 16 17 24 26 27
 8  BRM44220011        ok: underlay0/sled6   static   ok           fd00:1122:3344:107::1/64
 9  BRM44220005        ok: 
10  BRM42220009        ok: 
11  BRM42220006        ok: 
12  BRM42220057        ok: 
16  BRM42220014        ok: 
17  BRM42220017        ok: 
24  BRM42220081        ok: 
26  BRM42220046        ok: 
27  BRM42220026        ok: 

So, sled 8 is where it "should" be. Over on sled 8, we have 10 crucible zones, one of which has our "::c" address:

BRM44220011 # for zz in $(zoneadm list | grep oxz_crucible); do echo -n "$zz "; zlogin $zz ipadm | grep 107; done
oxz_crucible_oxp_bd5d7d9f-58ca-4350-9083-6a92a6155a65 oxControlService1/omicron6 static ok    fd00:1122:3344:107::3/64
oxz_crucible_oxp_47d274ce-f4cb-4bc8-990a-b1460bd918c6 oxControlService2/omicron6 static ok    fd00:1122:3344:107::4/64
oxz_crucible_oxp_0cf8b90b-1143-4119-9012-1188c92036f2 oxControlService3/omicron6 static ok    fd00:1122:3344:107::5/64
oxz_crucible_oxp_7c1992a0-3f17-4672-b141-61ccab131c16 oxControlService4/omicron6 static ok    fd00:1122:3344:107::6/64
oxz_crucible_oxp_b155e4f4-facd-4a7b-a464-b965fc8e8cf5 oxControlService5/omicron6 static ok    fd00:1122:3344:107::7/64
oxz_crucible_oxp_284ec851-0af6-4b31-a450-99026d5bb283 oxControlService6/omicron6 static ok    fd00:1122:3344:107::8/64
oxz_crucible_oxp_0bf59fa2-4014-4ffd-a0af-edde9833f910 oxControlService7/omicron6 static ok    fd00:1122:3344:107::9/64
oxz_crucible_oxp_871fc0f6-9c8d-4b12-9638-dc49053a9394 oxControlService8/omicron6 static ok    fd00:1122:3344:107::a/64
oxz_crucible_oxp_b58aecad-ed65-4189-aedc-44f44ec1a781 oxControlService9/omicron6 static ok    fd00:1122:3344:107::b/64
oxz_crucible_oxp_d24fb5a5-9347-4d74-bace-3e7fd3d7a80e oxControlService10/omicron6 static ok   fd00:1122:3344:107::c/64

This shows us oxz_crucible_oxp_d24fb5a5-9347-4d74-bace-3e7fd3d7a80e is our zone.

leftwo commented 1 year ago

Something interesting, though I'm not sure whether it's related: this sled has rebooted more recently than the other sleds in the cluster:

root@oxz_switch:~# pilot host exec -c 'uptime' 8 9 10 11 12 16 17 24 26 27
 8  BRM44220011        ok: 17:34:49    up  8:34,  1 user,  load average: 2.95, 2.96, 2.96
 9  BRM44220005        ok: 17:34:49    up 18:42,  0 users,  load average: 1.68, 1.73, 1.71
10  BRM42220009        ok: 17:34:49    up 18:43,  3 users,  load average: 2.05, 2.06, 2.05
11  BRM42220006        ok: 17:34:49    up 18:44,  0 users,  load average: 2.08, 2.16, 2.16
12  BRM42220057        ok: 17:34:50    up 18:44,  0 users,  load average: 1.21, 1.21, 1.21
16  BRM42220014        ok: 17:34:50    up 18:44,  1 user,  load average: 3.34, 3.36, 3.36
17  BRM42220017        ok: 17:34:50    up 18:44,  1 user,  load average: 1.32, 1.36, 1.36
24  BRM42220081        ok: 17:34:50    up 18:44,  0 users,  load average: 1.81, 1.79, 1.77
26  BRM42220046        ok: 17:34:50    up 18:44,  0 users,  load average: 2.91, 2.96, 2.96
27  BRM42220026        ok: 17:34:51    up 18:43,  0 users,  load average: 2.84, 2.82, 2.81
askfongjojo commented 1 year ago

The downstairs were wiped out when the sled-agent crashed and restarted:

09:00:01.266Z INFO SledAgent (PortManager): Mapping virtual NIC to physical host
    mapping = SetVirtualNetworkInterfaceHost { virtual_ip: 172.30.0.5, virtual_mac: MacAddr(MacAddr6([168, 64, 37, 243, 152, 219])), physical_host_ip: fd00:1122:3344:10a::1, vni: Vni(10225803) }
09:00:01.267Z INFO SledAgent (dropshot (SledAgent)): request completed
    local_addr = [fd00:1122:3344:107::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:102::4]:43840
    req_id = bcc4fef9-789c-4e73-a1f7-2f249ecde50c
    response_code = 204
    uri = /v2p/b7e955a4-36e3-4d74-ae3b-ad053ae8097b
09:00:01.816Z INFO SledAgent (InstanceManager): Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:107::2a, prefix: 64 }))
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
    zone = oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4
09:00:02.006Z ERRO SledAgent (InstanceManager): instance setup failed: Err(ZoneEnsureAddress(EnsureAddressError(EnsureAddressError { zone: "oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4", request: Static(V6(Ipv6Network { addr: fd00:1122:3344:107::2a, prefix: 64 })), name: AddrObject { interface: "oxControlInstance8", name: "omicron6" }, err: Zone execution error: Command [/usr/sbin/zlogin oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4 /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance8/ll] executed and failed with status: exit status: 1  stdout:
      stderr: ipadm: Could not create address: Addrconf already in progress

    Caused by:
        Command [/usr/sbin/zlogin oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4 /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance8/ll] executed and failed with status: exit status: 1  stdout:
          stderr: ipadm: Could not create address: Addrconf already in progress })))
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
09:00:02.007Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
    state = InstanceRuntimeState { run_state: Failed, sled_id: 7230a95e-44ac-42ef-8dbd-1183d39193c7, propolis_id: d00f74ec-80ea-4419-80e9-ec9b3bbf83f4, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:107::2a]:12400), migration_id: None, propolis_gen: Generation(1), ncpus: InstanceCpuCount(4), memory: ByteCount(2147483648), hostname: "web-instance-2", gen: Generation(3), time_updated: 2023-06-29T09:00:02.006892393Z }
09:00:02.052Z INFO SledAgent (dropshot (SledAgent)): request completed
    error_message_external = Internal Server Error
    error_message_internal = Failed to create address Static(V6(Ipv6Network { addr: fd00:1122:3344:107::2a, prefix: 64 })) with name oxControlInstance8/omicron6 in oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4: Zone execution error: Command [/usr/sbin/zlogin oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4 /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance8/ll] executed and failed with status: exit status: 1  stdout: \n  stderr: ipadm: Could not create address: Addrconf already in progress
    local_addr = [fd00:1122:3344:107::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:102::4]:38064
    req_id = 6206c886-bea7-4e1f-8126-54c12ea873e0
    response_code = 500
    uri = /instances/f1e6ed32-cb42-4b71-a7ac-893ac46467f1/state
09:00:02.111Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    local_addr = [fd00:1122:3344:107::1]:12345
    remote_addr = [fd00:1122:3344:102::4]:41840
09:00:02.111Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4': uninstall operation is invalid for shutting_down zones.")) }', sled-agent/src/instance.rs:535:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Jun 29 09:00:10 Stopping because all processes in service exited. ]
[ Jun 29 09:00:10 Executing stop method (:kill). ]
[ Jun 29 09:00:10 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Jun 29 09:00:10 Method "start" exited with status 0. ]
note: configured to log to "/dev/stdout"
09:00:12.732Z INFO SledAgent: Starting mg-ddm service
09:00:12.798Z INFO SledAgent: Importing mg-ddm service
    path = /opt/oxide/mg-ddm/pkg/ddm/manifest.xml
09:00:13.023Z INFO SledAgent: Setting mg-ddm interfaces
    interfaces = ("cxgbe0/ll" "cxgbe1/ll")
09:00:13.044Z INFO SledAgent: Enabling mg-ddm service
09:00:13.070Z INFO SledAgent: setting up bootstrap agent server
09:00:13.166Z INFO SledAgent: Ensuring zfs key directory exists
    path = /var/run/oxide/
09:00:13.582Z INFO SledAgent: Sending prefix to ddmd for advertisement
    DdmAdminClient = [::1]:8000
    prefix = Ipv6Prefix { addr: fdb0:a840:2504:3d5::, len: 64 }
09:00:13.688Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_ntp
09:00:13.703Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_bd5d7d9f-58ca-4350-9083-6a92a6155a65
09:00:13.714Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_47d274ce-f4cb-4bc8-990a-b1460bd918c6
09:00:13.741Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_0cf8b90b-1143-4119-9012-1188c92036f2
09:00:13.756Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_7c1992a0-3f17-4672-b141-61ccab131c16
09:00:13.773Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_b155e4f4-facd-4a7b-a464-b965fc8e8cf5
09:00:13.787Z WARN SledAgent: Deleting existing zone
...

Aside from the delete-all-zones behavior (which is already being worked on), we probably also need to deal with the fact that the sled-agent crashed in the face of an incompatible-state error ("uninstall operation is invalid for shutting_down zones"). The error handling could be less catastrophic.
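
For illustration, a hedged sketch of what less catastrophic handling around the panic site could look like; the names here (AdmError variants, halt_and_remove) are stand-ins, not omicron's real APIs:

// Hypothetical sketch only: real error types and helpers in sled-agent differ.
#[derive(Debug)]
enum AdmError {
    // e.g. "uninstall operation is invalid for shutting_down zones"
    InvalidStateTransition(String),
    Other(String),
}

fn halt_and_remove(zone: &str) -> Result<(), AdmError> {
    // Would shell out to `zoneadm halt` / `zoneadm uninstall` here.
    Err(AdmError::InvalidStateTransition(format!(
        "zone '{zone}': uninstall operation is invalid for shutting_down zones"
    )))
}

fn cleanup_zone(zone: &str) {
    // Instead of .unwrap() (which panics the whole sled-agent, and on restart
    // the agent currently deletes every zone on the sled), treat an
    // in-progress shutdown as retryable, or at worst log and move on.
    match halt_and_remove(zone) {
        Ok(()) => println!("removed zone {zone}"),
        Err(AdmError::InvalidStateTransition(msg)) => {
            eprintln!("zone {zone} is still shutting down; retry later: {msg}")
        }
        Err(err) => eprintln!("failed to remove zone {zone}: {err:?}"),
    }
}

fn main() {
    cleanup_zone("oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4");
}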