oxidecomputer / omicron

Omicron: Oxide control plane

Sled Agent crashed in response to failed instance request #3454

Open smklein opened 1 year ago

smklein commented 1 year ago
The downstairs were wiped out when the sled-agent crashed and restarted:
09:00:01.266Z INFO SledAgent (PortManager): Mapping virtual NIC to physical host
    mapping = SetVirtualNetworkInterfaceHost { virtual_ip: 172.30.0.5, virtual_mac: MacAddr(MacAddr6([168, 64, 37, 243, 152, 219])), physical_host_ip: fd00:1122:3344:10a::1, vni: Vni(10225803) }
09:00:01.267Z INFO SledAgent (dropshot (SledAgent)): request completed
    local_addr = [fd00:1122:3344:107::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:102::4]:43840
    req_id = bcc4fef9-789c-4e73-a1f7-2f249ecde50c
    response_code = 204
    uri = /v2p/b7e955a4-36e3-4d74-ae3b-ad053ae8097b
09:00:01.816Z INFO SledAgent (InstanceManager): Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:107::2a, prefix: 64 }))
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
    zone = oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4
09:00:02.006Z ERRO SledAgent (InstanceManager): instance setup failed: Err(ZoneEnsureAddress(EnsureAddressError(EnsureAddressError { zone: "oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4", request: Static(V6(Ipv6Network { addr: fd00:1122:3344:107::2a, prefix: 64 })), name: AddrObject { interface: "oxControlInstance8", name: "omicron6" }, err: Zone execution error: Command [/usr/sbin/zlogin oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4 /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance8/ll] executed and failed with status: exit status: 1  stdout: 
      stderr: ipadm: Could not create address: Addrconf already in progress

    Caused by:
        Command [/usr/sbin/zlogin oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4 /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance8/ll] executed and failed with status: exit status: 1  stdout: 
          stderr: ipadm: Could not create address: Addrconf already in progress })))
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
09:00:02.007Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
    state = InstanceRuntimeState { run_state: Failed, sled_id: 7230a95e-44ac-42ef-8dbd-1183d39193c7, propolis_id: d00f74ec-80ea-4419-80e9-ec9b3bbf83f4, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:107::2a]:12400), migration_id: None, propolis_gen: Generation(1), ncpus: InstanceCpuCount(4), memory: ByteCount(2147483648), hostname: "web-instance-2", gen: Generation(3), time_updated: 2023-06-29T09:00:02.006892393Z }
09:00:02.052Z INFO SledAgent (dropshot (SledAgent)): request completed
    error_message_external = Internal Server Error
    error_message_internal = Failed to create address Static(V6(Ipv6Network { addr: fd00:1122:3344:107::2a, prefix: 64 })) with name oxControlInstance8/omicron6 in oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4: Zone execution error: Command [/usr/sbin/zlogin oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4 /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance8/ll] executed and failed with status: exit status: 1  stdout: \n  stderr: ipadm: Could not create address: Addrconf already in progress
    local_addr = [fd00:1122:3344:107::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:102::4]:38064
    req_id = 6206c886-bea7-4e1f-8126-54c12ea873e0
    response_code = 500
    uri = /instances/f1e6ed32-cb42-4b71-a7ac-893ac46467f1/state
09:00:02.111Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    local_addr = [fd00:1122:3344:107::1]:12345
    remote_addr = [fd00:1122:3344:102::4]:41840
09:00:02.111Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4': uninstall operation is invalid for shutting_down zones.")) }', sled-agent/src/instance.rs:535:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Jun 29 09:00:10 Stopping because all processes in service exited. ]
[ Jun 29 09:00:10 Executing stop method (:kill). ]
[ Jun 29 09:00:10 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Jun 29 09:00:10 Method "start" exited with status 0. ]
note: configured to log to "/dev/stdout"
09:00:12.732Z INFO SledAgent: Starting mg-ddm service
09:00:12.798Z INFO SledAgent: Importing mg-ddm service
    path = /opt/oxide/mg-ddm/pkg/ddm/manifest.xml
09:00:13.023Z INFO SledAgent: Setting mg-ddm interfaces
    interfaces = ("cxgbe0/ll" "cxgbe1/ll")
09:00:13.044Z INFO SledAgent: Enabling mg-ddm service
09:00:13.070Z INFO SledAgent: setting up bootstrap agent server
09:00:13.166Z INFO SledAgent: Ensuring zfs key directory exists
    path = /var/run/oxide/
09:00:13.582Z INFO SledAgent: Sending prefix to ddmd for advertisement
    DdmAdminClient = [::1]:8000
    prefix = Ipv6Prefix { addr: fdb0:a840:2504:3d5::, len: 64 }
09:00:13.688Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_ntp
09:00:13.703Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_bd5d7d9f-58ca-4350-9083-6a92a6155a65
09:00:13.714Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_47d274ce-f4cb-4bc8-990a-b1460bd918c6
09:00:13.741Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_0cf8b90b-1143-4119-9012-1188c92036f2
09:00:13.756Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_7c1992a0-3f17-4672-b141-61ccab131c16
09:00:13.773Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_b155e4f4-facd-4a7b-a464-b965fc8e8cf5
09:00:13.787Z WARN SledAgent: Deleting existing zone
...

Aside from the delete-all-zones behavior (which is already being worked on), we probably also need to address the fact that the sled-agent crashed in the face of an incompatible-state error: "uninstall operation is invalid for shutting_down zones". The error handling could be less catastrophic.

Originally posted by @askfongjojo in https://github.com/oxidecomputer/omicron/issues/3451#issuecomment-1613590320
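
For illustration only, here is a minimal sketch of what "less catastrophic" could look like: the cleanup path returns the uninstall error to its caller instead of calling `unwrap()` on it. The `AdmError` struct, `uninstall_zone`, and `halt_and_remove` below are hypothetical stand-ins that only mirror the shape of the panic message above; they are not the actual sled-agent types or functions.

```rust
use std::fmt;

/// Hypothetical stand-in for the error seen in the panic message.
#[derive(Debug)]
struct AdmError {
    op: &'static str,
    zone: String,
    err: String,
}

impl fmt::Display for AdmError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} failed for zone {}: {}", self.op, self.zone, self.err)
    }
}

impl std::error::Error for AdmError {}

/// Hypothetical stand-in for the call that shells out to `zoneadm uninstall`.
/// It simulates the rejection seen in the log for a zone that is still
/// shutting down.
fn uninstall_zone(zone: &str, state: &str) -> Result<(), AdmError> {
    if state == "shutting_down" {
        return Err(AdmError {
            op: "Uninstall",
            zone: zone.to_string(),
            err: format!("uninstall operation is invalid for {state} zones."),
        });
    }
    Ok(())
}

/// Cleanup path that propagates the failure with `?` rather than unwrapping,
/// so the caller decides whether to log, retry, or give up.
fn halt_and_remove(zone: &str, state: &str) -> Result<(), AdmError> {
    uninstall_zone(zone, state)?;
    Ok(())
}
```

A caller could then `match` on `halt_and_remove(...)` and log the `Err` case at ERRO level while the rest of the agent keeps running, rather than taking down the whole process.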

smklein commented 1 year ago

There are a couple sled agent issues to be figured out here:

  1. Why did the address allocation fail? What caused an addrconf creation to already be in progress? IMO this is the "source" of the issue.
  2. Why could Sled agent not uninstall the zone? I think it's "correct" for the Sled Agent to try to clean up state from a failed instance provision, but it clearly needs to do so without crashing.
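
One possible shape for point (2), as a rough sketch rather than a proposal for the real code: wait for the zone to finish halting before attempting the uninstall, and return an error if it never gets there. This assumes the uninstall is only accepted once the zone has settled back into the installed state. The `ZoneState` enum, `remove_when_halted`, and the two closures are hypothetical names introduced for this example, not existing sled-agent APIs.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical subset of zone states relevant to this example.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ZoneState {
    Running,
    ShuttingDown,
    Installed,
}

/// Polls the zone state and only calls `uninstall` once the zone has
/// finished halting. Returns an error instead of panicking when the zone
/// does not settle within `max_attempts` polls.
fn remove_when_halted<F, U>(
    mut get_state: F,
    mut uninstall: U,
    max_attempts: u32,
    delay: Duration,
) -> Result<(), String>
where
    F: FnMut() -> ZoneState,
    U: FnMut() -> Result<(), String>,
{
    for _ in 0..max_attempts {
        match get_state() {
            // Assumption: uninstall is valid once the zone is back to installed.
            ZoneState::Installed => return uninstall(),
            // Still running or halting: wait and poll again.
            ZoneState::Running | ZoneState::ShuttingDown => sleep(delay),
        }
    }
    Err(format!(
        "zone did not finish halting after {max_attempts} attempts"
    ))
}
```

Passing the state query and the uninstall step as closures keeps the sketch independent of how the agent actually talks to `zoneadm`, and it makes the retry logic easy to exercise in a unit test.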

smklein commented 1 year ago

I think issue (1) is related to https://github.com/oxidecomputer/crucible/issues/818, which looks like the same symptom.

smklein commented 1 year ago

With #3456 in, the specific issue causing the crash should be mitigated, but we still need additional work to make Sled Agent more resilient to the general case of "an error happened when trying to start an instance".