oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
245 stars 38 forks source link

Instance stuck in Starting with a Propolis zone but no server process #3319

Open gjcolombo opened 1 year ago

gjcolombo commented 1 year ago

Seen in dogfood after creating an instance, stopping it, and trying to start it again. I don't have better repro steps right now.

The sled agent logs on this sled rotated at around the time the relevant Propolis was created. The old log shows sled agent telling Nexus that the VMM is starting, but then the log cuts off without a clear indication of why there's no Propolis process in the zone:

// gjc: this is the instance of interest

23:22:49.756Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
    state = InstanceRuntimeState { run_state: Starting, sled_id: 585d0736-3f48-45a2-a9c0-a43d20796a96, propolis_id: 091d25a0-35cc-4556-8b14-a6f6d82fe93d, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:105::26]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(2), memory: ByteCount(8589934592), hostname: "bootme", gen: Generation(11), time_updated: 2023-06-07T23:22:49.756949218Z }
23:22:49.838Z INFO SledAgent (PortManager): Created OPTE port
    port = Port { inner: PortInner { name: "opte6", _ip: 172.30.0.5, mac: MacAddr6([168, 64, 37, 248, 158, 101]), slot: 0, vni: Vni { inner: 2783369 }, gateway: Gateway { mac: MacAddr6([168, 64, 37, 255, 119, 119]), ip: 172.30.0.1 }, vnic: "vopte6" } }
23:22:49.891Z INFO SledAgent (InstanceManager): Configuring new Omicron zone: oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
    instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
23:22:49.923Z INFO SledAgent (InstanceManager): Installing Omicron zone: oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
    instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
23:22:52.755Z INFO SledAgent (InstanceManager): Zone booting
    instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
    zone = oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
WARNING: Failed to delete OPTE port overlay VNIC while dropping port. The VNIC will not be cleaned up properly, and the xde device itself will not be deleted. Both the VNIC and the xde device must be deleted out of band, and it will not be possible to recreate the xde device until then. Error: DeleteVnicError { name: "vopte6", err: CommandFailure(CommandFailureInfo { command: "/usr/sbin/dladm delete-vnic vopte6", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "dladm: vnic deletion failed: link busy\n" }) }

// gjc: old log cuts off here
// gjc: new log begins here

BRM42220014 # looker -f ./oxide-sled-agent\:default.log   
23:57:59.026Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    local_addr = [fd00:1122:3344:105::1]:12345
    remote_addr = [fd00:1122:3344:106::4]:44379
23:57:59.026Z INFO SledAgent (InstanceManager): Mapping virtual NIC to physical host
    mapping = SetVirtualNetworkInterfaceHost { virtual_ip: 172.30.0.8, virtual_mac: MacAddr(MacAddr6([168, 64, 37, 252, 135, 30])), physical_host_ip: fd00:1122:3344:108::1, vni: Vni(7796804) }
23:57:59.026Z INFO SledAgent (dropshot (SledAgent)): request completed
    local_addr = [fd00:1122:3344:105::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:106::4]:44379
    req_id = b1839181-d657-40ac-a432-f880158cc52b
    response_code = 204
    uri = /v2p/1c626997-d332-49a2-a81e-1c5cfcd3d600

The zone exists:

BRM42220014 # zoneadm list
global
oxz_switch
oxz_ntp
oxz_crucible_oxp_53c0c042-5225-42df-81cf-6dbecf8e2582
oxz_crucible_oxp_3d9dfd38-97b4-41de-b0d0-4d2034f463b7
oxz_crucible_oxp_3a555b82-8caa-4373-8ac7-6a7b1950ec0a
oxz_crucible_oxp_8a51df51-d69c-45e2-9b71-0aa7b4fd58f4
oxz_crucible_oxp_e4d786f7-5735-4426-8c82-8b4604ae5484
oxz_crucible_oxp_a58f6b4f-b1b4-4f09-a72f-7aa17e7bda2e
oxz_crucible_oxp_9ceafe70-961e-438a-91bb-eb6eec9d6318
oxz_crucible_oxp_c24e47d1-0aae-4018-a21e-b0a63787cf9e
oxz_crucible_oxp_074fb480-5745-4a42-a3cb-b1b465cbd57f
oxz_crucible_oxp_50fd51c4-12af-4ac2-bda7-1b245705245a
oxz_propolis-server_2f422cf8-393c-4401-9ce6-820cede5dec5
oxz_propolis-server_f60ad21f-bad6-4d1d-8876-1d0529e86c13
oxz_propolis-server_3e387595-e2f6-4f51-97c9-10f81e93c3c7
oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d

But it has no Propolis service:

root@oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d:~# svcs | grep propolis
root@oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d:~# 

And there are no leftover logs from a prior invocation of the service. It's been several hours since this happened, but at least right now there's plenty of ramdisk space available on the sled:

Filesystem             Size   Used  Available Capacity  Mounted on
rpool/ROOT/ramdisk    3.81G  1.32G      2.49G    35%    /

Asking the instance to stop while it's in this state currently pushes it into a zombie Destroyed state; that's probably a result of #3260, the fix for which merged today and hasn't been picked up into dogfood yet. (The zone is successfully torn down in this case, so if the instance were instead "Stopped"--as it should be once the fix for #3260 is in place--it should be possible to start it again.)

So the things to follow up on here are:

bnaecker commented 1 year ago

We've previously seen similar situations in issues like https://github.com/oxidecomputer/omicron/issues/1115 and https://github.com/oxidecomputer/omicron/issues/1120. The Propolis zone started, but the service itself failed very quickly for reasons that were sometimes hard to see. The core problem for those seemed to be resolved by https://github.com/oxidecomputer/omicron/pull/1124. It's possible that either did not fix the issue or that this is just unrelated.