Open gjcolombo opened 1 year ago
We've previously seen similar situations in issues like https://github.com/oxidecomputer/omicron/issues/1115 and https://github.com/oxidecomputer/omicron/issues/1120. In those cases the Propolis zone started, but the service itself failed very quickly for reasons that were sometimes hard to see. The core problem for those seemed to be resolved by https://github.com/oxidecomputer/omicron/pull/1124. It's possible either that the PR did not fix the issue or that this is unrelated.
Seen in dogfood after creating an instance, stopping it, and trying to start it again. I don't have better repro steps right now.
The sled agent logs on this sled rotated at around the time the relevant Propolis was created. The old log shows sled agent telling Nexus that the VMM is starting, but then the log cuts off without a clear indication of why there's no Propolis process in the zone:
The zone exists:
But it has no Propolis service:
And there are no leftover logs from a prior invocation of the service. It's been several hours since this happened, but at least right now there's plenty of ramdisk space available on the sled:
Asking the instance to stop while it's in this state currently pushes it into a zombie Destroyed state; that's probably a result of #3260, the fix for which merged today and hasn't been picked up into dogfood yet. (The zone is successfully torn down in this case, so if the instance were instead "Stopped", as it should be once the fix for #3260 is in place, it should be possible to start it again.)
So the things to follow up on here are: