oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Instance going into failed state after hitting instance stop timeout #5235

Open askfongjojo opened 6 months ago

askfongjojo commented 6 months ago

We got a report from the field about an instance being marked Failed at the end of a stop-instance request. The instance happened to be ephemeral and the user intended to delete it anyway, so there was no data loss. It could be a bigger problem, however, if the instance were meant to be kept around and powered up again later.

From the customer ticket, the suspected sequence of events was:

  1. The instance began to come to a stop
  2. Propolis successfully stopped the instance and destroyed the VMM
  3. The instance runner began to execute its terminate function
  4. In the intervening 25 minutes, an API request came to Nexus asking to stop the instance
  5. Nexus asked sled agent to stop the instance; the request had no effect and timed out because the instance runner was busy in InstanceRunner::terminate_inner and so was not servicing new Nexus requests
  6. Nexus's request to sled agent hit its 60-second client timeout, causing the instance to go to Failed
  7. After this the user deleted the instance
  8. Sled agent finally decided to tear down the Propolis zone and publish a state update to Nexus, producing the 404 Not Found we see in the Nexus logs
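
The core of this sequence (a single runner that cannot answer requests while it is busy, and a caller that gives up after a fixed timeout) can be reproduced in miniature. The following is only a sketch using the Rust standard library, not the actual sled agent or Nexus code: `Request`, `StopOutcome`, `spawn_runner`, and `request_stop` are hypothetical names, threads and channels stand in for the real async tasks and HTTP client, and the durations are scaled down from minutes to milliseconds.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical request type: the caller supplies a channel for the reply.
enum Request {
    Stop { reply: Sender<&'static str> },
}

/// Hypothetical outcome as seen by the caller (Nexus, in the issue).
#[derive(Debug, PartialEq)]
enum StopOutcome {
    Acknowledged,
    TimedOut,
}

/// Spawns a runner that blocks for `busy_for` (standing in for the long
/// terminate step) before it starts servicing its request queue.
fn spawn_runner(busy_for: Duration) -> Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // While this sleep runs, queued requests are not looked at.
        thread::sleep(busy_for);
        for req in rx {
            match req {
                Request::Stop { reply } => {
                    // The caller may already have given up; ignore the error.
                    let _ = reply.send("stopped");
                }
            }
        }
    });
    tx
}

/// Sends a stop request and waits at most `client_timeout` for the reply.
fn request_stop(runner: &Sender<Request>, client_timeout: Duration) -> StopOutcome {
    let (reply_tx, reply_rx) = mpsc::channel();
    runner
        .send(Request::Stop { reply: reply_tx })
        .expect("runner thread has exited");
    match reply_rx.recv_timeout(client_timeout) {
        Ok(_) => StopOutcome::Acknowledged,
        // A timeout and a dropped reply channel are treated alike for brevity.
        Err(_) => StopOutcome::TimedOut,
    }
}

fn main() {
    // Runner is busy for 300 ms; the caller only waits 50 ms.
    let busy = spawn_runner(Duration::from_millis(300));
    println!("busy runner: {:?}", request_stop(&busy, Duration::from_millis(50)));

    // Runner is idle, so the reply arrives well within the timeout.
    let idle = spawn_runner(Duration::ZERO);
    println!("idle runner: {:?}", request_stop(&idle, Duration::from_secs(2)));
}
```

With these durations the first call is expected to time out and the second to be acknowledged. Note that the busy runner still processes the queued request later, after the caller has stopped listening, which loosely mirrors the late state update in step 8.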

The time turned out to be spent on two back-to-back zone bundle creations, one for the instance in question and one for another instance on the same sled (which will be tracked in a separate issue). The problem reported here is about how the interaction between sled agent and Nexus can be improved to avoid hitting the client timeout.
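
As one illustration of what such an improvement might look like (an assumption made for discussion, not the approach settled on in this issue or elsewhere), the runner could hand the slow teardown to a separate worker and keep answering requests with an "in progress" status. The sketch below uses only the Rust standard library; `Request`, `Status`, `spawn_runner`, and `request_stop` are hypothetical names, and a sleep stands in for zone bundle creation.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical request type: the caller supplies a channel for the reply.
enum Request {
    Stop { reply: Sender<Status> },
}

/// Hypothetical status reported back to the caller.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Status {
    Terminating,
    Terminated,
}

/// Spawns a runner whose slow teardown (`teardown` long) runs on its own
/// worker thread, so the request loop stays responsive in the meantime.
fn spawn_runner(teardown: Duration) -> Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // The sleep stands in for the slow work (e.g. zone bundle creation).
        let worker = thread::spawn(move || thread::sleep(teardown));
        for req in rx {
            match req {
                Request::Stop { reply } => {
                    // Report progress instead of blocking until teardown ends.
                    let status = if worker.is_finished() {
                        Status::Terminated
                    } else {
                        Status::Terminating
                    };
                    let _ = reply.send(status);
                }
            }
        }
    });
    tx
}

/// Sends a stop request; returns None if no reply arrives in time.
fn request_stop(runner: &Sender<Request>, client_timeout: Duration) -> Option<Status> {
    let (reply_tx, reply_rx) = mpsc::channel();
    runner.send(Request::Stop { reply: reply_tx }).ok()?;
    reply_rx.recv_timeout(client_timeout).ok()
}

fn main() {
    let runner = spawn_runner(Duration::from_millis(300));
    // Asked during teardown: the runner answers right away.
    println!("{:?}", request_stop(&runner, Duration::from_secs(1)));
    // Asked again after teardown has had time to finish.
    thread::sleep(Duration::from_millis(600));
    println!("{:?}", request_stop(&runner, Duration::from_secs(1)));
}
```

The design point is only that the caller gets a prompt, truthful answer rather than silence; how the caller should treat a "terminating" reply is a separate question that this sketch does not address.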

gjcolombo commented 6 months ago

See #5237 (I cross-posted with this issue; mea culpa) for more discussion of how this could be improved in sled agent.