Instance going into failed state after hitting instance stop timeout

We received a field report of an instance being marked Failed at the end of a stop-instance request. The instance was ephemeral and the user intended to delete it anyway, so no data was lost. It would be a bigger problem for an instance that was meant to be kept around and powered up again later.

From the customer ticket, the suspected sequence of events was:

1. The instance began to come to a stop.
2. Propolis successfully stopped the instance and destroyed the VMM.
3. The instance runner began to execute its terminate function.
4. In the intervening 25 minutes, an API request arrived at Nexus asking to stop the instance.
5. Nexus asked sled agent to stop the instance. The request did nothing, because the instance runner was busy in InstanceRunner::terminate_inner and was not servicing new Nexus requests.
6. Nexus's request to sled agent hit its 60-second client timeout, causing the instance to go to Failed.
7. The user then deleted the instance.
8. Sled agent finally tore down the Propolis zone and published a state update to Nexus, producing the 404 Not Found we see in the Nexus logs.
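
The failure mode in this sequence can be reproduced in miniature. The sketch below is not omicron code: it uses plain std threads and channels rather than the real async runner, and `Request`, `spawn_runner`, and `stop_with_timeout` are hypothetical names made up for illustration, with durations shortened to milliseconds. It only shows how a runner that services requests one at a time makes a later stop request hit its client-side timeout while a long terminate step is still in progress.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Requests handled by the hypothetical single-task runner.
enum Request {
    /// Long-running teardown (stands in for the slow terminate path).
    Terminate { work: Duration },
    /// Stop request; the runner answers on `reply` once it gets to it.
    Stop { reply: Sender<&'static str> },
}

/// Spawns a runner that services requests strictly one at a time.
fn spawn_runner() -> Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        for req in rx {
            match req {
                // Nothing else is serviced while this sleep is in progress.
                Request::Terminate { work } => thread::sleep(work),
                Request::Stop { reply } => {
                    // The client may already have given up; ignore send errors.
                    let _ = reply.send("stopped");
                }
            }
        }
    });
    tx
}

/// Sends a stop request and waits at most `timeout` for the answer.
fn stop_with_timeout(
    runner: &Sender<Request>,
    timeout: Duration,
) -> Result<&'static str, RecvTimeoutError> {
    let (reply_tx, reply_rx) = mpsc::channel();
    runner
        .send(Request::Stop { reply: reply_tx })
        .expect("runner exited");
    reply_rx.recv_timeout(timeout)
}
```

If a `Terminate` with a 500 ms `work` duration is queued first, a `stop_with_timeout` call with a 50 ms timeout returns `Err(RecvTimeoutError::Timeout)` even though nothing is actually wrong with the runner. The same call against an idle runner returns `Ok("stopped")`.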

The time turned out to be spent on two back-to-back zone bundle creations, one for the instance in question and one for another instance on the same sled (this will be tracked in a separate issue). The problem reported here is how the interaction between sled agent and Nexus can be improved to avoid hitting the client timeout.

oxidecomputer/omicron#5235