oxidecomputer / management-gateway-service

Crates shared between MGS in omicron and its agent task in hubris
Mozilla Public License 2.0
3 stars 3 forks source link

`SingleSp::set_power_state` should probably be a bit more idempotent #270

Open hawkw opened 2 months ago

hawkw commented 2 months ago

At present, if one attempts to set a SP to a power state that it's already in, one receives a 503 Service Unavailable. For instance, attempting to update Sled 14 on madrid failed when wicketd attempted to put it in A2, because the system was already in A2:

[         mupdate] [sled 14 00:00:00]   Running  7/19) Setting host power state to A2
[         mupdate] [sled 14 00:00:00]    Failed  7/19) Setting host power state to A2: after 2.38ms: updating power state failed
[         mupdate] [sled 14 00:00:00]             Caused by:
[         mupdate] [sled 14 00:00:00]             - Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "f64e6790-ccfa-43e0-a629-f1ee71d32c5b", "content-length": "233", "date": "Sun, 28 Dec 1986 00:05:40 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 14 }: Error response from SP: power state error (code 1))", request_id: "f64e6790-ccfa-43e0-a629-f1ee71d32c5b" }
[         mupdate] [sled 14 00:00:00]  Terminal Execution failed

This appears to be because the gimlet-seq task in Hubris returns an error for A2 -> A2 transitions. See this match: https://github.com/oxidecomputer/hubris/blob/2f24dd4419634ca95742de901d18e52ff14b912b/drv/gimlet-seq-server/src/main.rs#L696-L918

gateway-sp-comms' SingleSp::set_power_state method just tries to perform the RPC, bubbling up the error from the SP without checking whether it's already in the desired state:

https://github.com/oxidecomputer/management-gateway-service/blob/7518095686028d80b8125de62ba7d179cbc2a683/gateway-sp-comms/src/single_sp.rs#L652-L657

aaaaaaand finally, over in Omicron, the MGS HTTP API just bubbles up whatever error SingleSp::set_power_state returns: https://github.com/oxidecomputer/omicron/blob/d96ea7ca8945b8ad78a53fd083850ea39789e5f0/gateway/src/http_entrypoints.rs#L623-L637

We should probably make this method succeed when the host is already in the desired state, instead. Although we could make the sequencer task in Hubris responsible for this, it seems to me that it might make more sense to do it here in MGS?