oxidecomputer / management-gateway-service

Crates shared between MGS in omicron and its agent task in hubris
Mozilla Public License 2.0

SP reset timeout may be too short and should be runtime configurable #284

Closed jgallagher closed 5 days ago

jgallagher commented 6 days ago

We currently override whatever retry settings are configured, specifically for SP resets (https://github.com/oxidecomputer/management-gateway-service/blob/ca9f20e32acade9d15fb8d70405a6ef406c719cc/gateway-sp-comms/src/single_sp.rs#L1979-L1985), to allow up to 30 seconds. This is particularly for sidecar SPs, which have historically taken "a while".

We saw a sidecar SP reset time out today during a mupdate; a subsequent retry succeeded but took 25 seconds. A likely explanation for the timeout is that reset times vary and 30 seconds was just too short. We should raise the 30-second limit, and ideally also allow reset timeouts to be configured at runtime, as the retry timeouts already are for every other command.
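To make the proposal concrete, here is a minimal sketch of what a runtime-configurable reset timeout could look like. It is not the code in `single_sp.rs`: the `RetryConfig` struct, its field names, `reset_attempts`, and the 60-second `DEFAULT_RESET_TIMEOUT` are all hypothetical, and the sketch assumes the reset window is enforced by converting a total timeout into a number of per-attempt retries.

```rust
use std::time::Duration;

/// Placeholder default for the total reset window. The issue only says the
/// current 30 seconds should go up; 60 seconds is an assumed value.
const DEFAULT_RESET_TIMEOUT: Duration = Duration::from_secs(60);

/// Hypothetical retry settings; names are illustrative, not the crate's API.
#[derive(Debug, Clone, Copy)]
struct RetryConfig {
    /// How long to wait for a response to a single attempt.
    per_attempt_timeout: Duration,
    /// Maximum attempts for ordinary commands.
    max_attempts: usize,
    /// Optional runtime override for the total time allowed for an SP reset.
    reset_timeout: Option<Duration>,
}

impl RetryConfig {
    /// Number of attempts to allow for a reset so that the total wait covers
    /// the configured (or default) reset timeout. Never returns fewer
    /// attempts than `max_attempts`.
    fn reset_attempts(&self) -> usize {
        let total_ms = self
            .reset_timeout
            .unwrap_or(DEFAULT_RESET_TIMEOUT)
            .as_millis();
        // Guard against a zero per-attempt timeout to avoid dividing by zero.
        let per_ms = self.per_attempt_timeout.as_millis().max(1);
        // Ceiling division: round up so the window is fully covered.
        let needed = ((total_ms + per_ms - 1) / per_ms) as usize;
        needed.max(self.max_attempts)
    }
}
```

With a 2-second per-attempt timeout and no override, this yields 30 attempts (60 seconds in total). Setting `reset_timeout` to 45 seconds yields 23 attempts. Taking the maximum with `max_attempts` is a design choice in this sketch, so that a small override never makes resets retry less than other commands.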