oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

prevent expunged sleds from doing stuff #6405

Open davepacheco opened 3 weeks ago

davepacheco commented 3 weeks ago

The medium-to-long-term plan here is to use trust quorum to kick the sled out of the quorum. In the meantime we rely on an operator having powered off the sled and never putting it back into the rack. It might be nice to do something extra here in the short term (but see the detailed discussion in the RFD 457 section on "Ensuring a sled is really off").

andrewjstone commented 3 weeks ago

I'm kinda wondering if I can just get back to real trust quorum after the clickhouse zone work is done.

davepacheco commented 3 weeks ago

Yeah, and it's fine if this ticket covers "use the new trust quorum to kick the sled out".

I assume in the best case that's probably months away? So I do wonder if it'd be worth, say, adding an API to sled agent that causes it to wipe its ledgers, secrets, bootstore, etc. and reboot. If that's easy to do, it might be a nice failsafe in the meantime (and it might enable some automated testing around sled expungement). If that's not easy to do, this might just be a bad idea.

andrewjstone commented 3 weeks ago

Hmm, I'm actually not sure that will work. My guess is that the LRTQ code will just retrieve its existing share or a new one from another node. We could probably make a small addition to prevent that, and maybe that's a viable option, depending on the timeline.
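To make the "small addition" idea concrete, here is a toy sketch of a peer refusing to hand a share to a sled it knows is expunged. This is not Omicron or LRTQ code: `SharePeer`, `SledId`, `ShareError`, and every method name are hypothetical, and the real bootstore protocol, ID types, and share format are not modeled.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sled identifier (assumption; not the real control-plane ID type).
type SledId = u32;

/// Reasons a peer might refuse a share request in this toy model.
#[derive(Debug, PartialEq)]
enum ShareError {
    /// The requester has been marked expunged.
    Expunged,
    /// The peer holds no share for the requester.
    Unknown,
}

/// Toy stand-in for a node that can hand key shares to other sleds.
struct SharePeer {
    shares: HashMap<SledId, Vec<u8>>,
    expunged: HashSet<SledId>,
}

impl SharePeer {
    fn new() -> Self {
        SharePeer { shares: HashMap::new(), expunged: HashSet::new() }
    }

    /// Record the share this peer would hand back to `sled`.
    fn store_share(&mut self, sled: SledId, share: Vec<u8>) {
        self.shares.insert(sled, share);
    }

    /// Mark a sled as expunged and drop any share held for it.
    fn mark_expunged(&mut self, sled: SledId) {
        self.expunged.insert(sled);
        self.shares.remove(&sled);
    }

    /// Serve a share only if the requester has not been expunged.
    fn request_share(&self, requester: SledId) -> Result<&[u8], ShareError> {
        if self.expunged.contains(&requester) {
            return Err(ShareError::Expunged);
        }
        self.shares
            .get(&requester)
            .map(|share| share.as_slice())
            .ok_or(ShareError::Unknown)
    }
}
```

In this model, a sled that wiped its own state and rebooted would call `request_share` and get `ShareError::Expunged` instead of recovering a share. How the peers would learn the expunged set is left open here.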