oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

New instances were placed on an incompletely initialized sled #5530

Open askfongjojo opened 5 months ago

askfongjojo commented 5 months ago

A newly added sled that failed to get its NTP zone created (#5502) was still used by nexus for new instance placement. Its state was set to active in the sled table as soon as it had gone through the omdb --destructive nexus sleds add step.

According to RFD 457, the process should involve a state transition from initializing to in-service to prevent using a sled that is partially initialized for customer workload or other service zones.

jgallagher commented 5 months ago

> According to RFD 457, the process should involve a state transition from initializing to in-service to prevent using a sled that is partially initialized for customer workload or other service zones.

We don't yet have an initializing sled state; having one (and using it correctly) should fix this.

I wanted to clarify one thing though - particularly because of #5502, this sled should not have been eligible for new crucible regions - the sled was only chosen to run instances, not place disks, right?
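To make the initializing to in-service idea concrete, here is a minimal sketch of how a placement filter could skip sleds that have not finished setup. Everything in it (the SledState variants, the Sled struct, mark_in_service, placement_candidates) is a hypothetical name chosen for illustration. It is not the actual omicron schema or the design in RFD 457.

```rust
/// Hypothetical lifecycle states for a sled; names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SledState {
    /// Added to the rack, but setup (for example the NTP zone) is not finished.
    Initializing,
    /// Fully set up and eligible for new workloads.
    InService,
}

/// Hypothetical, heavily simplified sled record.
#[derive(Debug, Clone)]
pub struct Sled {
    pub id: u32,
    pub state: SledState,
}

impl Sled {
    /// A new sled starts out as Initializing rather than directly in service.
    pub fn new(id: u32) -> Self {
        Sled { id, state: SledState::Initializing }
    }

    /// Assumed to be called only after initialization is confirmed complete.
    pub fn mark_in_service(&mut self) {
        self.state = SledState::InService;
    }
}

/// Returns the ids of sleds that may receive new instances.
/// Sleds that are still initializing are filtered out.
pub fn placement_candidates(sleds: &[Sled]) -> Vec<u32> {
    sleds
        .iter()
        .filter(|s| s.state == SledState::InService)
        .map(|s| s.id)
        .collect()
}
```

With a filter like this, a sled whose NTP zone never came up would stay in the initializing state and would never be returned as a candidate.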

askfongjojo commented 5 months ago

That's correct. Disks didn't land on the sled because the dataset records hadn't been inserted into the CRDB table that the disk placement query runs against.
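As a rough illustration of why the disk path was protected only as a side effect: if disk placement picks candidates from dataset records, a sled with no records can never be chosen. The DatasetRecord type and disk_candidates function below are hypothetical stand-ins and do not reflect the real CRDB schema or query.

```rust
use std::collections::HashSet;

/// Hypothetical dataset record: one row per dataset, tagged with its host sled.
pub struct DatasetRecord {
    pub sled_id: u32,
}

/// Sleds eligible for disk placement under this sketch: only sleds that
/// appear in at least one dataset record.
pub fn disk_candidates(datasets: &[DatasetRecord]) -> HashSet<u32> {
    datasets.iter().map(|d| d.sled_id).collect()
}
```

Instance placement has no equivalent per-sled record to fall back on in this sketch, so it needs an explicit state check.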

davepacheco commented 4 months ago

In today's update call we discussed whether this was an R8 blocker and concluded that it wasn't, because in practice, for this release, we'll be adding a sled during a maintenance window, when provisioning will not be enabled. We won't re-enable provisioning while the sled is in this intermediate state.