oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
244 stars 36 forks source link

Marking a sled non_provisionable causes existing instances to lose their VPC v2p mappings #5872

Closed askfongjojo closed 1 month ago

askfongjojo commented 3 months ago

I ran into the issue after marking a sled as non_provisionable by running

oxide system hardware sled set-provision-policy --sled-id 1efda86b-caef-489f-9792-589d7677e59a --state provisionable

My intention was to drain instances from the sled so that it could be expunged later on. After the sled policy was updated, an instance in it was no longer able to reach other private IPs in its VPC subnet. Running opteadm dump-v2p showed that the v2p mapping of this instance was gone. Once I put the sled back to provisionable, the mapping was populated again and the instance was back to being reachable from/to other instances on the same subnet.

Although setting a sled as non_provisionable could express an intent to eliminate all instances on the sled, it's also possible that the sled has some temporary problems (e.g. pending faulty component replacement) and the operator doesn't intend to remove all its existing instances (by stopping/starting them or live-migrating them). Also, it appears that the v2p mapping drop is solely because the sled query excludes all non_provisionable sleds. It's unclear if the behavior is by design.

@davepacheco / @internet-diglett - Thoughts?

davepacheco commented 3 months ago

This does seem like a bug to me and it would certainly suck to hit this unexpectedly during whatever support activity might cause us to use this flag. But I think @internet-diglett and @jgallagher understand the considerations here better than I do.

jgallagher commented 3 months ago

I think that's right; v2p mappings should still go to sleds that are in-service but not provisionable, because our definition for that policy is "still in service but new resources should not be provisioned onto it".

On the surface this looks like a bug in the v2p_mapping_view where we require s.sled_policy = 'in_service', and we could fix it in the short term by expanding that to also allow no_provision. However, we expect to add more policies in the future, and keeping that view up to date seems a little hazardous. For queries from Rust, this is why the SledFilter type exists: ideally we could add a SledFilter::V2PMappings variant, and keep the logic for which policy/state pairs match that filter in the same place. SledFilter also statically guarantees that new policies will always be updated to take all variants into consideration, since it uses an explicit match to apply itself. But I think doing this would require reworking the view - maybe the view could return the state + policy and then we could apply a filter from Rust? Although at that point I'm not sure the view is buying us much vs implementing it as a method on DataStore?

internet-diglett commented 3 months ago

@jgallagher I'm totally cool with us dropping the view and going to a rust-based implementation. I wasn't aware of these additional cases, or aware of the SledFilter type. The view might still be useful for dealing with the probes though.

internet-diglett commented 3 months ago

@askfongjojo just merged #5917, which should resolve this bug.

internet-diglett commented 2 months ago

During #5917 I accidentally broke RPW support for probes (should not impact instances), fixed in #5960