I think this is not a bug, but rather a side effect of telling the planner that you want 4 Nexus instances on what is effectively a 3-node system. The ordering here is:
Inputs to the planner in step 6 include:
What should the planner do given this set of inputs? The current implementation (which I think is correct based on conversations we've had so far) is that when generating a plan, if a sled doesn't yet have an NTP zone, the planner should give it an NTP zone and nothing else. So the planner does that, but then sees that the policy says there should be 4 Nexus instances. Therefore, it allocates an additional Nexus instance to one of the eligible sleds, which is one of the three that already have a Nexus instance.
If this were a larger system with at least one "eligible for service zones" sled that did not already have a Nexus, the planner would've put the Nexus there instead of on a sled that already has a Nexus. In this system there are no such sleds, so it's forced to double up. I don't think it would be correct for the planner to attempt to give a new sled an NTP zone and a Nexus zone in the same plan generation, right?
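To make the ordering concrete, here's a minimal sketch of a single planning pass under the rules described above. The types and names (Sled, ZoneKind, plan_one_pass) are made up for illustration and are not the real planner code; the point is just that a sled without NTP only receives NTP in this pass, so the extra Nexus has to land on a sled that already runs one.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum ZoneKind {
    Ntp,
    Nexus,
}

struct Sled {
    name: &'static str,
    zones: Vec<ZoneKind>,
}

impl Sled {
    fn has(&self, kind: ZoneKind) -> bool {
        self.zones.contains(&kind)
    }
    fn count(&self, kind: ZoneKind) -> usize {
        self.zones.iter().filter(|z| **z == kind).count()
    }
}

/// One planning pass: a sled with no NTP zone gets an NTP zone and nothing
/// else; then Nexus is topped up to the policy target on the remaining
/// sleds, preferring sleds with the fewest Nexus zones.
fn plan_one_pass(sleds: &mut [Sled], nexus_target: usize) {
    // Sleds lacking NTP at the start of the pass get only NTP this pass,
    // and are excluded from service-zone placement below.
    let mut ntp_only = Vec::new();
    for (i, sled) in sleds.iter_mut().enumerate() {
        if !sled.has(ZoneKind::Ntp) {
            sled.zones.push(ZoneKind::Ntp);
            ntp_only.push(i);
        }
    }

    // Top up Nexus to the policy target on the eligible sleds.
    let mut nexus_count: usize =
        sleds.iter().map(|s| s.count(ZoneKind::Nexus)).sum();
    while nexus_count < nexus_target {
        let target = sleds
            .iter_mut()
            .enumerate()
            .filter(|(i, _)| !ntp_only.contains(i))
            .map(|(_, s)| s)
            .min_by_key(|s| s.count(ZoneKind::Nexus));
        match target {
            Some(sled) => {
                sled.zones.push(ZoneKind::Nexus);
                nexus_count += 1;
            }
            // No eligible sled left; the plan simply comes up short.
            None => break,
        }
    }
}

fn main() {
    // Three sleds that already run a Nexus, plus the newly added g2.
    let mut sleds = vec![
        Sled { name: "g0", zones: vec![ZoneKind::Ntp, ZoneKind::Nexus] },
        Sled { name: "g1", zones: vec![ZoneKind::Ntp, ZoneKind::Nexus] },
        Sled { name: "g3", zones: vec![ZoneKind::Ntp, ZoneKind::Nexus] },
        Sled { name: "g2", zones: vec![] },
    ];
    // Policy asks for 4 Nexus instances; g2 only gets NTP this pass, so
    // the 4th Nexus doubles up on one of the existing sleds.
    plan_one_pass(&mut sleds, 4);
    for s in &sleds {
        println!("{}: {} Nexus zone(s)", s.name, s.count(ZoneKind::Nexus));
    }
}
```

With three sleds that each already run a Nexus, a freshly added g2, and a target of 4, this sketch reproduces the doubling-up: g2 gets only NTP, and the fourth Nexus goes to a sled that already has one.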
Couple more thoughts:
I had the same exact thought last night with respect to why this happened @jgallagher. Thanks for confirming.
While it's not strictly a bug in the current logic, the behavior leaves something to be desired, at least on small clusters. I'm not sure how much of a special case this actually is for production systems or other large clusters. For NTP, DNS, CRDB, Nexus, Clickhouse, and possibly more, we always want nodes/replicas on separate sleds. The only reason that doesn't happen in testing is the small cluster size of our testbeds and rack test systems. On a larger cluster we'd just go ahead and put the Nexus on one of the existing sleds that doesn't already have one.
So I think I concur with you that there's no bug here and no real reason to change the behavior. Adding a special case could cause unanticipated consequences.
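For what it's worth, the property being described, at most one zone of a given replicated kind per sled, is easy to state as a check. The sketch below uses made-up types rather than omicron code, but it shows exactly the condition a 3-node cluster asked for 4 Nexus instances cannot satisfy.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum ZoneKind {
    Ntp,
    Dns,
    CockroachDb,
    Nexus,
    Clickhouse,
}

/// Every (sled, kind, count) where a sled runs more than one zone of the
/// same kind, i.e. every place the "separate sleds" preference is violated.
fn colocated_replicas<'a>(
    zones_by_sled: &HashMap<&'a str, Vec<ZoneKind>>,
) -> Vec<(&'a str, ZoneKind, usize)> {
    let mut violations = Vec::new();
    for (sled, zones) in zones_by_sled {
        let mut counts: HashMap<ZoneKind, usize> = HashMap::new();
        for kind in zones {
            *counts.entry(*kind).or_insert(0) += 1;
        }
        for (kind, count) in counts {
            if count > 1 {
                violations.push((*sled, kind, count));
            }
        }
    }
    violations
}

fn main() {
    // The state from this issue: three sleds with one Nexus each, then a
    // fourth Nexus doubled up on g3.
    let mut zones: HashMap<&str, Vec<ZoneKind>> = HashMap::new();
    zones.insert("g0", vec![ZoneKind::Ntp, ZoneKind::Nexus]);
    zones.insert("g1", vec![ZoneKind::Ntp, ZoneKind::Nexus]);
    zones.insert("g3", vec![ZoneKind::Ntp, ZoneKind::Nexus, ZoneKind::Nexus]);
    zones.insert("g2", vec![ZoneKind::Ntp]);
    println!("{:?}", colocated_replicas(&zones)); // [("g3", Nexus, 2)]
}
```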
Another thought I had is that maybe we no longer need to order zones from the blueprint since https://github.com/oxidecomputer/omicron/pull/5012. I think this means we can go ahead and put all the zones we want to run on an empty sled together in a single OmicronZonesRequest. I'm not sure if this would result in trying to use that Nexus before it's ready, though. In practice that can always happen anyway, due to issues with the network or with the sled where Nexus is running, so I don't think it's a big deal.
My understanding was that if the fourth node made it far enough to have its NTP zone deployed, the planner would put the Nexus instance on that one rather than the others. So the edge case here really is: the operator started with a 3-node cluster and added a 4th. If they had a smaller cluster, the overlap in services would be unavoidable. If they had a larger one, they'd get Nexus instances on separate nodes during RSS (well, assuming the RSS policy was also N=4). So I agree it's not worth doing anything special for this.
Closing as this is not a bug. However, I'm still curious about the following:
Another thought I had is that maybe we no longer need to order zones from the blueprint since https://github.com/oxidecomputer/omicron/pull/5012. I think this means we can go ahead and put all the zones we want to run on an empty sled together in a single OmicronZonesRequest. I'm not sure if this would result in trying to use that Nexus before it's ready, though. In practice that can always happen anyway, due to issues with the network or with the sled where Nexus is running, so I don't think it's a big deal.
Yeah, I'm curious about that too. I think the long-term behavior we talked about for sled agent is that we can give it all the zones, and it will start NTP first, wait for timesync, and then start the rest. It could also do this by starting everything in parallel, failing the zones that require timesync, and retrying. I'm not sure how #5012 changed all this, particularly the behavior of rejecting requests with non-NTP zones before time is sync'd.
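A rough sketch of that long-term behavior, using hypothetical functions rather than the actual sled-agent API: accept the full set of zones in one request, start NTP first, and hold back the zones that need timesync until it completes (a real implementation would likely do this asynchronously and retry rather than block).

```rust
use std::{thread, time::Duration};

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ZoneKind {
    Ntp,
    Crucible,
    Nexus,
}

// Placeholder for the real zone-launching logic.
fn start_zone(kind: ZoneKind) -> Result<(), String> {
    println!("starting {kind:?}");
    Ok(())
}

// Placeholder; the real system would query the NTP zone / clock state.
fn timesync_complete() -> bool {
    true
}

/// Start everything in `requested`, but gate non-NTP zones on timesync.
fn apply_zone_request(requested: &[ZoneKind]) -> Result<(), String> {
    // NTP first: nothing else should start until the clock is synced.
    for kind in requested.iter().filter(|k| **k == ZoneKind::Ntp) {
        start_zone(*kind)?;
    }

    // Wait for timesync before launching the rest.
    while !timesync_complete() {
        thread::sleep(Duration::from_secs(1));
    }

    for kind in requested.iter().filter(|k| **k != ZoneKind::Ntp) {
        start_zone(*kind)?;
    }
    Ok(())
}

fn main() {
    apply_zone_request(&[ZoneKind::Ntp, ZoneKind::Crucible, ZoneKind::Nexus])
        .expect("zone request failed");
}
```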
I created a 3-node testbed off the code on main, and then added node g2. I ran the planner enough times to deploy NTP and Crucible zones to sled g2, then ran the planner again to try to get it to place the Nexus zone on g2. Prior to testbed launch I had modified the code as instructed by @jgallagher so that there would be a 4th Nexus instance for the planner to place. When looking at blueprints I noticed the zone wasn't being placed on sled g2, and that the generation number for that sled was stuck at 3 with only Crucible and NTP zones. I then ran omdb nexus blueprints regenerate a few times to see if the Nexus zone would get placed on g2 after some time. I then realized that there was a second Nexus running on sled g3. I ran some commands backtracking through the blueprints to see if the second Nexus was actually placed after the first blueprint was constructed, and it looks like it was. So it appears we have a bug in the planner. The omdb commands I used to diagnose this after a few calls to regenerate are below.