cyli opened this issue 9 years ago
In the context of hard failures, soft failures, error states and retries, this is a hard failure (user error).
@lvh Hmm... since we allow changing launch configurations, maybe they had no CLB configured before, and this would have been valid previously?
If we're not versioning launch configurations and converging each state to its original desired configuration, then this may not be a user error so much as an indication that this server cannot be converged and may need to be replaced?
I'm not sure whether this means we should revisit how we converge the LB state of old servers.
We should definitely blow up if the user submits a launch configuration that has a CLB configured but no ServiceNet configured, though.
Yep, I'm thinking of the case that has a CLB in the launch config but no ServiceNet. The changing-launch-configuration scenario just sounds like a special case of that: the bottom line is that you have a server that you should attach to a CLB but can't, because there's no ServiceNet.
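A minimal sketch of that validation, purely illustrative - the key names below are assumptions rather than otter's actual launch-config schema:

```python
# Illustrative validation sketch; key names are assumptions, not otter's schema.
SERVICENET_UUID = "11111111-1111-1111-1111-111111111111"  # Rackspace ServiceNet


def validate_launch_config(launch_config):
    """Reject a launch config that attaches servers to a CLB without ServiceNet."""
    args = launch_config.get("args", {})
    has_clb = bool(args.get("loadBalancers"))
    networks = args.get("server", {}).get("networks", [])
    has_servicenet = any(n.get("uuid") == SERVICENET_UUID for n in networks)
    if has_clb and not has_servicenet:
        raise ValueError(
            "Launch config attaches servers to a CLB but does not include "
            "ServiceNet, so there would be no IP to add to the load balancer."
        )
```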
Without doing too much thread necromancy, ISTR this is one of the points I was trying to make in that thread about convergence models, with tuples of launch configs and capacities. Now we have converge to capacity, which makes sense as long as you only touch the image (or cloud init, I suppose). Once you touch the networking stuff, all bets are off. The main outcome from that conversation was "be conservative", and I don't think we can reasonably solve the LB transitioning problem without also solving the rolling updates problem, at least not for the case you just mentioned. Since we don't do that yet, I'm suggesting we just give up when we hit that case.
In the long run, we should do something intelligent here. That could be a new ticket right now, or not :)
Once #869 is resolved, `None` will be passed as "no IP" instead of the empty string. I don't know how that affects the failure mode.
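For what it's worth, if the attach decision checks truthiness rather than comparing against the empty string, the change from `""` to `None` should be a no-op. A hypothetical sketch (these are not otter's actual helpers):

```python
def servicenet_address(server_details):
    """Return the server's ServiceNet IPv4 address, or None if it has none.

    Hypothetical helper: Nova reports ServiceNet under the "private" key of
    the server's "addresses" mapping.
    """
    for addr in server_details.get("addresses", {}).get("private", []):
        if addr.get("version") == 4 and addr.get("addr"):
            return addr["addr"]
    return None


def can_attach_to_clb(server_details):
    # Both "" and None are falsy, so #869's change should not alter this check.
    return bool(servicenet_address(server_details))
```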
To reconstruct the failure, you want to undo this commit: https://github.com/rackerlabs/otter/commit/9ec9b701f224033fb338df66aabbd586d458b6e7
WRT rolling updates vs. just giving up: agreed.
Can this be closed now that #879 is merged?
Probably the companion piece to #879 is to log in convergence when we encounter an old server without ServiceNet, noting that we are just going to give up on it.
I think we can probably close this once that is in, also?
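Something like the following in the convergence planner would probably be enough; the logger name and hook are illustrative only:

```python
import logging

log = logging.getLogger("otter.convergence")  # illustrative logger name


def note_unconvergeable_server(server_id):
    """Illustrative hook: called when convergence decides to give up on a server.

    Logs (rather than erroring) so operators can see which servers are being
    left off their CLBs because they have no ServiceNet address.
    """
    log.warning(
        "Server %s has no ServiceNet address; giving up on CLB convergence for it.",
        server_id,
    )
```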
We probably also want to log that in the server's metadata somehow (can be generic if needed) so that when the fine day comes that we do automated rolling upgrades, we can hit the bad servers first. (Arguably, we could just kill the server anyway, since it's not working...)
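A hypothetical sketch of the metadata piece - otter would presumably go through its own request machinery, but the underlying call is Nova's "update metadata items" API, which merges the given keys into the server's existing metadata:

```python
import requests


def mark_server_unconvergeable(nova_endpoint, auth_token, server_id, reason):
    """Record in the server's Nova metadata why convergence gave up on it.

    The metadata key name is made up here; anything generic would do, as long
    as a future rolling-upgrade pass can find the flagged servers first.
    """
    resp = requests.post(
        "{0}/servers/{1}/metadata".format(nova_endpoint, server_id),
        headers={"X-Auth-Token": auth_token},
        json={"metadata": {"otter_unconvergeable": reason}},
    )
    resp.raise_for_status()
```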
Updated the description since, now that we are doing per-server load balancer configs and not just moving every server to new load balancers, we won't encounter the old server -> load balancer issue.
But it is still possible for the user to break things by manually removing ServiceNet, so we should probably still log.
The bug discovered in https://github.com/rackerlabs/otter/pull/862 seems to indicate that if the server was created without a ServiceNet IP (or without a valid one), then no failure occurs - it just doesn't get added to any CLBs.
Possibly now, an attempt will be made to add a blank IP to a CLB. Perhaps a failure should occur instead - at the very least, an error should be logged.
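In other words, the step that builds the CLB add-node request could guard against a missing or blank address and surface an error instead of silently doing nothing. A rough sketch (the names here are illustrative):

```python
import logging

log = logging.getLogger("otter.convergence")  # illustrative logger name


def add_server_to_clb(clb_client, lb_id, server_id, servicenet_ip, port=80):
    """Illustrative guard: refuse to issue an add-node call with no address.

    `clb_client.add_node` stands in for whatever actually issues the CLB
    request; the point is only that a blank/missing IP should be an error
    (or at least an error-level log), not a silent no-op.
    """
    if not servicenet_ip:
        log.error(
            "Server %s has no ServiceNet address; cannot add it to CLB %s.",
            server_id, lb_id,
        )
        raise ValueError("refusing to add a blank IP to CLB {0}".format(lb_id))
    return clb_client.add_node(lb_id, address=servicenet_ip, port=port)
```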
Based on discussions, this should have several tasks attached: