rackerlabs / otter

Rackspace Auto Scale
http://www.rackspace.com/cloud/auto-scale/

Blow up somehow if there is no ServiceNet IP and a CLB is configured #868

Open cyli opened 9 years ago

cyli commented 9 years ago

The bug discovered in https://github.com/rackerlabs/otter/pull/862 seems to indicate that if a server was created without a ServiceNet IP (or without a valid one), no failure occurs: it just doesn't get added to any CLBs.

Possibly an attempt is now made to add a blank IP to a CLB. A failure should probably occur instead, or at the very least an error message should be logged.
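A minimal sketch of what "blow up" could look like when looking up the IP. It assumes the server details are a Nova-style dict whose `addresses["private"]` list holds the ServiceNet addresses; `NoServiceNetIP` and `servicenet_ip` are hypothetical names for illustration, not otter's actual API.

```python
class NoServiceNetIP(Exception):
    """Hypothetical error: the server has no usable ServiceNet IPv4 address."""


def servicenet_ip(server):
    """
    Return the first ServiceNet IPv4 address of a server, or raise.

    Assumes ``server`` is a Nova-style server details dict in which the
    "private" network carries the ServiceNet addresses.
    """
    for entry in server.get("addresses", {}).get("private", []):
        # An empty or missing "addr" counts as "no IP" rather than a blank one.
        if entry.get("version") == 4 and entry.get("addr"):
            return entry["addr"]
    raise NoServiceNetIP(server.get("id"))
```

Raising here means a blank string can never reach the code that adds the node to the CLB; the caller decides whether to log, retry, or give up.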

Based on discussions, this should have several tasks attached:

lvh commented 9 years ago

In the context of hard failures, soft failures, error states and retries, this is a hard failure (user error).

cyli commented 9 years ago

@lvh Hmm... since we allow changing launch configurations, maybe they had no CLB configured before, and this would have been valid previously?

If we're not versioning launch configurations and converging each state to its original desired configuration, then this may not be a user error so much as an indication that this server cannot be converged and may need to be replaced?

I'm not sure if this indicates maybe revisiting how we converge the lb state of old servers.

cyli commented 9 years ago

We should definitely blow up if the user submits a launch configuration that has a CLB configured but no ServiceNet configured, though.
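A rough sketch of that check at submission time. It assumes a `launch_server`-style config with `args.loadBalancers` and `args.server.networks`, and that omitting `networks` entirely leaves the server on the default networks (including ServiceNet). `InvalidLaunchConfiguration` and `validate_launch_config` are hypothetical names, not otter's real validation code.

```python
# Rackspace's well-known pseudo-network ID for ServiceNet.
SERVICENET_UUID = "11111111-1111-1111-1111-111111111111"


class InvalidLaunchConfiguration(Exception):
    """Hypothetical error for a launch config that can never work."""


def validate_launch_config(launch_config):
    """Raise if a CLB is configured but ServiceNet is explicitly left out."""
    args = launch_config.get("args", {})
    if not args.get("loadBalancers"):
        return  # no CLB configured, nothing to check
    networks = args.get("server", {}).get("networks")
    if networks is None:
        return  # assumption: default networks include ServiceNet
    if not any(net.get("uuid") == SERVICENET_UUID for net in networks):
        raise InvalidLaunchConfiguration(
            "A CLB is configured, but ServiceNet is not in the server's networks."
        )
```

This only catches the case the user can see up front; it does nothing for servers that lose ServiceNet after creation.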

lvh commented 9 years ago

Yep, I'm thinking of the case that has a CLB in the launch configuration but no ServiceNet. The changing launch configuration just sounds like a special case of that: bottom line is that you have a server that you should attach to a CLB but can't because there's no ServiceNet.

Without doing too much thread necromancy, ISTR this is one of the points I was trying to make in that thread about convergence models, with tuples of launch configs and capacities. Now we have converge to capacity, which makes sense as long as you only touch the image (or cloud init, I suppose). Once you touch the networking stuff, all bets are off. The main outcome from that conversation was "be conservative", and I don't think we can reasonably solve the LB transitioning problem without also solving the rolling updates problem, at least not for the case you just mentioned. Since we don't do that yet, I'm suggesting we just give up when we hit that case.

In the long run, we should do something intelligent here. That could be a new ticket right now, or not :)

lvh commented 9 years ago

Once #869 is resolved, None will be passed as "no IP" instead of the empty string. I don't know how that affects the failure mode.

lvh commented 9 years ago

To reconstruct the failure, you want to undo this commit: https://github.com/rackerlabs/otter/commit/9ec9b701f224033fb338df66aabbd586d458b6e7

cyli commented 9 years ago

WRT rolling updates vs just giving up: agree.

manishtomar commented 9 years ago

Can this be closed now that #879 is merged?

cyli commented 9 years ago

Probably the companion piece to #879 is for convergence to log when it encounters an old server without ServiceNet, and to log that we are just going to give up on it.

I think we can probably close once that is in, also?
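Something like the following is what I have in mind for the logging, as a sketch only. It uses the stdlib `logging` module rather than otter's own logging, assumes Nova-style server dicts with ServiceNet addresses under `addresses["private"]`, and `plan_clb_adds` is a made-up name.

```python
import logging

log = logging.getLogger("convergence_sketch")


def plan_clb_adds(servers, clb_id, port):
    """
    Return (clb_id, ip, port) tuples for servers that can be added to the CLB.

    Servers without a ServiceNet IPv4 address are logged and skipped, i.e.
    we give up on them instead of trying to add a blank IP.
    """
    steps = []
    for server in servers:
        private = server.get("addresses", {}).get("private", [])
        ip = next(
            (a["addr"] for a in private if a.get("version") == 4 and a.get("addr")),
            None,
        )
        if ip is None:
            log.error(
                "Server %s has no ServiceNet IP; not adding it to CLB %s",
                server.get("id"), clb_id,
            )
            continue
        steps.append((clb_id, ip, port))
    return steps
```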

lvh commented 9 years ago

We probably also want to log that in the server's metadata somehow (can be generic if needed) so that when the fine day comes that we do automated rolling upgrades, we can hit the bad servers first. (Arguably, we could just kill the server anyway, since it's not working...)

cyli commented 9 years ago

Updated the description since, now that we are doing per-server load balancer configs and not just moving every server to new load balancers, we won't encounter the old server -> load balancer issue.

But it is still possible for the user to break things by manually removing servicenet, so we should probably still log.