rackerlabs / otter

Rackspace Auto Scale
http://www.rackspace.com/cloud/auto-scale/

Parse 409 errors from bulk adding to RCv3 #955

Open cyli opened 9 years ago

cyli commented 9 years ago

@kivattik and @sam-falvo discovered that if the server doesn't have the right network (the same one as the load balancer pool), adding to RCv3 will fail.

The problematic server may need to be deleted, if it's a new server. Again, if it's a new server, the group needs to go into error state, because the launch config is wrong. (We should probably try to validate this during launch config submission, but the network could also go away in the interim, so validation may not be much use.)

If it's an old server, I guess we do nothing? :|

lvh commented 9 years ago

How does it fail? I'm actually writing a ticket right now suggesting that the return type of execute_convergence is wrong because it doesn't let you specify "nope, this is wedged, stop retrying". If we can specify that from any step, at least we can give up right now.

Perhaps the signature of the resulting effects is just wrong, and it should instead be an Effect [Effect], where the child effects specify what this effect suggests should happen next.

radix commented 9 years ago

Effects intrinsically allow specifying what happens next dynamically by allowing callbacks to return more Effects. What were you thinking those Effects in the list would do and what their result would be?

I would imagine the simplest way to implement "wedging" would be to have a special exception type that a Step's Effect could result in, that would set the group state to ERROR (and a reason, of course).
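
A minimal sketch of that special-exception approach. All names here (`ConvergenceWedged`, `run_steps`, the dict standing in for group state) are hypothetical and not otter's actual API, and plain callables stand in for each Step's Effect:

```python
class ConvergenceWedged(Exception):
    """Hypothetical: raised by a step when retrying cannot help."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


def run_steps(steps, group_state):
    """Run each step in order; a wedge puts the group into ERROR and stops.

    ``steps`` are zero-argument callables standing in for Step Effects.
    ``group_state`` is a plain dict standing in for the real group state.
    """
    for step in steps:
        try:
            step()
        except ConvergenceWedged as e:
            group_state["status"] = "ERROR"
            group_state["reason"] = e.reason
            return False  # give up instead of retrying
    return True


def add_to_rcv3_wrong_network():
    # Stand-in for a bulk add that fails because the launch config is wrong.
    raise ConvergenceWedged("server is not on the load balancer pool's network")


state = {"status": "ACTIVE"}
ok = run_steps([add_to_rcv3_wrong_network], state)
```

Any other exception type would propagate unchanged, so ordinary retryable failures could still be handled by whatever retry logic already exists.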

lvh commented 9 years ago

@radix Yeah, but if I return effects from a callback, they're going to get executed unconditionally, which is a waste of time if any of the sibling effects of that convergence cycle end up resulting in a hard failure, right?

radix commented 9 years ago

Yeah... I guess I don't fully understand your idea. I guess we should discuss it on the "implement erroring of groups in convergence" ticket

lvh commented 9 years ago

Does that ticket exist yet, or is that the one I'm writing?

radix commented 9 years ago

the one you're writing :)

manishtomar commented 9 years ago

> If it's an old server, I guess we do nothing? :|

@cyli This is w.r.t convergence only right? If so, then I am guessing yes - we do nothing. However, we should log this based on outcome of #884.

cyli commented 9 years ago

@manishtomar Yes, that's true. :) And yes, it should definitely log.

sam-falvo commented 9 years ago

I could well be blowing smoke with this message, but I just had an idea.

One of the most commonly used metaphors for the effects system is that they're just like Haskell monads. The nice thing about Haskell monads is that they support programmable semicolons. I wrote http://sam-falvo.github.io/2007/03/22/haskell-monads-another-view/ to help record that epiphanal moment for me.

Why not have the effects dispatcher call a common, perhaps even well-known, callback whose job it is to detect a wedge, and report back the results to let the effects system make the choice of whether or not to continue executing effects? That way, you specify the error-detection (and global handling thereof) only once, perhaps even abstracted away from the user of the effects system, while at the same time retaining the ability to oversee proper execution regardless of how long or complex the effects chain is.
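
A rough sketch of what such a shared wedge-detection callback might look like. `run_chain` and `detect_409` are hypothetical names, the effects are plain callables rather than real Effect objects, and the result dicts are made up for illustration:

```python
def run_chain(effects, detect_wedge):
    """Run effects in order, consulting ``detect_wedge`` after each one.

    ``effects`` are zero-argument callables standing in for real Effects.
    ``detect_wedge(result)`` returns a reason string if the chain is
    wedged, or None to keep going.
    """
    results = []
    for effect in effects:
        result = effect()
        results.append(result)
        reason = detect_wedge(result)
        if reason is not None:
            return results, reason  # stop: remaining effects never run
    return results, None


def detect_409(result):
    # Hypothetical policy: treat any 409 result as a wedge.
    if result.get("code") == 409:
        return "; ".join(result.get("errors", []))
    return None


effects = [
    lambda: {"code": 201},
    lambda: {"code": 409, "errors": ["some RCv3 error"]},
    lambda: {"code": 201},  # never reached
]
results, reason = run_chain(effects, detect_409)
```

The detection policy lives in one place, but it only sees each result, so any step-specific interpretation would still have to be encoded in the result itself.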




lvh commented 9 years ago

@sam-falvo: But figuring out what the error is is (at least) dependent on the kind of step, so...

lvh commented 9 years ago

FYI, I'm mostly solving this problem as a part of #820, that is, most of the logic for figuring out what's gone wrong will be in there.

lvh commented 9 years ago

Here are some of the documented cases I'm trying to cover:

cyli commented 9 years ago

So after running some manual tests, it seems like:

I've sent an email to the RCv3 team to see if it is possible to even do this any more

cyli commented 9 years ago

The reply is:

Not going to close this as invalid, as there are still errors to process.

But also, I've seen:

{
    "errors": [
        "Cloud Server b69a3f62-d4d6-4380-9c15-7d5f32ab5c18 does not exist"
    ]
}

If the ID is invalid (which it shouldn't be).
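
For illustration, a body like the one above could be parsed roughly as follows. This is a sketch only: the pattern is derived from that single observed message, `missing_server_ids` is a hypothetical helper, and other RCv3 error formats are not covered:

```python
import json
import re

# Pattern based only on the one message shown above.
_SERVER_MISSING = re.compile(
    r"^Cloud Server (?P<server_id>[0-9a-f-]{36}) does not exist$")


def missing_server_ids(body):
    """Return the server IDs that an error body reports as nonexistent."""
    ids = []
    for message in json.loads(body).get("errors", []):
        match = _SERVER_MISSING.match(message)
        if match:
            ids.append(match.group("server_id"))
    return ids


body = ('{"errors": ["Cloud Server b69a3f62-d4d6-4380-9c15-7d5f32ab5c18 '
        'does not exist"]}')
ids = missing_server_ids(body)
```

Messages that do not match the pattern are skipped rather than raising, so unknown errors would need to be handled (and logged) separately.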

I've also tried adding a deleted server to a RCv3 node, and it succeeds... :| Then deleting it fails.