Open cyli opened 9 years ago
How does it fail? I'm actually writing a ticket right now suggesting that the return type of execute_convergence is wrong because it doesn't let you specify "nope, this is wedged, stop retrying". If we can specify that from any step, at least we can give up right now.
Perhaps the signature of the resulting effects is just wrong, and it should instead be an Effect [Effect], where the child effects specify what this effect suggests should happen next.
Effects intrinsically allow specifying what happens next dynamically by allowing callbacks to return more Effects. What were you thinking those Effects in the list would do and what their result would be?
I would imagine the simplest way to implement "wedging" would be to have a special exception type that a Step's Effect could result in, that would set the group state to ERROR (and a reason, of course).
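To make the "special exception type" idea concrete, here is a rough sketch in plain Python. It does not use the real effects machinery: `ConvergenceWedged`, `run_steps`, and the `"ACTIVE"`/`"ERROR"` status strings are all hypothetical names, and each step is modelled as a zero-argument callable standing in for performing a Step's Effect.

```python
class ConvergenceWedged(Exception):
    """Hypothetical exception: a step raises this when retrying cannot help."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


def run_steps(steps):
    """Run every step and return (group_status, reasons).

    Each step is a zero-argument callable (a stand-in for performing a
    Step's Effect). Any step that raises ConvergenceWedged contributes its
    reason, and the group is reported as "ERROR" instead of being retried.
    Other exceptions are deliberately not caught here.
    """
    reasons = []
    for step in steps:
        try:
            step()
        except ConvergenceWedged as exc:
            reasons.append(exc.reason)
    if reasons:
        return ("ERROR", reasons)
    return ("ACTIVE", [])
```

Note that this version still runs every step before deciding. It only collects the reasons, which is the simplest behaviour and may not be what we want for sibling effects.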
@radix Yeah, but if I return effects from a callback, they're going to get executed unconditionally, which is a waste of time if any of the sibling effects of that convergence cycle end up resulting in a hard failure, right?
Yeah... I guess I don't fully understand your idea. I guess we should discuss it on the "implement erroring of groups in convergence" ticket
Does that ticket exist yet, or is that the one I'm writing?
the one you're writing :)
If it's an old server, I guess we do nothing? :|
@cyli This is w.r.t. convergence only, right? If so, then I am guessing yes - we do nothing. However, we should log this based on the outcome of #884.
@manishtomar Yes, that's true. :) And yes, it should definitely log.
I could well be blowing smoke with this message, but I just had an idea.
One of the most commonly used metaphors for the effects system is that they're just like Haskell monads. The nice thing about Haskell monads is that they support programmable semicolons. I wrote http://sam-falvo.github.io/2007/03/22/haskell-monads-another-view/ to help record that epiphanal moment for me.
Why not have the effects dispatcher call a common, perhaps even well-known, callback whose job it is to detect a wedge, and report back the results to let the effects system make the choice of whether or not to continue executing effects? That way, you specify the error-detection (and global handling thereof) only once, perhaps even abstracted away from the user of the effects system, while at the same time retaining the ability to oversee proper execution regardless of how long or complex the effects chain is.
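A minimal sketch of what I mean, in plain Python rather than against the real effects dispatcher. Everything here is an assumption for illustration: `perform_all` and `is_wedged` are made-up names, and each "effect" is just a zero-argument callable.

```python
def perform_all(effects, is_wedged):
    """Perform effects in order, consulting one shared wedge detector.

    effects:   list of zero-argument callables (stand-ins for effects).
    is_wedged: callback taking an outcome tuple, ("ok", value) or
               ("error", exception), and returning True if execution
               should stop.

    Returns (outcomes, stopped_early).
    """
    outcomes = []
    for effect in effects:
        try:
            outcome = ("ok", effect())
        except Exception as exc:
            outcome = ("error", exc)
        outcomes.append(outcome)
        # The wedge check is written once, here, not in every effect chain.
        if is_wedged(outcome):
            return outcomes, True
    return outcomes, False
```

The detector sees every outcome no matter how long the chain is, so the remaining effects are skipped as soon as it reports a wedge.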
From: lvh [notifications@github.com] Sent: Friday, January 23, 2015 10:14 AM To: rackerlabs/otter Cc: Sam Falvo Subject: Re: [otter] Parse 409 errors from bulk adding to RCv3 (#955)
@radix Yeah, but if I return effects from a callback, they're going to get executed unconditionally, which is a waste of time if any of the sibling effects of that convergence cycle end up resulting in a hard failure, right?
@sam-falvo: But figuring out what the error is is (at least) dependent on the kind of step, so...
FYI, I'm mostly solving this problem as a part of #820, that is, most of the logic for figuring out what's gone wrong will be in there.
Here are some of the documented cases I'm trying to cover:
So after running some manual tests, it seems like:
I've sent an email to the RCv3 team to see if it is possible to even do this any more.
The reply is:
Not going to close this as invalid, as there are still errors to process.
But also, I've seen:
{
  "errors": [
    "Cloud Server b69a3f62-d4d6-4380-9c15-7d5f32ab5c18 does not exist"
  ]
}
If the ID is invalid (which it shouldn't be).
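For what it's worth, pulling the server IDs out of a body like that could look roughly like the sketch below. The function name `missing_server_ids` is hypothetical, and the regex assumes the message always has exactly the wording shown above, which I have only seen in this one response.

```python
import json
import re

# Assumption: the message format matches the single example seen so far.
_NO_SUCH_SERVER = re.compile(
    r"^Cloud Server ([0-9a-fA-F-]{36}) does not exist$")


def missing_server_ids(body):
    """Split an error response body into (missing_ids, unrecognized).

    body is the raw JSON text of the response. Messages matching the
    "does not exist" pattern yield their server ID; everything else is
    returned untouched so the caller can log it or treat it as unknown.
    """
    errors = json.loads(body).get("errors", [])
    missing_ids, unrecognized = [], []
    for message in errors:
        match = _NO_SUCH_SERVER.match(message)
        if match:
            missing_ids.append(match.group(1))
        else:
            unrecognized.append(message)
    return missing_ids, unrecognized
```

Keeping the unrecognized messages separate means any error string we haven't catalogued yet doesn't get silently dropped.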
I've also tried adding a deleted server to an RCv3 node, and it succeeds... :| Then deleting it fails.
@kivattik and @sam-falvo discovered that if the server doesn't have the right network (the same one as the load balancer pool), adding to RCv3 will fail.
The problematic server may need to be deleted, if it's a new server. Again, if it's a new server, the group needs to go into error state, because the launch config is wrong. (We should probably try to validate this during launch config submission, but it might be possible for the network to go away in the interim too, so validation may not be much use.)
If it's an old server, I guess we do nothing? :|