What do we do when convergence fails when cleaning up a deleting group?

rackerlabs / otter

Rackspace Auto Scale


Let's say, for example, that the user force-deletes a group, and convergence is going through and cleaning up all the servers.

Now suppose the CLB is in ERROR state, or some other unrecoverable error comes from CLB. Convergence fails, and the server can't be cleaned up.

The group is already marked DELETING, so it won't show up again in lists. Nothing can trigger convergence on it again.

What do we do in this case?

Also related: @manishtomar points out that this is a cause of NoSuchScalingGroup errors when converging. The converge cycle results in a FAILURE, so the converger tries to write that state to the database. But the group is in DELETING (modify state does not write to deleting groups), so the write fails.
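To make that failure mode concrete, here is a minimal sketch of it, not otter's actual code. `GroupStore`, `modify_state`, and `record_converge_result` are hypothetical stand-ins; the only behavior taken from the discussion is that a state write against a DELETING group fails with a NoSuchScalingGroup error.

```python
class NoSuchScalingGroup(Exception):
    """Hypothetical stand-in for the error of the same name."""


class GroupStore:
    """Toy in-memory store; assumed shape, not the real persistence layer."""

    def __init__(self):
        # group_id -> {"status": "ACTIVE" | "DELETING", "last_result": ...}
        self.groups = {}

    def modify_state(self, group_id, **changes):
        group = self.groups.get(group_id)
        # Per the discussion: state writes do not apply to deleting groups,
        # so they look the same as a missing group to the caller.
        if group is None or group["status"] == "DELETING":
            raise NoSuchScalingGroup(group_id)
        group.update(changes)


def record_converge_result(store, group_id, result):
    """Try to persist a converge result; return False if the group is
    missing or deleting instead of letting the error escape."""
    try:
        store.modify_state(group_id, last_result=result)
        return True
    except NoSuchScalingGroup:
        return False
```

Whether swallowing the error like this is the right call is exactly the open question here; the sketch only shows where the error comes from.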

Discussion from slack:

We should handle all the CLB deleted errors that we know about.
If there are other errors, like CLB going into ERROR, we just need to clean up what we can and maybe log to cloudfeeds that we couldn't clean up some resources. It's OK if they're then orphaned. We should delete the group anyway.
We can't fix this by simply deleting the group whether or not convergence failed, because we have to clean up what resources we can. For instance, if CLB goes into ERROR, we still need to delete the server. If we just delete the group when convergence fails, that server would be left orphaned without convergence ever trying to delete it.

Right now, since draining is not enabled, we issue the CLB remove-node commands and the delete-server commands simultaneously. Once we enable draining, this may not be the case.
@radix points out that if we show a group as deleting to the user, we can maybe get around this by letting the user decide how to handle the failure.
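The "clean up what we can, then delete the group anyway" idea from the discussion could look roughly like the sketch below. It is only an illustration under assumptions: `CLBDeletedError`, `CLBErrorState`, `CleanupReport`, and `cleanup_deleting_group` are hypothetical names, and the steps run sequentially here, whereas the real converger issues its commands together.

```python
from dataclasses import dataclass, field


class CLBDeletedError(Exception):
    """Hypothetical: the load balancer is already gone, so nothing to remove."""


class CLBErrorState(Exception):
    """Hypothetical: the load balancer is in ERROR and cannot be modified."""


@dataclass
class CleanupReport:
    completed: list = field(default_factory=list)
    orphaned: list = field(default_factory=list)  # (step name, error repr)


def cleanup_deleting_group(steps, known_ok=(CLBDeletedError,)):
    """Run every cleanup step even if earlier ones fail.

    `steps` is a list of (name, zero-argument callable) pairs. Errors in
    `known_ok` count as success; any other error marks that resource as
    orphaned without stopping the remaining steps.
    """
    report = CleanupReport()
    for name, action in steps:
        try:
            action()
        except known_ok:
            report.completed.append(name)
        except Exception as exc:
            report.orphaned.append((name, repr(exc)))
        else:
            report.completed.append(name)
    return report
```

The caller would then delete the group regardless of `report.orphaned`, and report those entries somewhere visible (cloudfeeds, per the discussion) so the user knows what was left behind.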


What do we do when convergence fails when cleaning up a deleting group? #1553