hseeberger / constructr

Coordinated (etcd, ...) cluster construction for dynamic (cloud, containers) environments
Apache License 2.0

Get rid of retry limit for communication with coordination service #138

Closed. hseeberger closed this issue 6 years ago

hseeberger commented 7 years ago

Currently, ConstructrMachine stops itself, and hence the actor system terminates, if the number of retries for communicating with the coordination service is exceeded in any state. While this might seem plausible for states like GettingNodes or AddingSelf, it is pretty drastic for Refreshing, because it means an otherwise fully functional cluster member gets terminated just because it cannot talk to the coordination service. Looking at the Akka Cluster seed node process, which by default retries joining endlessly, it becomes clear that even for the initial states up to AddingSelf, termination might be too hard a decision.
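For illustration only, here is a minimal, dependency-free Scala sketch of the two policies being weighed here: a bounded retry that gives up (and thus terminates the node) versus an unbounded retry. This is not ConstructrMachine's actual code; `getNodes` and `refreshTtl` in the usage comments are placeholders.

```scala
import scala.annotation.tailrec

// Sketch of the decision under discussion, not ConstructrMachine's real implementation.
// maxRetries = None stands for "retry forever".
def withRetries[A](maxRetries: Option[Int])(attempt: () => Either[String, A]): Either[String, A] = {
  @tailrec
  def loop(retriesUsed: Int): Either[String, A] =
    attempt() match {
      case Right(a)                                      => Right(a)
      case Left(_) if maxRetries.forall(retriesUsed < _) => loop(retriesUsed + 1) // try again
      case Left(error)                                   => Left(error)           // give up => node terminates
    }
  loop(0)
}

// withRetries(Some(3))(getNodes)   // bounded, e.g. for GettingNodes / AddingSelf
// withRetries(None)(refreshTtl)    // unbounded, e.g. for Refreshing
```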

Please let me know what you think!

raboof commented 7 years ago

For the majority of use cases this seems useful.

It does introduce the possibility of a split-brain situation: if a running cluster somehow loses connectivity to the coordination service, the TTL expires, and a new node that can reach the coordination service is started, then that new node will form a new cluster, right?

So for use cases where a split brain is catastrophic (worse than unavailability), it'd make sense to shut down when the node's entry in the coordination service cannot be refreshed before the TTL expires.

Retrying endlessly by default seems reasonable for most use cases, but allowing a maximum number of retries does seem useful.

hseeberger commented 7 years ago

Good point. I have to chew on this for a while ...

juanjovazquez commented 7 years ago

I basically agree with @raboof. We have several use cases where the more determinism the better, and a split-brain scenario would be a serious problem. I'd stick to the retry limit.

beatkyo commented 7 years ago

I think a good improvement would be to let the node react before leaving/shutting down, thus allowing a clean shutdown.

I am using sharding, and currently it does not seem possible to initiate a GracefulShutdown when this happens, which results in loss of state and unavailable shard regions.
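As one possible shape for such a hook (hypothetical, ConstructR does not currently expose it), the node could run Akka's CoordinatedShutdown instead of terminating the ActorSystem directly, which would let Cluster Sharding hand off its shard regions first; this assumes a recent Akka 2.5.

```scala
import akka.Done
import akka.actor.{ ActorSystem, CoordinatedShutdown }
import scala.concurrent.Future

// Hypothetical hook: instead of terminating the ActorSystem outright when the
// retry limit is hit, run CoordinatedShutdown so that shard regions are handed
// off and leaving the cluster happens cleanly.
def onCoordinationGivenUp(system: ActorSystem): Future[Done] =
  CoordinatedShutdown(system).run(CoordinatedShutdown.UnknownReason)
```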

hseeberger commented 7 years ago

I'm thinking about adding a barrier against multiple bootstrapping, i.e. allowing the first node to join itself and form the cluster only once. That would make the feature requested here possible.
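To make the idea concrete, here is a hypothetical sketch of such a barrier, assuming the coordination service offers an atomic create-if-absent operation (as etcd does); the `Coordination.createIfAbsent` method and the key name are made up for illustration and are not part of ConstructR's API.

```scala
import scala.concurrent.Future

// Hypothetical bootstrap barrier: before a node joins itself, it must win an
// atomic create-if-absent on a marker key in the coordination service, so the
// cluster can only ever be formed once.
trait Coordination {
  def createIfAbsent(key: String, value: String): Future[Boolean] // true if this call created the key
}

def mayBootstrap(coordination: Coordination, selfAddress: String): Future[Boolean] =
  // Only the node that created the marker may join itself; all others must
  // keep retrying GettingNodes until existing members show up.
  coordination.createIfAbsent("constructr/bootstrapped", selfAddress)
```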

RomanIakovlev commented 7 years ago

In principle, the coordination service is only needed for new nodes to join. If this operation (new nodes joining) is automated and not monitored, then yes, there's a risk of a split-brain situation. However, in many setups this is not the case, and making an otherwise healthy node quit on a Refreshing failure makes the coordination service a single point of failure. At a minimum I'd introduce a different retry policy for each ConstructrMachine state. Then one could use a higher retry count for the Refreshing state, effectively emulating infinite retries in setups where this is desired.
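A sketch of what such a per-state retry policy could look like; the state names mirror the ConstructrMachine states mentioned in this thread, but the per-state limits themselves are hypothetical and not existing configuration.

```scala
// Hypothetical per-state retry limits, as suggested above.
// None means "retry forever"; the numbers are illustrative only.
val nrOfRetriesPerState: Map[String, Option[Int]] = Map(
  "GettingNodes" -> Some(3),
  "AddingSelf"   -> Some(3),
  "Refreshing"   -> None // an already-joined member never terminates itself
)
```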

hseeberger commented 7 years ago

@RomanIakovlev the most urgent use case for the coordination service is cluster formation, i.e. deciding that the first node joins itself: the coordination service is used to ensure that there can only be one such node.

RomanIakovlev commented 7 years ago

@hseeberger right, I wanted to emphasize that I share the concern you've expressed in the original post of this issue: if existing cluster members lose connectivity to the coordination service, they should not go down.

calvinlfer commented 7 years ago

It would be great to make this a configurable option so you can choose whether or not you want your application to be tied to the availability of the coordination service after bootstrapping.

lutzh commented 7 years ago

I agree that a healthy cluster should not shut itself down if etcd becomes unavailable, so a +1 for not shutting down if Refreshing times out.

A case was described where a split brain could occur: the existing nodes can no longer access etcd, but a newly added node can. @hseeberger mentions putting in a barrier against multiple bootstrapping; with this, even in that case (as unlikely as it may seem in the first place) no split brain would occur.

calvinlfer commented 6 years ago

@hseeberger not sure if you had time to look at the PR that @hrenovcik did, but I think it's a good solution: it gives the user the ability to either keep retrying to reconnect infinitely, so that a fully functioning cluster is not taken down just because the seed-discovery mechanism went down, or to go with a finite retry count and take the cluster down when the seed-discovery mechanism goes down.

hseeberger commented 6 years ago

Ah, missed that PR. Thanks for bringing it to my attention. I have merged it.

calvinlfer commented 6 years ago

Hello @hseeberger, thanks for merging the PR. Any idea when you are planning to do a release?

hseeberger commented 6 years ago

Should I? ;-)

hseeberger commented 6 years ago

Done: released 0.18.0