Closed hseeberger closed 6 years ago
For the majority of use cases this seems useful.
It does introduce the possibility of a split-brain situation: if a running cluster somehow loses connectivity to the coordination service, the TTL expires, and a new node that can reach the coordination service is then started, this new node will form a second cluster, right?
So for use cases where a split brain is catastrophic (worse than unavailability), it would make sense to shut down when the coordination service cannot be refreshed before the TTL expires.
Retrying endlessly by default seems reasonable for most use cases, but allowing a maximum number of retries does seem useful.
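For illustration, a bounded retry policy could be surfaced in configuration; a sketch in HOCON (the key names here are illustrative, not actual ConstructR settings):

```hocon
constructr {
  retry-delay = 3 seconds  # pause between attempts to reach the coordination service

  # Illustrative: give up (and shut down) after this many consecutive failures.
  max-nr-of-retries = 10

  # Illustrative convention for "retry forever", for setups where
  # unavailability is preferable to a potential split brain:
  # max-nr-of-retries = -1
}
```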
Good point. I have to chew on this for a while ...
Basically agree with @raboof. We have several use cases where the more determinism the better, and a split-brain scenario would be a serious problem. I'd stick to a retry limit.
I think a good improvement would be to let the node react before leave/shutdown, thus allowing a clean shutdown.
I am using sharding, and currently it does not seem possible to initiate a graceful shutdown when this happens, resulting in loss of state and unavailable shard regions.
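For reference, one way to approximate a clean shutdown with sharding is to hand off shards before terminating, using Akka's built-in `ShardRegion.GracefulShutdown` message. A rough sketch (the actor name and wiring are illustrative, not ConstructR API):

```scala
import akka.actor.{ Actor, ActorRef, Props, Terminated }
import akka.cluster.Cluster
import akka.cluster.sharding.ShardRegion

// Watches a shard region, asks it to hand off its shards, and only then
// leaves the cluster and terminates the actor system.
class GracefulShutdownGuard(region: ActorRef) extends Actor {
  context.watch(region)
  region ! ShardRegion.GracefulShutdown // built-in hand-off message

  def receive: Receive = {
    case Terminated(`region`) => // all shards handed off, region stopped
      val cluster = Cluster(context.system)
      cluster.leave(cluster.selfAddress)
      context.system.terminate()
  }
}

// Usage (illustrative): trigger this instead of a hard stop when the
// coordination service cannot be refreshed:
// system.actorOf(Props(new GracefulShutdownGuard(shardRegion)))
```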
I'm thinking about adding a barrier against multiple bootstrapping, i.e. allow the first node to join itself and form the cluster only once. This would make the feature discussed here possible.
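Such a barrier could rest on an atomic "create if absent" operation in the coordination service (etcd's v2 API offers this via `prevExist=false`). A minimal in-memory Scala sketch of the semantics, with the actual coordination wiring left out:

```scala
import java.util.concurrent.atomic.AtomicReference

// Emulates a coordination-service key with atomic create-if-absent semantics,
// as offered e.g. by etcd (PUT with prevExist=false).
final class BootstrapBarrier {
  private val holder = new AtomicReference[Option[String]](None)

  // Returns true only for the first caller; later callers must join
  // existing nodes instead of forming a new cluster.
  def tryBootstrap(nodeId: String): Boolean =
    holder.compareAndSet(None, Some(nodeId))
}

object BootstrapBarrierDemo extends App {
  val barrier = new BootstrapBarrier
  println(barrier.tryBootstrap("node-1")) // true: first node forms the cluster
  println(barrier.tryBootstrap("node-2")) // false: must join, never bootstrap
}
```

In a real deployment the marker key would have to be persistent (no TTL), so that a node which can still reach the coordination service after the others lost it can never bootstrap a second cluster.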
In principle, the coordination service is only needed for new nodes to join. If this operation (new nodes joining) is automated and not monitored, then yes, there's a risk of a split-brain situation. However, in many setups this is not the case, and making an otherwise healthy node quit on `Refresh` failure makes the utility coordination service a single point of failure.
At a minimum I'd introduce a different retry policy for the different `ConstructrMachine` states. Then one can use a higher retry count for the `Refreshing` state, effectively emulating infinite retry for setups where this is desired.
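Concretely, a per-state retry policy could look like this in HOCON (key names are hypothetical, for illustration only):

```hocon
constructr {
  # States on the join path (GettingNodes, AddingSelf, ...): failing fast
  # here is acceptable, the node has not become a useful member yet.
  nr-of-retries = 3

  # Effectively infinite retries for Refreshing, so that a healthy member
  # is not terminated just because the coordination service is down.
  nr-of-retries-refreshing = 100000
}
```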
@RomanIakovlev the most urgent use case for the coordination service is cluster formation, i.e. deciding that the first node joins itself: the coordination service is used to ensure that there can only be one.
@hseeberger right, I wanted to emphasize that I share the concern you've expressed in the original post of this issue. If existing cluster members lose connectivity to the coordination service, they should not go down.
It would be great to make this a configurable option so you can choose whether or not you want your application to be tied to the availability of the coordination service after bootstrapping.
I agree that a healthy cluster should not shut itself down if etcd becomes unavailable, so a +1 for not shutting down if `Refreshing` times out.
A case was described where a split brain could occur: the existing nodes can no longer access etcd, but a newly added node can. @hseeberger mentions putting in a barrier against multiple bootstrapping; with that, even in this case (as unlikely as it may seem in the first place), no split brain would occur.
@hseeberger not sure if you had time to look at the PR that @hrenovcik did, but I think it's a good solution: it lets the user decide between retrying to reconnect infinitely, so that a fully functioning cluster is not taken down just because the seed-discovery mechanism went down, and a finite retry count, which does take the cluster down when the seed-discovery mechanism goes down.
Ah, missed that PR. Thanks for bringing it to my attention. I have merged it.
hello @hseeberger, thanks for merging the PR, any idea on when you are planning to do a release?
Should I? ;-)
Done: released 0.18.0
Currently `ConstructrMachine` will stop, and hence the system terminate, if in any state the number of retries for communicating with the coordination service is exceeded. While this might seem plausible for states like `GettingNodes` or `AddingSelf`, it is pretty drastic for `Refreshing`, because it means that an otherwise fully functional cluster member gets terminated just because it can't talk to the coordination service. Looking at the Akka Cluster seed node process, which by default endlessly retries joining, it becomes clear that even for the initial states up to `AddingSelf`, termination might be too hard a decision.

Please let me know what you are thinking!