Open abelmannu opened 3 years ago
@abelmannu I could not reproduce the issue. Could you write down the exact steps to reproduce the Split Brain issue you encountered? Preferably with our official Helm Chart.
Just got into exactly this situation while running an unrelated unit test in which a standalone node joins another standalone node. Found out that in `TcpIpJoiner` (around line 504):

```java
SplitBrainMergeCheckResult result = sendSplitBrainJoinMessageAndCheckResponse(address, request);
if (result == SplitBrainMergeCheckResult.LOCAL_NODE_SHOULD_MERGE) {
```

`result` is always `SplitBrainMergeCheckResult.REMOTE_NODE_SHOULD_MERGE`, which is not handled by either node.
The split-brain scenario is simulated by changing `networkConfig`. Changing only the first node's config does not work, and changing both nodes' configs does not work either. Resolved by changing only the second node's `networkConfig`:

```java
instance.getConfig().getNetworkConfig().getJoin().getTcpIpConfig().addMember("127.0.0.1:6702").addMember("127.0.0.1:6712");
```
Strangely, the related unit test `com.hazelcast.cluster.SplitBrainHandlerTest` works well.
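To make the scenario above reproducible, here is a minimal sketch of the setup as I understand it: two embedded standalone members that each initially know only about themselves, with the reported workaround applied to the second node only. This is an illustrative config fragment, not a tested reproducer; the class name and the structure of `standalone(...)` are my own, and the ports `6702`/`6712` are taken from the snippet above.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class SplitBrainSketch {

    // Builds a standalone member whose TCP/IP member list contains only itself,
    // simulating the split-brain starting point described above.
    static HazelcastInstance standalone(int port) {
        Config config = new Config();
        config.getNetworkConfig().setPort(port).setPortAutoIncrement(false);
        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false);
        join.getTcpIpConfig().setEnabled(true).addMember("127.0.0.1:" + port);
        return Hazelcast.newHazelcastInstance(config);
    }

    public static void main(String[] args) {
        HazelcastInstance first = standalone(6702);
        HazelcastInstance second = standalone(6712);

        // Reported workaround: point ONLY the second node's TcpIpConfig at both
        // members, so that node's joiner discovers the other cluster and merges.
        second.getConfig().getNetworkConfig().getJoin().getTcpIpConfig()
              .addMember("127.0.0.1:6702")
              .addMember("127.0.0.1:6712");
    }
}
```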
HZ 5.1.5 with 2 nodes in an embedded JVM, in a Kubernetes environment, without activating CP safe mode. We have exactly the same issue: it occurs sometimes after pod restarts; the nodes are connected but each waits for the other, and neither joins the cluster.
```
24 00:41:32,159 [hz.laughing_ramanujan.priority-generic-operation.thread-0] INFO c.h.i.c.i.ClusterJoinManager - [fdfb:85ef:26ff:b454:c1:d7a7:37a7:3633]:5701 [aaa] [5.1.5] [fdfb:85ef:26ff:d7e6:e1cd:93ee:c50d:5a32]:5701 should merge to us, both have the same data member count: 1
24 00:55:46,842 [hz.fervent_moore.priority-generic-operation.thread-0] INFO c.h.i.c.i.ClusterJoinManager - [fdfb:85ef:26ff:d7e6:e1cd:93ee:c50d:5a32]:5701 [aaa] [5.1.5] We should merge to [fdfb:85ef:26ff:b454:c1:d7a7:37a7:3633]:5701, both have the same data member count: 1
```
We are using Hazelcast 4.2.1 in a Kubernetes environment with openjdk:14-jdk-slim images. In our dev environment, where we have only two nodes, these two nodes sometimes (roughly after every 5th deployment) end up in a split-brain condition and do not merge, although they find each other and agree on what to do:

The joiner of the first node says the second node should join it, and the joiner of the second node says it should join the first node. But nothing happens. The log repeats every couple of minutes and the clusters are not merged.

It does not matter whether we use a merge policy or not. More often than not it works without any problems.
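For reference, a merge policy in Hazelcast 4.x is configured per data structure and only governs how conflicting entries are reconciled *after* the clusters have decided to merge; it does not influence whether the merge happens at all, which is consistent with the observation above. A minimal config sketch (the `"default"` map pattern, the `PutIfAbsentMergePolicy` choice, and the batch size are illustrative, not from this report):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MergePolicyConfig;
import com.hazelcast.spi.merge.PutIfAbsentMergePolicy;

public class MergePolicySketch {
    public static void main(String[] args) {
        Config config = new Config();
        // Applies to every map matching the "default" wildcard pattern.
        // The policy decides, entry by entry, which value survives a merge.
        config.getMapConfig("default").setMergePolicyConfig(
                new MergePolicyConfig()
                        .setPolicy(PutIfAbsentMergePolicy.class.getName())
                        .setBatchSize(100));
    }
}
```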
Log of first node:
Log of second node: