hazelcast / hazelcast

Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.
https://www.hazelcast.com

Node does not start when existing cluster node(s) is/are unresponsive #17115

Open JChrist opened 4 years ago

JChrist commented 4 years ago

Describe the bug
When trying to start a new Hazelcast node while the existing cluster node(s) are unresponsive (e.g. sending absolutely no responses whatsoever), the new node fails to start instead of claiming mastership and starting.

Expected behavior
The new node should succeed in claiming mastership and starting.

To Reproduce

Steps to reproduce the behavior:

  1. Initiate a simulation of an unresponsive cluster. The simplest way possible is to have something listen on Hazelcast's port without ever responding, such as nc -l 0.0.0.0 5701 (thank you @vbekiaris !!!)
  2. Try to start a new Hazelcast instance using only TCP/IP join config, including the unresponsive member as a possible member. A very simple programmatic config achieves this:
    Config config = new Config();
    config.setClusterName("test");
    config.getNetworkConfig().getInterfaces().setEnabled(false).clear();
    var jc = config.getNetworkConfig().getJoin();
    jc.getMulticastConfig().setEnabled(false);
    jc.getTcpIpConfig().setEnabled(true).setMembers(Lists.newArrayList("127.0.0.1")).setConnectionTimeoutSeconds(1);
  3. The new node will eventually print out that it failed to start.
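The unresponsive listener from step 1 can also be simulated in plain Java instead of nc; the sketch below (class and method names are mine, not Hazelcast's) accepts TCP connections but never reads from or writes to them, which is exactly the connected-but-silent peer the joining node gets stuck on:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for `nc -l 0.0.0.0 5701`: accepts connections on the given
// port and then stays completely silent. Accepted sockets are held open (and
// deliberately never closed or serviced) until close() is called.
public class UnresponsiveNode {
    private final ServerSocket server;
    private final Thread acceptor;
    private final List<Socket> held = new ArrayList<>();

    public UnresponsiveNode(int port) throws IOException {
        server = new ServerSocket(port);
        acceptor = new Thread(() -> {
            try {
                while (true) {
                    // Accept and keep the socket, but never touch its streams.
                    held.add(server.accept());
                }
            } catch (IOException ignored) {
                // server socket closed -> shut down the acceptor
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();
    }

    public int port() {
        return server.getLocalPort();
    }

    public void close() throws IOException {
        server.close();
    }
}
```

For the actual reproduction you would pass 5701 (Hazelcast's default port); passing 0 picks a free ephemeral port, which is handy in tests.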

Additional context
This is reproducible with both 4.0.1 and 3.12.7. I tried to get to the bottom of this, but I'm missing some parts. The most I managed to find out is that in class TcpIpJoiner (line 209), inside method claimMastership(Collection<Address> possibleAddresses), the operation fails because TcpIpEndpointManager always returns a null TcpIpConnection from the public TcpIpConnection getOrConnect(final Address address, final boolean silent) method (line 193): the connection is eternally considered in-progress (it stays inside the connectionsInProgress set). Also, the new node never declares itself the new master through TcpIpJoiner's clusterJoinManager.setThisMemberAsMaster() (line 151), because it never blacklists the misbehaving node; blacklisting only happens when the socket is not connected.

JChrist commented 4 years ago

I managed to write a (failing) test for this case (a ServerSocket on 5701 is enough). The test is assumed to live inside TcpIpJoinTest:

@Test
public void test_whenNotRespondingNodeExists() throws Exception {
    Config config = new Config();
    config.setProperty(ClusterProperty.WAIT_SECONDS_BEFORE_JOIN.getName(), "0");
    config.setProperty(ClusterProperty.MAX_JOIN_SECONDS.getName(), "3");
    config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(false);
    config.getNetworkConfig().getInterfaces().setEnabled(false).clear();
    config.getNetworkConfig().getJoin().getTcpIpConfig().setEnabled(true)
            .setMembers(Collections.singletonList("127.0.0.1")).setConnectionTimeoutSeconds(1);

    // start listening on 127.0.0.1:5701 without ever responding to any packet
    final ServerSocket ss = new ServerSocket(5701);
    final AtomicBoolean stop = new AtomicBoolean(false);
    final Thread socketWatcher = new Thread(() -> {
        while (!stop.get()) {
            try {
                ss.accept();
            } catch (Exception ignored) {
            }
        }
    });
    socketWatcher.start();

    try {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        assertClusterSize(1, hz);
    } finally {
        stop.set(true);
        socketWatcher.interrupt();
        IOUtil.close(ss);
    }
}

JChrist commented 4 years ago

I seem to have fixed this issue by adding an else block in TcpServerConnectionManager:204 similar to the following. Essentially, when a connection that is marked as in-progress is requested, TcpServerConnectionManager now checks whether it has actually received any bytes recently:

else {
  connections.stream().filter(c -> c.getRemoteAddress().equals(address)).findFirst().ifPresent(c -> {
    final long maxReadWaitTimeMillis = 1_000L; // TODO: somehow make it configurable?
    if (Clock.currentTimeMillis() - c.lastReadTimeMillis() > maxReadWaitTimeMillis) {
      c.close("nothing received for the last: " + maxReadWaitTimeMillis + "ms", null);
      //this will not be called when closing the connection due to null cause
      //but null cause is needed to avoid reconnection mechanism kicking-in
      serverContext.onFailedConnection(address);
    }
  });
}
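The idea behind that patch can be sketched outside Hazelcast as well. The class below is a plain-Java illustration (all names are mine, not Hazelcast's API): a connection records when it last received bytes, and a staleness check closes it once nothing has arrived within the limit, which is what would allow the joiner to blacklist the silent peer:

```java
import java.util.concurrent.atomic.AtomicLong;

// Plain-Java sketch of the staleness check proposed above, not Hazelcast code.
public class IdleConnectionCheck {

    static class TrackedConnection {
        private final AtomicLong lastReadTimeMillis = new AtomicLong(System.currentTimeMillis());
        private volatile boolean open = true;

        // Called by the read path whenever bytes arrive from the peer.
        void onBytesReceived() {
            lastReadTimeMillis.set(System.currentTimeMillis());
        }

        long lastReadTimeMillis() {
            return lastReadTimeMillis.get();
        }

        boolean isOpen() {
            return open;
        }

        void close(String reason) {
            open = false;
        }
    }

    // Closes the connection and returns true when nothing has been received
    // within maxReadWaitMillis; otherwise leaves it alone and returns false.
    // Time is passed in explicitly so the check is easy to test.
    static boolean closeIfStale(TrackedConnection c, long maxReadWaitMillis, long nowMillis) {
        if (nowMillis - c.lastReadTimeMillis() > maxReadWaitMillis) {
            c.close("nothing received for the last " + maxReadWaitMillis + "ms");
            return true;
        }
        return false;
    }
}
```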
mmedenjak commented 4 years ago

Hi @JChrist ! Without going too much into the details here, I can say that historically we had similar doubts about whether we should fail fast when something went wrong during the join process or whether the instance should start standalone. In some previous versions an instance would start standalone, while after this change it started failing to start, and we decided that's OK. In some cases it does make sense for an instance to start standalone, but in others it might be confusing: users might inadvertently have a typo in the member list and have instances start standalone instead of forming a cluster.

So, I've added this to the current milestone to take a closer look, but as it stands now I'd say this is intended behaviour and we prefer fail-fast.

JChrist commented 4 years ago

There should at least be some kind of configuration option to control this behavior, something like the maximum number of milliseconds to wait for a join response from a cluster member before blacklisting it.
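Such a knob could be surfaced like other tuning properties. The sketch below only illustrates the shape of the suggestion; the property name "hazelcast.tcp.join.max.response.wait.millis" is hypothetical and does not exist in Hazelcast:

```java
// Hypothetical configuration knob for the suggestion above. The property name
// is invented for illustration; Hazelcast has no such property today.
public class JoinTimeoutProperty {
    static final String PROP = "hazelcast.tcp.join.max.response.wait.millis";
    static final long DEFAULT_MILLIS = 1_000L;

    // Reads the maximum time to wait for a join response from a possible
    // member before blacklisting it, falling back to a default.
    static long maxJoinResponseWaitMillis() {
        String value = System.getProperty(PROP);
        if (value == null) {
            return DEFAULT_MILLIS;
        }
        return Long.parseLong(value);
    }
}
```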

shartte commented 3 years ago

We are having similar issues when tunnelling Hazelcast through stunnel. A remote "connection refused" becomes "connection reset by peer", which is not forwarded to the manager, leading to the peer never being blacklisted (preventing server startup).

This issue is going a bit further by also detecting nodes that just don't send anything...

On top of that, why is the exception not being passed to the life-cycle method here: https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/java/com/hazelcast/internal/server/tcp/TcpServerConnection.java#L250

In fact, I haven't found a call to that lifecycle method that passes anything but null as the exception...

giuseppCl commented 1 year ago

> We are having similar issues when tunnelling Hazelcast through stunnel. A remote "connection refused" becomes "connection reset by peer", which is not forwarded to the manager, leading to the peer never being blacklisted (preventing server startup).
>
> This issue is going a bit further by also detecting nodes that just don't send anything...
>
> On top of that, why is the exception not being passed to the life-cycle method here: https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/java/com/hazelcast/internal/server/tcp/TcpServerConnection.java#L250
>
> In fact, I haven't found a call to that lifecycle method that passes anything but null as the exception...

Hi, have you found a workaround for this? We have the same problem (Hazelcast+stunnel). However, it is difficult for us to intervene in the implementation as we use a ready-made software solution that itself uses Hazelcast.

shartte commented 1 year ago

> Hi, have you found a workaround for this? We have the same problem (Hazelcast+stunnel). However, it is difficult for us to intervene in the implementation as we use a ready-made software solution that itself uses Hazelcast.

No. We chose to migrate off of Hazelcast instead.

wrmay commented 1 year ago

I tested this with 5.2 just now and I'm still seeing the same behavior. It happens when something accepts the socket connection but does not respond; if nothing is listening on the socket at all, the joining node of course doesn't wait. There is an ICMP-based failure detector that may work for you: https://docs.hazelcast.com/hazelcast/5.2/clusters/ping-detector. Since it does not rely on sockets, it at least won't have this exact problem.
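For reference, enabling the ICMP ping failure detector is a declarative-config change; the fragment below is a sketch based on the linked docs page, and the exact keys and values should be verified against the documentation for your Hazelcast version:

```yaml
# Sketch of enabling the ICMP ping failure detector (verify keys against the
# Hazelcast 5.2 docs); requires appropriate OS permissions for raw ICMP.
hazelcast:
  network:
    failure-detector:
      icmp:
        enabled: true
        timeout-milliseconds: 1000
        max-attempts: 3
```

Note that this detects unreachable hosts at the network layer, so it would not help with the accepted-but-silent TCP case described in this issue.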