nats-io / nats.java

Java client for NATS
Apache License 2.0
563 stars 153 forks source link

Connection failures during NATS servers rolling upgrade #1136

Closed ajax-surovskyi-y closed 1 month ago

ajax-surovskyi-y commented 3 months ago

Observed behavior

During the rolling upgrade process of our NATS servers, which consists of three nodes, several of our services were unable to reconnect to the NATS server. This issue occurred unexpectedly and affected the normal operation of the services. Investigation revealed several errors common across all services(java.net.SocketException: Connection reset , java.net.SocketException: Broken pipe, java.net.ConnectException: Connection refused), including those that successfully reconnected. However, a specific exception, java.util.concurrent.CancellationException, was found only on the nodes that failed to re-establish the connection.

The following exception was identified as unique to the failing nodes and might be directly related to the reconnection failures:

java.util.concurrent.CancellationException: null
    at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2510)
    at io.nats.client.impl.NatsConnection.cleanUpPongQueue(NatsConnection.java:721)
    at io.nats.client.impl.NatsConnection.closeSocketImpl(NatsConnection.java:702)
    at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:566)
    at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:540)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.util.concurrent.FutureTask.run(FutureTask.java:317)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1583)

Upon reviewing the source code of the library implicated in the error, I identified the method referenced by the stack trace:

void cleanUpPongQueue() {
    Future<Boolean> b;
    while ((b = pongQueue.poll()) != null) {
        try {
            b.cancel(true);
        } catch (CancellationException e) {
            if (!b.isDone() && !b.isCancelled()) {
                processException(e);
            }
        }
    }
}

It appears unnecessary to wrap the b.cancel(true) call within a try-catch block for CancellationException, as this exception does not occur during the cancel call but rather during get or join operations on a cancelled CompetableFuture. Further analysis of the code revealed that the CancellationException may not clearly indicate the actual failure point, as it tends to signal where the Future was cancelled, not where it was actively thrown. I identified only the one method where the get call on the CompetableFuture is used:

void tryToConnect(String serverURI, long now) {
    this.currentServer = null;
    try {
    ...
      Future<Boolean> pongFuture = sendPing();

      if (pongFuture != null) {
          pongFuture.get(timeoutNanos, TimeUnit.NANOSECONDS);
      }
    ...
    } catch (RuntimeException exp) { // runtime exceptions, like illegalArgs
        processException(exp);
        throw exp;
    } catch (Exception exp) { // everything else
        processException(exp);
        try {
            this.closeSocket(false);
        } catch (InterruptedException e) {
            processException(e);
        }
    }
    ...
}

In this scenario, if a CancellationException arises from the pongFuture.get call, it would typically be caught by the RuntimeException catch block. Notably, there is no specific handling for CancellationException, which is critical as this exception can disrupt the normal reconnect logic. If the reconnect logic is interrupted without proper exception handling, it may cause the reconnection process to halt completely, thus failing to re-establish connectivity.

It's a contentious decision to perform a rethrow if it is not handled higher up in the call stack. If there are no objections, I would make a contribution where I would remove this catch block.

Expected behavior

Automatically reconnect to the NATS server without any issues during the rolling upgrade.

Server and client version

nats-server 2.10.14 nats 2.15.3

Host environment

jvm amazon corretto 21.0.2

Steps to reproduce