nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

NATS Push Consumer Client Stuck after NATS JetStream Node Failure #4624

Closed stefanLeo closed 9 months ago

stefanLeo commented 11 months ago

What version were you using?

Server: 2.9.22 & 2.10
Java Client: 2.16.14 & 2.17.0

What environment was the server running in?

We used the official NATS container images and the HELM charts for deployment.

Is this defect reproducible?

  1. Set up a 3-node cluster on Red Hat OpenShift or any other Kubernetes cluster.
  2. Start the producer with the settings described in https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPub.java
  3. Start the consumer with the settings described in https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
  4. Kill the JetStream leader node (find the stream leader using the nats CLI).

Note that the consumer may not even be directly connected to the JetStream leader node.

Given the capability you are leveraging, describe your expectation?

The Consuming client continues consuming messages after a leader has been elected.

Given the expectation, what is the defect you are observing?

We re-used the NATS JetStream Producer and PUSH Consumer examples from https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java and then killed the NATS leader node of the stream (we forcefully killed the VM hosting the Kube Worker Node).

Setup: In Memory Storage Option, 3 Replicas.

Config of createStream was changed to:

    StreamConfiguration sc = StreamConfiguration.builder()
        .name(streamName)
        .storageType(storageType)
        .subjects(subjects)
        .replicas(3)
        .description("LifeX-Test")
        .build();

and the connection builder to:

    Options.Builder builder = new Options.Builder()
        .server(servers)
        .connectionTimeout(Duration.ofMillis(500))
        .pingInterval(Duration.ofSeconds(3))
        .maxPingsOut(2)
        .reconnectWait(Duration.ofMillis(500))
        .connectionListener(EXAMPLE_CONNECTION_LISTENER)
        .traceConnection()
        .errorListener(EXAMPLE_ERROR_LISTENER)
        .maxReconnects(-1)
        .reconnectDelayHandler(new PsReconnectDelayHandler())
        .reconnectJitter(Duration.ofMillis(500));

When connecting, we configure all 3 servers of the cluster and register connection, error and delay handlers (basically just logging the callbacks).
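For readers unfamiliar with the reconnect settings above, here is a minimal JDK-only sketch of how a reconnect delay with a capped backoff and random jitter could be computed. `ReconnectDelaySketch` and its base, cap, and jitter values are hypothetical; this is not the `PsReconnectDelayHandler` from the report, whose code is not shown, and it is not how the client library computes its own delays.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: capped exponential backoff plus random jitter. */
final class ReconnectDelaySketch {
    private final Duration base;
    private final Duration cap;
    private final Duration maxJitter;

    ReconnectDelaySketch(Duration base, Duration cap, Duration maxJitter) {
        this.base = base;
        this.cap = cap;
        this.maxJitter = maxJitter;
    }

    /** Deterministic part: base * 2^(tries - 1), never above the cap. */
    Duration backoff(long totalTries) {
        // Bound the shift so the multiplication cannot overflow for small bases.
        long shift = Math.max(0, Math.min(totalTries - 1, 20));
        long millis = base.toMillis() << shift;
        return Duration.ofMillis(Math.min(millis, cap.toMillis()));
    }

    /** Full wait: backoff plus a random jitter in [0, maxJitter]. */
    Duration waitTime(long totalTries) {
        long jitter = ThreadLocalRandom.current().nextLong(maxJitter.toMillis() + 1);
        return backoff(totalTries).plusMillis(jitter);
    }

    public static void main(String[] args) {
        ReconnectDelaySketch delay = new ReconnectDelaySketch(
                Duration.ofMillis(500), Duration.ofSeconds(8), Duration.ofMillis(500));
        for (long tries = 1; tries <= 6; tries++) {
            System.out.println("try " + tries + " -> " + delay.waitTime(tries).toMillis() + " ms");
        }
    }
}
```

To my understanding, the Java client's `ReconnectDelayHandler` interface takes the total number of tries and returns a `Duration`, so a helper like this could be wrapped in such a handler; treat that mapping as an assumption and check the client documentation.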

Setup Environment: NATS Cluster with 3 NATS Pods on top of RHAT Openshift Kubernetes cluster.

After the failure of the NATS master node the following happens:

  1. The producer detects the failure, reconnects and continues sending after ~15 seconds.
  2. The consumer DOES NOT detect the failure and is stuck. It does not log any error or disconnection info, and the message handler receives no new messages.
  3. If we restart the consumer, it can consume ALL messages sent by the producer, including the ones sent after the producer reconnected.
  4. The consumer aborts once the producer is done and deletes the stream. Only then is a disconnect log printed.

Logs of the NATS nodes are attached. I cannot really add logs of the Java client, as there are none: it seems to just remain stuck indefinitely.

UPDATE: Added Java client logs with the traceConnection setting, and now we see more details. The client seems to reconnect and resubscribe, but still does NOT get any further messages pushed.
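Because the stuck consumer logs nothing, an application-side check can help make the stall visible. The following is a minimal JDK-only sketch, not part of the NATS client: `StallWatchdog` and its threshold are hypothetical, and the caller is assumed to invoke `onMessage` from its own message handler and to poll `isStalled` from a timer.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical watchdog: reports a stall when no message arrived within a threshold. */
final class StallWatchdog {
    private final long thresholdNanos;
    private final AtomicLong lastMessageNanos;

    StallWatchdog(Duration threshold, long nowNanos) {
        this.thresholdNanos = threshold.toNanos();
        this.lastMessageNanos = new AtomicLong(nowNanos);
    }

    /** Call from the message handler each time a message is delivered. */
    void onMessage(long nowNanos) {
        lastMessageNanos.set(nowNanos);
    }

    /** True when the gap since the last message exceeds the threshold. */
    boolean isStalled(long nowNanos) {
        return nowNanos - lastMessageNanos.get() > thresholdNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        StallWatchdog watchdog = new StallWatchdog(Duration.ofMillis(200), System.nanoTime());
        watchdog.onMessage(System.nanoTime());
        Thread.sleep(300); // simulate a gap in deliveries longer than the threshold
        System.out.println("stalled: " + watchdog.isStalled(System.nanoTime()));
    }
}
```

The timestamps are passed in rather than read inside the class so that the check can be tested without sleeping. A check like this only detects the symptom; it says nothing about whether the cause is on the client or the server side.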

See also https://github.com/nats-io/nats.java/issues/997

We think now this is a NATS Server issue.

stefanLeo commented 11 months ago

And an update: we fixed a few config issues and are now at a point where the client successfully reconnects and resubscribes, but with the following results:

  1. PULL Consumer: stuck after resubscribe.
  2. PUSH Consumer: SOMETIMES failover works as expected and we receive messages immediately after the resubscribe, but SOMETIMES the consumer is stuck after the resubscription for 22 seconds until it receives the next message.

The NATS Server logs only that the "consumer is slow" during the 22 sec blackout.

derekcollison commented 11 months ago

Might be good to move up to the latest Java client release and server release.

derekcollison commented 11 months ago

If the issue is still occurring we can take a closer look.

stefanLeo commented 11 months ago

@derekcollison : Already done - sorry for not updating. We have tested with 2.10 server and 2.17.0 java client with the exact same results.

derekcollison commented 11 months ago

ok, and you use the official HELM chart to deploy?

Looping in @scottf and @wallyqs from our side to assist.

stefanLeo commented 11 months ago

@derekcollison : YES can confirm.

scottf commented 11 months ago

@stefanLeo Is this the same issue you filed in the Java client repo? https://github.com/nats-io/nats.java/issues/997 I've been trying to work with you there. If it is the same issue, I'd like to close this one as a duplicate.

stefanLeo commented 11 months ago

@scottf : Yes. I only created one for NATS Server because I thought we were at the point where this is more likely to be a server-side issue, in which case the ticket on the Java client side would have been the wrong place.

scottf commented 10 months ago

@stefanLeo This is addressed in these PRs: https://github.com/nats-io/nats.java/pull/1043 / https://github.com/nats-io/nats.java/pull/1045

This affects the following consumers.

  1. Durable Push Consumer
  2. Ephemeral Push Consumer with a long enough inactive threshold to survive a server outage or disconnection
  3. Ordered Push Consumer
  4. Durable Pull Consumer when used with Simplification API
  5. Ephemeral Pull Consumer with a long enough inactive threshold to survive a server outage or disconnection when used with Simplification API
  6. Ordered Pull Consumer (only available in Simplification)

stefanLeo commented 9 months ago

@scottf : Thx for the update. I had a quick look at the changes - quite a lot :-) Very much appreciated! Let me know if we can help in any way - I am also happy to run a test on our side with a SNAPSHOT version or so to give you feedback. If so, let me know where I could get such a version.

One question & one request though:

  1. Can you help us understand a bit better what exactly the "simple API" is? Is it what we find in this examples section? https://github.com/nats-io/nats.java/tree/main/src/examples/java/io/nats/examples/jetstream/simple Is there any other documentation or marking in the code for it?
  2. What can we expect from those changes? An automatic failover done by the Java lib itself?

THX again!!

scottf commented 9 months ago

2.17.2-SNAPSHOT should be available, gradle/maven instructions here in the readme https://github.com/nats-io/nats.java#using-gradle

stefanLeo commented 9 months ago

@scottf : I gave it a try and reused this code sample with the snapshot version: https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/simple/FetchMessagesExample.java

Result:

Logs of the client during the failure are attached.

nats-test-2.17.2-SNAPSHOT.log

Note on the update: I can set the resubscribe startsequence via ConsumerConfigBuilder - sorry my bad.

scottf commented 9 months ago

Fetch isn't going to recover because it's not endless. Once Fetch returns a null, it's done. Time might have expired anyway. I should have made my note refer to endless simplification, sorry for the confusion.
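To illustrate the distinction between a bounded fetch and endless consumption, here is a minimal JDK-only sketch. It does not use the NATS client: `BoundedFetch`, `EndlessOverFetch` and the canned batches are hypothetical stand-ins, and the only behavior taken from the comment above is that a fetch is finished once it returns null.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a bounded fetch: yields messages, then null once finished. */
interface BoundedFetch {
    String next();
}

final class EndlessOverFetch {
    /**
     * Keeps starting new fetches until maxFetches have been issued.
     * A null from one fetch only ends that fetch, not the whole loop.
     */
    static List<String> consume(Supplier<BoundedFetch> startFetch, int maxFetches) {
        List<String> received = new ArrayList<>();
        for (int i = 0; i < maxFetches; i++) {
            BoundedFetch fetch = startFetch.get();
            for (String msg = fetch.next(); msg != null; msg = fetch.next()) {
                received.add(msg);
            }
        }
        return received;
    }

    /** Test double: each call hands out the next canned batch as a fetch. */
    static Supplier<BoundedFetch> batches(List<List<String>> batches) {
        Deque<List<String>> remaining = new ArrayDeque<>(batches);
        return () -> {
            Deque<String> batch =
                    new ArrayDeque<>(remaining.isEmpty() ? List.<String>of() : remaining.poll());
            return batch::poll; // poll() returns null when the batch is drained
        };
    }

    public static void main(String[] args) {
        // The second batch is empty, as if that fetch expired during an outage.
        List<List<String>> canned = List.of(List.of("m1", "m2"), List.of(), List.of("m3"));
        System.out.println(consume(batches(canned), 3));
    }
}
```

With a single fetch, consumption stops at the first null; wrapping fetches in an outer loop is one way an application could keep going. Whether that is preferable to the endless consumption mentioned above depends on the application.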

scottf commented 9 months ago

I'm closing this since it is not a server issue. We can continue discussions in the java repo: https://github.com/nats-io/nats.java/issues/997