Closed stefanLeo closed 9 months ago
An update: we fixed a few config issues and are now at a point where the client successfully reconnects and resubscribes, but with the following results:
PULL Consumer: Stuck after resubscribe.
PUSH Consumer: SOMETIMES failover works as expected and we receive messages immediately after the resubscribe, but SOMETIMES the consumer is stuck for 22 seconds after the resubscription before it receives the next message.
The NATS Server logs only that the "consumer is slow" during the 22 sec blackout.
It might be good to move up to the latest Java client release and server release.
If the issue still occurs, we can take a closer look.
@derekcollison : Already done - sorry for not updating. We have tested with 2.10 server and 2.17.0 java client with the exact same results.
ok, and you use the official HELM chart to deploy?
Looping in @scottf and @wallyqs from our side to assist.
@derekcollison : YES can confirm.
@stefanLeo Is this the same issue you filed in the Java client repo? https://github.com/nats-io/nats.java/issues/997 I've been trying to work with you there. If it is the same issue, I'd like to close this one as a duplicate.
@scottf : Yes - I only created one for NATS Server because I thought we were at the point where this is more likely to be a server-side issue, in which case the ticket on the Java client side would have been the wrong place.
@stefanLeo This is addressed in these PRs: https://github.com/nats-io/nats.java/pull/1043 / https://github.com/nats-io/nats.java/pull/1045
This affects the following consumers.
@scottf : Thx for the update. I had a quick look at the changes - quite a lot :-) Much appreciated! Let me know if we can help in any way - I am also happy to run a test on our side with a SNAPSHOT version or similar to give you feedback. In that case, let me know where I could get such a version.
One question & one request though:
1) Can you help us understand a bit better what exactly the "simple API" is? Is it what we find in this examples section? https://github.com/nats-io/nats.java/tree/main/src/examples/java/io/nats/examples/jetstream/simple Is there any other documentation or marking in the code for it?
2) What can we expect from those changes? An automatic failover done by the Java lib itself?
THX again!!
2.17.2-SNAPSHOT should be available, gradle/maven instructions here in the readme https://github.com/nats-io/nats.java#using-gradle
@scottf : I gave it a try and reused this code sample with the snapshot version: https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/simple/FetchMessagesExample.java
Result:
Logs of the client during the failure are attached.
Note on the update: I can set the resubscribe startsequence via ConsumerConfigBuilder - sorry my bad.
Fetch isn't going to recover because it's not endless. Once Fetch returns a null, it's done. Time might have expired anyway. I should have made my note refer to endless simplification, sorry for the confusion.
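To illustrate the point about Fetch not being endless, here is a minimal sketch that does not use the nats.java API at all. `Batch`, `RefetchLoop` and `drain` are hypothetical names made up for this example; `Batch.next()` only models the behaviour described above, where a fetch returns null once it is finished. The assumption is that the application, not the fetch itself, has to start a new fetch if it wants to keep consuming.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Supplier;

/** Hypothetical stand-in for one bounded fetch: next() returns null once the batch is finished. */
interface Batch {
    String next();
}

final class RefetchLoop {
    /**
     * Drains successive batches until `wanted` messages are collected
     * or `maxBatches` batches have been started.
     */
    static List<String> drain(Supplier<Batch> startBatch, int wanted, int maxBatches) {
        List<String> received = new ArrayList<>();
        for (int i = 0; i < maxBatches && received.size() < wanted; i++) {
            Batch batch = startBatch.get(); // a fresh fetch on every iteration
            String msg;
            while (received.size() < wanted && (msg = batch.next()) != null) {
                received.add(msg);
            }
            // A null means this batch is finished for good; the outer loop starts another one.
        }
        return received;
    }

    public static void main(String[] args) {
        Queue<String> stream = new ArrayDeque<>(List.of("m1", "m2", "m3", "m4", "m5"));
        // Each simulated batch delivers at most two messages and then returns null.
        Supplier<Batch> startBatch = () -> new Batch() {
            private int left = 2;
            public String next() {
                return left-- > 0 ? stream.poll() : null;
            }
        };
        System.out.println(drain(startBatch, 5, 10)); // prints [m1, m2, m3, m4, m5]
    }
}
```

With `maxBatches` set to 1 the loop stops after the first null, which mirrors a single fetch that is simply done and never recovers on its own.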
I'm closing this since it is not a server issue. We can continue discussions in the java repo: https://github.com/nats-io/nats.java/issues/997
What version were you using?
Server: 2.9.22 & 2.10
Java Client: 2.16.14 & 2.17.0
What environment was the server running in?
We used the official NATS container images and the HELM charts for deployment.
Is this defect reproducible?
1) Set up a 3-node cluster on RHAT Openshift or any other Kubernetes cluster.
2) Start the producer with settings as described > https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPub.java
3) Start the consumer with settings as described > https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
4) Kill the JetStream leader node (find the stream leader using the nats CLI). The consumer may not even be directly connected to the JetStream leader node.
Given the capability you are leveraging, describe your expectation?
The Consuming client continues consuming messages after a leader has been elected.
Given the expectation, what is the defect you are observing?
We re-used the NATS JetStream Producer and PUSH Consumer examples from https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java and then killed the NATS leader node of the stream (we forcefully killed the VM hosting the Kube Worker Node).
Setup: In Memory Storage Option, 3 Replicas.
Config of createStream was changed to:

    StreamConfiguration sc = StreamConfiguration.builder()
        .name(streamName)
        .storageType(storageType)
        .subjects(subjects)
        .replicas(3)
        .description("LifeX-Test")
        .build();
and connection builder to:

    Options.Builder builder = new Options.Builder()
        .server(servers)
        .connectionTimeout(Duration.ofMillis(500))
        .pingInterval(Duration.ofSeconds(3))
        .maxPingsOut(2)
        .reconnectWait(Duration.ofMillis(500))
        .connectionListener(EXAMPLE_CONNECTION_LISTENER)
        .traceConnection()
        .errorListener(EXAMPLE_ERROR_LISTENER)
        .maxReconnects(-1)
        .reconnectDelayHandler(new PsReconnectDelayHandler())
        .reconnectJitter(Duration.ofMillis(500));
When connecting, we configure all 3 servers of the cluster and register connection, error and delay handlers (basically just logging the callbacks).
Setup Environment: NATS Cluster with 3 NATS Pods on top of RHAT Openshift Kubernetes cluster.
After the failure of the NATS master node the following happens:
Producer detects the failure, reconnects and continues sending after ~15 seconds.
Consumer DOES NOT detect the failure and is stuck. It does not log any error or disconnection info, nor any new message reception in the message handler.
Note that if we restart the consumer it can consume ALL messages sent by the producer, incl. the ones after the producer reconnected.
Note as well that the consumer aborts once the producer is done and deletes the stream. Then some disconnect log is printed.
Logs of the NATS nodes are attached... I cannot really add logs of the Java client as there are none, since it seems to just remain stuck indefinitely.
UPDATE: Added Java client logs with traceConnection settings and now we see more details. The client seems to reconnect and resubscribe, but still does NOT get any further messages pushed...
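Since the stuck consumer logs nothing at all, one way to at least make the condition visible from the application side is a simple silence watchdog. The following is only a sketch under stated assumptions: `StallDetector` is a hypothetical helper, not part of nats.java, and the 10 second threshold is an arbitrary value chosen for the example. It does not fix the underlying problem, and a producer that is legitimately quiet would trip it as well.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: remembers when the last message arrived and reports a stall after a quiet period. */
final class StallDetector {
    private final Duration maxSilence;
    private Instant lastMessageAt;

    StallDetector(Duration maxSilence, Instant startedAt) {
        this.maxSilence = maxSilence;
        this.lastMessageAt = startedAt;
    }

    /** Call this from the message handler for every delivered message. */
    synchronized void onMessage(Instant now) {
        lastMessageAt = now;
    }

    /** True when no message has arrived for longer than maxSilence. */
    synchronized boolean isStalled(Instant now) {
        return Duration.between(lastMessageAt, now).compareTo(maxSilence) > 0;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.EPOCH; // fixed clock so the example is deterministic
        StallDetector detector = new StallDetector(Duration.ofSeconds(10), t0);
        detector.onMessage(t0.plusSeconds(1));
        System.out.println(detector.isStalled(t0.plusSeconds(5)));  // prints false (4 s of silence)
        System.out.println(detector.isStalled(t0.plusSeconds(23))); // prints true (22 s of silence)
    }
}
```

In a real consumer, `onMessage` would be called with `Instant.now()` from the message handler and `isStalled` would be polled from a scheduled task; what to do on a stall (log, alert, resubscribe) is left to the application.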
See also https://github.com/nats-io/nats.java/issues/997
We now think this is a NATS Server issue.