nats-io / nats.java

Java client for NATS
Apache License 2.0

NATS timeout errors when calling NatsJetStreamManagement #980

Closed joachimglink closed 2 months ago

joachimglink commented 1 year ago

Observed behavior

As described in the Slack channel https://natsio.slack.com/archives/CM3T6T7JQ/p1692790156946559 we're sometimes seeing timeout errors when we try to fetch information from JetStream.

We see that the request reaches the NATS server, but the response is never received. We're not 100% sure whether this is a server or a client issue. Please move this to another repo if this one doesn't match.

We even see these timeouts if we increase the timeout threshold to 10 or even 30 seconds. As this is reproducible in small tests without high load on the system, system resources shouldn't be the problem here.

Expected behavior

All requests against the NatsJetStreamManagement should be answered.

Server and client version

We have experienced the same behavior on different server versions (2.8.x, 2.9.x and 2.10.0) and client versions (jnats 2.14.x, 2.16.x).

Host environment

Windows, with the NATS server running in Docker (Docker Desktop for Windows); GKE cluster

Steps to reproduce

nats-timeout-reproducer.zip

The attached ZIP contains a simple reproducer:

container-starter: This module creates a Testcontainer that starts the NATS server.

reproducer: A stripped-down version of our code which creates a stream, registers subjects to it, and adds message consumers to them.

How to build: Simply run mvn clean install -DskipTests in the parent folder. After that, go into reproducer and execute TimeoutReproducerTest.bat, which runs the supplied test 25 times. A couple of executions will succeed; others will fail with the mentioned timeout error. The log files of each test run are placed under the ./logs folder.

scottf commented 11 months ago

@joachimglink Please see my DM in Slack.

scottf commented 2 months ago

I cannot run your example, but my guess is that you are making a management call in the handler for another message on the same connection. We've seen this before. The dispatcher makes a blocking call to deliver a message to a handler. The handler then makes an API call, such as a JetStreamManagement call, that is a request under the covers. The request is sent and the reply comes in from the server, but the dispatcher is still busy delivering the first message, so the second request runs out of time.
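The blocking pattern described above can be reproduced in a toy model that uses only java.util.concurrent. This is a sketch of the general mechanism, not the jnats internals: the class name, the single inbox queue, and the 200 ms timeout are all made up for illustration. One dispatcher thread delivers items from the inbox; the handler issues a "request" whose reply arrives in that same inbox.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DispatcherBlockingDemo {

    /** Returns true if the nested request received its reply before timing out. */
    public static boolean run(boolean useExecutor) throws Exception {
        BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        CompletableFuture<String> pendingReply = new CompletableFuture<>();
        CompletableFuture<Boolean> outcome = new CompletableFuture<>();
        ExecutorService handlerPool = Executors.newCachedThreadPool();

        // The message handler: it makes a nested "request" and waits for the reply.
        Runnable handler = () -> {
            inbox.add("reply"); // the simulated server answers on the same inbox
            try {
                pendingReply.get(200, TimeUnit.MILLISECONDS);
                outcome.complete(true);
            } catch (TimeoutException e) {
                outcome.complete(false); // the reply was never delivered in time
            } catch (Exception e) {
                outcome.completeExceptionally(e);
            }
        };

        // The dispatcher: a single thread that delivers everything from the inbox.
        Thread dispatcher = new Thread(() -> {
            try {
                while (true) {
                    String item = inbox.take();
                    if (item.equals("reply")) {
                        pendingReply.complete("ok");
                    } else if (useExecutor) {
                        handlerPool.execute(handler); // hand off, keep dispatching
                    } else {
                        handler.run(); // blocks this thread until the handler returns
                    }
                }
            } catch (InterruptedException e) {
                // interrupted at the end of the demo: stop dispatching
            }
        });
        dispatcher.setDaemon(true);
        dispatcher.start();

        inbox.add("message");
        try {
            return outcome.get(5, TimeUnit.SECONDS);
        } finally {
            dispatcher.interrupt();
            handlerPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("inline handler, reply received:   " + run(false));
        System.out.println("executor handler, reply received: " + run(true));
    }
}
```

With inline delivery the handler waits on a reply that only its own, currently blocked, dispatcher thread could deliver, so the request times out. Handing the handler to an executor frees the dispatcher thread to deliver the reply.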

The fix for this is the following: In the connection options add in

.useDispatcherWithExecutor()

This tells the dispatcher to run the delivery as a task from the Options executor service, which you can also supply with the option

.executor(ExecutorService)
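For reference, a minimal sketch of how these two options might be combined when building the connection. The server URL and the choice of a cached thread pool are assumptions for illustration; an unbounded pool is used here because, as far as I know, the client also runs its own internal tasks on the Options executor, so a small fixed pool could starve them.

```java
import io.nats.client.Connection;
import io.nats.client.Nats;
import io.nats.client.Options;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecutorDispatchOptions {
    public static void main(String[] args) throws Exception {
        // Assumption: an unbounded cached pool, so handler tasks cannot starve
        // other work that is scheduled on the same executor.
        ExecutorService executor = Executors.newCachedThreadPool();

        Options options = new Options.Builder()
                .server("nats://localhost:4222") // assumption: local test server
                .executor(executor)              // supply the executor service
                .useDispatcherWithExecutor()     // deliver messages as executor tasks
                .build();

        try (Connection nc = Nats.connect(options)) {
            System.out.println("Connection status: " + nc.getStatus());
        } finally {
            executor.shutdown(); // a no-op if the executor was already shut down
        }
    }
}
```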

Please let me know if this solves your issue. Otherwise, I will try again with your project, but I'm going to need to run against a local server (not Docker) with a more simplified project.

joachimglink commented 1 week ago

Sorry that I didn't respond earlier.

I tried the useDispatcherWithExecutor parameter and it seems to fix the timeout issues when sending messages. But we still experience the timeout issues when we try to register a message consumer or fetch information from JetStream.

scottf commented 1 week ago

@joachimglink My guess is that you have some threading model that is trying to do something while handling another message and it ends up blocking. The useDispatcherWithExecutor option might just be hiding the problem. The client generally works, but I see this from time to time. I had a customer with the identical issue, and we traced it down to their code doing exactly what I just described. I suggested that they change that, or maybe use another connection to do management tasks like creating consumers.
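A rough sketch of the "another connection for management" idea, assuming a local server; the stream name, subject, and sleep duration are hypothetical placeholders. Messages are handled on one connection, while JetStreamManagement calls go over a second one, so a management reply does not depend on the dispatcher that is currently busy.

```java
import io.nats.client.Connection;
import io.nats.client.Dispatcher;
import io.nats.client.JetStreamApiException;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.StreamInfo;

import java.io.IOException;

public class SeparateManagementConnection {
    public static void main(String[] args) throws Exception {
        String url = "nats://localhost:4222"; // assumption: local test server

        try (Connection messaging = Nats.connect(url);
             Connection management = Nats.connect(url)) {

            // Management context bound to the second connection.
            JetStreamManagement jsm = management.jetStreamManagement();

            // Handler runs on the messaging connection's dispatcher.
            Dispatcher dispatcher = messaging.createDispatcher(msg -> {
                try {
                    // "my-stream" is a hypothetical stream name.
                    StreamInfo info = jsm.getStreamInfo("my-stream");
                    System.out.println("stream info: " + info);
                } catch (IOException | JetStreamApiException e) {
                    System.err.println("management call failed: " + e.getMessage());
                }
            });
            dispatcher.subscribe("demo.subject"); // hypothetical subject

            Thread.sleep(10_000); // keep the demo alive long enough to receive messages
        }
    }
}
```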

Also, something I would suggest, if you haven't already, is to get the latest library. We have done work to make sure all internal futures are properly terminated; futures that are left unterminated end up parking threads, which may also be a cause here.