Open nenych opened 6 months ago
Server 2.9.20 is now quite a while out of date, let us know how latest 2.10 works for you.
Below you can see the same test with NATS 2.10.14; with this version we have even worse results:
Any updates here? We have the same issue on our cluster. Thanks
Hello, we are still observing this issue: when at least 1 slow consumer is detected, we see up to 90% performance degradation. In the screenshot below we had 1 slow pod. Server version: 2.10.19-RC.3-alpine3.20
@nenych are you a Synadia customer?
@derekcollison No, I am not.
No worries, we will always do our best to help out the ecosystem. We do prioritize customers of course.
I think the next step would be a video call with you, so we can really understand what is going on.
@derekcollison Sure, we can have a video call. Right now we have some test infrastructure where we can show you the problem and our findings.
Will see if @wallyqs has some time to jump on a call.
Hi @nenych, ping me at wally@nats.io when you are available and we can have a look.
I think it is just due to the detection of consumer(s) that are falling behind, at which point the server stalls the fast producers. Running the server in debug mode (-D) should show you messages similar to "Timed out of fast producer stall (100ms)". It affects inbound messages from producers on that subject (meaning the subject matching the slow consumer(s)), but it would not affect other producers that send to non-slow consumers (aside from the fact that there is a maximum the server can handle, so the perf per producer may decrease while overall inbound perf increases or is maintained).
That has always been the case (although we did tweak the stalling approach over the years).
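To check whether the stall described here is what is happening, one option is to capture the server's debug output and count the stall messages. The sketch below is a minimal helper, not part of NATS or nats-py. It assumes only that each relevant log line contains the text "Timed out of fast producer stall", optionally followed by a duration such as "(100ms)". The rest of the log line layout and the function name `count_stalls` are assumptions for illustration.

```python
import re
from typing import Iterable, Tuple

# Text quoted in this thread; the rest of the log line layout is an assumption.
STALL_MARKER = "Timed out of fast producer stall"
# Optional duration in parentheses after the marker, e.g. "(100ms)" or "(2s)".
DURATION_RE = re.compile(r"\((\d+(?:\.\d+)?)(ms|s)\)")


def count_stalls(lines: Iterable[str]) -> Tuple[int, float]:
    """Return (number of stall messages, total stalled time in milliseconds)."""
    count = 0
    total_ms = 0.0
    for line in lines:
        pos = line.find(STALL_MARKER)
        if pos == -1:
            continue  # not a stall message
        count += 1
        # Only look for a duration after the marker, not in the line prefix.
        match = DURATION_RE.search(line, pos)
        if match:
            value = float(match.group(1))
            total_ms += value if match.group(2) == "ms" else value * 1000.0
    return count, total_ms
```

Any iterable of strings works as input, for example an open file handle on a saved debug log. A count that rises once the slow consumer connects would be consistent with the explanation above, though it does not rule out other causes.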
Observed behavior
Performance degrades after a slow consumer connects. As you can see below, we observe about 30% degradation in incoming messages when the first slow consumer connects, and about 50% after the second one.
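For reference, percentages like these can be derived from two samples of the server's inbound message counter. The sketch below assumes snapshots shaped like the server's /varz monitoring output, using only its `in_msgs` field. The helper names and the sample numbers are illustrative and are not measurements from this report.

```python
def msg_rate(first: dict, second: dict, interval_s: float) -> float:
    """Inbound messages per second between two /varz-style snapshots."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (second["in_msgs"] - first["in_msgs"]) / interval_s


def degradation_pct(baseline_rate: float, current_rate: float) -> float:
    """Percentage drop of current_rate relative to baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - current_rate) * 100.0 / baseline_rate


# Illustrative numbers only: two snapshots taken 10 seconds apart.
before = {"in_msgs": 1_000_000}
after = {"in_msgs": 1_700_000}
current = msg_rate(before, after, 10.0)      # 70000.0 msgs/s
drop = degradation_pct(100_000.0, current)   # relative to a 100000 msgs/s baseline
```

Comparing rates over equal-length windows before and after the slow consumer connects keeps the two figures comparable.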
Expected behavior
Stop sending messages to slow consumers until their buffers are empty, without slowing down the server.
Server and client version
Server: 2.9.20
Python library: nats-py 2.7.2
Host environment
Local: macOS 14.4.1, arm64, Docker 26.0.0. The same behavior occurs with the amd64 emulator (--platform=linux/amd64 flag).
GKE Container-Optimized OS, amd64, containerd
Steps to reproduce
I prepared the required configs and a docker-compose file that starts NATS, Prometheus, an exporter, and two consumers: https://github.com/nenych/nats-test.
Steps to run
Explore metrics