nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

Performance degradation with slow consumers [v2.9.20, v2.10.14] #5394

Open nenych opened 6 months ago

nenych commented 6 months ago

Observed behavior

Performance degrades after a slow consumer connects. As you can see below, we observe roughly a 30% drop in incoming messages when one slow consumer connects, and roughly a 50% drop after a second one connects.

(screenshot, 2024-05-07 16:58: incoming message rate before and after the slow consumers connect)

Expected behavior

The server should stop delivering messages to slow consumers until their buffers drain, without slowing down the rest of the server.

Server and client version

Server: 2.9.20
Python library: nats-py 2.7.2

Host environment

Local: macOS 14.4.1, arm64, Docker 26.0.0. The same behavior occurs with the amd64 emulator (--platform=linux/amd64 flag).

GKE Container-Optimized OS, amd64, containerd

Steps to reproduce

I prepared the required configs and a docker-compose file that starts NATS, Prometheus, an exporter, and two consumers: https://github.com/nenych/nats-test.

Steps to run

  1. Clone the repository.
  2. Build the docker image:
    docker build -t test/nats:latest .
  3. Install NATS cli: https://docs.nats.io/using-nats/nats-tools/nats_cli
  4. Run docker-compose (this starts NATS, Prometheus, and two consumers):
    docker-compose -f ./docker-compose.yaml up -d
  5. Start NATS benchmark:
    nats bench updates --pub=4 --msgs 1000000000 --size=1000
  6. Wait a little and start the slow consumer (a minimal sketch of such a consumer is shown after these steps):
    docker run --rm -it --network=nats-test_default test/nats:latest python3 slow-consumer.py
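
For context, below is a minimal sketch of what a deliberately slow nats-py subscriber could look like; the actual slow-consumer.py in the linked repo may differ, and the connection URL and handler delay here are illustrative assumptions (the subject updates matches the bench command above).

    import asyncio

    import nats


    async def main():
        # Connect to the NATS server from the docker-compose setup
        # (URL is an assumption; inside the compose network it may be nats://nats:4222).
        nc = await nats.connect("nats://localhost:4222")

        async def handler(msg):
            # Simulate slow processing so this subscription falls behind.
            await asyncio.sleep(0.1)

        # Subscribe to the same subject used by `nats bench updates`.
        await nc.subscribe("updates", cb=handler)

        # Keep the connection alive so the server keeps delivering messages.
        while True:
            await asyncio.sleep(1)


    if __name__ == "__main__":
        asyncio.run(main())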

Explore metrics

  1. Open Prometheus: http://localhost:9091/graph
  2. Insert query:
    sum by (job) (rate(nats_varz_in_msgs[30s]))
ripienaar commented 6 months ago

Server 2.9.20 is now quite a while out of date; let us know how the latest 2.10 works for you.

nenych commented 6 months ago

Below you can see the same test with NATS 2.10.14; with this version the results are even worse:

(screenshot, 2024-05-07 18:36: the same test on NATS 2.10.14)

kam1kaze commented 5 months ago

Any updates here? We have the same issue on our cluster. Thanks

nenych commented 1 month ago

Hello, we are still observing this issue: when at least one slow consumer is detected, we see up to 90% performance degradation. In the screenshot below we had one slow pod.

Server version: 2.10.19-RC.3-alpine3.20

(screenshot, 2024-09-30 11:07: throughput with one slow pod)

derekcollison commented 1 month ago

@nenych are you a Synadia customer?

nenych commented 1 month ago

@derekcollison No, I am not.

derekcollison commented 1 month ago

No worries, we will always do our best to help out the ecosystem. We do prioritize customers of course.

I think we would need to do a video call with you as a next step to really understand what is going on.

nenych commented 1 month ago

@derekcollison Sure, we can have a video call. Right now we have some test infrastructure where we can show you the problem and our findings.

derekcollison commented 1 month ago

Will see if @wallyqs has some time to jump on a call.

wallyqs commented 1 month ago

Hi @nenych, ping me at wally@nats.io when you are available and we can have a look.

kozlovic commented 1 month ago

I think this is just the server detecting consumer(s) that are falling behind and stalling the fast producers. Running the server in debug mode (-D) should show messages similar to Timed out of fast producer stall (100ms). This affects the inbound rate of producers publishing on that subject (meaning a subject matching the slow consumer(s)), but it should not affect other producers that send to non-slow consumers (aside from the fact that there is a maximum the server can handle, so per-producer performance may decrease while overall inbound performance increases or is maintained).

That has always been the case (although we did tweak the stalling approach over the years).
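
To see the per-subject effect described above, a rough sketch like the following (not from the issue; the URL, message count, and file name are assumptions) can measure publish throughput with nats-py on a chosen subject. Running it against updates while the slow consumer is attached, and then against an unrelated subject, should show whether only the stalled subject's producers are affected.

    import asyncio
    import sys
    import time

    import nats


    async def main(subject: str):
        # URL assumes the docker-compose setup exposes NATS on localhost:4222.
        nc = await nats.connect("nats://localhost:4222")

        n = 100_000
        payload = b"x" * 1000  # same 1000-byte payload as the bench command

        start = time.monotonic()
        for _ in range(n):
            await nc.publish(subject, payload)
        await nc.flush()
        elapsed = time.monotonic() - start

        print(f"{subject}: {n / elapsed:,.0f} msgs/sec")
        await nc.close()


    if __name__ == "__main__":
        # e.g. `python3 pub-rate.py updates` vs `python3 pub-rate.py other.subject`
        asyncio.run(main(sys.argv[1] if len(sys.argv) > 1 else "updates"))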