nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

Stream performance decreases over time #3899

Closed jarretlavallee closed 1 year ago

jarretlavallee commented 1 year ago

Defect

While running nats bench against a stream, performance decreases over time. We see no increase in resource utilization while the message rate drops from ~11k to ~3k messages per second.

Versions of nats-server and affected client libraries used:

nightly-20230222

OS/Container environment:

OpenShift running 3x NATS nodes with JetStream. SSD storage. 10 CPU, 36GB memory, GOMEMLIMIT=23GB

Steps or code to reproduce the issue:

On a new NATS cluster with no other connections, create the stream:

nats --timeout=30s str add Events --subjects="Events.>" --max-age="13h" \
  --discard="old" --dupe-window=1s --max-bytes=-1 --max-consumers=-1 \
  --max-msgs=-1 --max-msg-size=-1 --max-msgs-per-subject=-1 --replicas=3 \
  --retention="limits" --storage="file" --no-allow-rollup --no-deny-delete \
  --no-deny-purge
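For readers reproducing this from a client rather than the CLI, here is a rough equivalent of the stream definition above using the Go client (nats.go); the connection URL is a placeholder and not part of the original report:

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL; adjust for the actual cluster.
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Mirrors the `nats str add Events ...` command above.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:              "Events",
		Subjects:          []string{"Events.>"},
		MaxAge:            13 * time.Hour,
		Discard:           nats.DiscardOld,
		Duplicates:        time.Second, // --dupe-window=1s
		MaxBytes:          -1,
		MaxConsumers:      -1,
		MaxMsgs:           -1,
		MaxMsgSize:        -1,
		MaxMsgsPerSubject: -1,
		Replicas:          3,
		Retention:         nats.LimitsPolicy,
		Storage:           nats.FileStorage,
	})
	if err != nil {
		log.Fatal(err)
	}
}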

Create the consumer:

nats --timeout=30s consumer add --ack=explicit --replay=instant \
  --replicas=3 --max-deliver=-1 --deliver=all --max-pending=10000 \
  --no-headers-only --backoff=none --filter="Events.*" --pull Events \
  zdataDurName
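A corresponding sketch of the consumer definition, assuming the `js` JetStream context and imports from the previous snippet:

// Mirrors the `nats consumer add ... --pull Events zdataDurName` command above.
if _, err := js.AddConsumer("Events", &nats.ConsumerConfig{
	Durable:       "zdataDurName",
	AckPolicy:     nats.AckExplicitPolicy,
	ReplayPolicy:  nats.ReplayInstantPolicy,
	DeliverPolicy: nats.DeliverAllPolicy,
	MaxDeliver:    -1,
	MaxAckPending: 10000,
	FilterSubject: "Events.*",
	Replicas:      3,
}); err != nil {
	log.Fatal(err)
}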

Run nats bench and monitor the messages per second (MPS). After about 3 hours, the MPS will be lower than the initial throughput. Note that MaxAge is 13 hours, so the degradation occurs well before any messages expire.

nats bench "Events" --stream="Events"
--consumer="zdataDurName" --pub=256 --sub=8 --pull
--size="1500" --syncpub --msgs=500000000 --multisubject
--consumerbatch=256 --pubsleep=20ms --purge
--multisubjectmax=65535 --no-progress --js
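For anyone who wants to sample consumer-side throughput independently of nats bench, a minimal pull/ack loop like the following can be used; this is only a sketch and assumes the stream, durable, and `js` context from the snippets above:

// Rough consumer-side throughput check, independent of `nats bench`.
sub, err := js.PullSubscribe("Events.*", "zdataDurName", nats.BindStream("Events"))
if err != nil {
	log.Fatal(err)
}

start := time.Now()
total := 0
for time.Since(start) < time.Minute {
	msgs, err := sub.Fetch(256, nats.MaxWait(2*time.Second))
	if err != nil {
		continue // fetch timeouts are expected when no messages are pending
	}
	for _, m := range msgs {
		m.Ack()
		total++
	}
}
log.Printf("approx. %d msgs/sec over the last minute", total/60)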

Expected result:

Performance should be sustained at a consistent level.

Actual result:

Performance decreases by as much as 60%.

neilalexander commented 1 year ago

The 20230222 nightly was rebuilt about an hour ago with some fixes from today; please re-verify with the latest image (digest 74efb5dae2a1) if possible.

jarretlavallee commented 1 year ago

We tested on 74efb5dae2a1 and still saw the issue. Since then, we have changed the design to use multiple streams, each at the ~3k MPS performance level, to reach the desired aggregate throughput on 2.9.15.
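The partitioning scheme is not described in the comment above; purely as an illustration, splitting traffic across several smaller streams could look something like the following (stream names and subject layout are hypothetical, and it assumes the `js` context and imports from the first sketch plus "fmt"):

// Hypothetical illustration only: the issue does not describe the actual
// partitioning, so the stream names and subject layout here are made up.
const numStreams = 4

for i := 0; i < numStreams; i++ {
	_, err := js.AddStream(&nats.StreamConfig{
		Name:      fmt.Sprintf("Events%d", i),
		Subjects:  []string{fmt.Sprintf("Events.%d.>", i)},
		MaxAge:    13 * time.Hour,
		Replicas:  3,
		Retention: nats.LimitsPolicy,
		Storage:   nats.FileStorage,
	})
	if err != nil {
		log.Fatal(err)
	}
}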

jnmoyne commented 1 year ago

This may be the same issue as #3948.