Open jack7803m opened 3 weeks ago
Spent some time tracing back to where the response permissions are actually used and as I expected, the replies are stored in a map.
Looks like the major performance problem (tested this locally with the CPU profiler too) is coming from client.pruneReplyPerms()
because it loops over the entire map and it gets called way too often. Maybe it should be called once every replyPermLimit
messages instead of every message when the map length is bigger than replyPermLimit
? Testing that change locally seemed to have a massive performance improvement, but not sure how it would affect memory usage on a system that's not getting messages all the time. Maybe it would be better to check based on a time interval?
I'm quite inexperienced with this repository, so if I could get a sanity check from any of the regular maintainers here that would be great. Planning on submitting a PR with a fix tomorrow, probably using a combination of the message count and time interval checking.
I think you're right in that there's a problem with how these entries are cleaned up.
If there's no expiry timestamp, then a lack of response can effectively mean the entries live forever. Likewise, if there's a max response count and we don't get that many responses, they too can live forever.
It is my feeling that any configuration with no expires
should be invalid, so we should either maybe error in that case or set a default expires amount.
We might also want to schedule a goroutine to clean up on a regular schedule instead of on each call to deliverMsg
, otherwise the work to clean up the map just compounds.
Will have a think about it a bit more.
Observed behavior
When using the
allow_responses
map in user permissions withmax
set, request/reply takes a significant performance hit depending on the value ofexpires
. If themax
value is set but theexpires
value is unset, performance continually degrades over time. Even if usingexpires: "10s"
, the request/reply throughput drops by about 70% (on my machine).Not sure if there's any way around this - although I haven't looked into the server code, I would imagine having to cache the request subjects for the max reply count indefinitely would cause memory usage to creep up and performance to go down over time. That said, I think this should really be either fixed (if possible) or at least documented so that it's clear that using the
allow_responses
map will have an impact on request/reply performance. It might also be a good idea to just not allow specifyingmax
without an expiry, as it seems like this will always lead to a memory leak.Expected behavior
Performance should not degrade or it should be documented
Server and client version
v2.10.20
Host environment
Tested with the latest docker image on an ubuntu server and separately with the server binary in a WSL ubuntu environment
Steps to reproduce
nats.conf with no issues:
nats.conf with worse performance:
nats.conf with memory leak:
Using nats cli (
nats
) and nats server (nats-server
) binaries in three terminals:./nats-server -c nats.conf
./nats bench test --sub 1 --reply --user test --password test
./nats bench test --pub 1 --request --user test --password test