nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
16k stars 1.41k forks source link

Severe request/reply performance hit when using allow_responses map #6058

Open jack7803m opened 3 weeks ago

jack7803m commented 3 weeks ago

Observed behavior

When using the allow_responses map in user permissions with max set, request/reply takes a significant performance hit depending on the value of expires. If the max value is set but the expires value is unset, performance continually degrades over time. Even if using expires: "10s", the request/reply throughput drops by about 70% (on my machine).

Not sure if there's any way around this - although I haven't looked into the server code, I would imagine having to cache the request subjects for the max reply count indefinitely would cause memory usage to creep up and performance to go down over time. That said, I think this should really be either fixed (if possible) or at least documented so that it's clear that using the allow_responses map will have an impact on request/reply performance. It might also be a good idea to just not allow specifying max without an expiry, as it seems like this will always lead to a memory leak.

Expected behavior

Performance should not degrade or it should be documented

Server and client version

v2.10.20

Host environment

Tested with the latest docker image on an ubuntu server and separately with the server binary in a WSL ubuntu environment

Steps to reproduce

nats.conf with no issues:

host: 0.0.0.0

authorization {
  users = [
    {user: test, password: test, permissions: { publish: ">", subscribe: ">" }}
  ]
}

nats.conf with worse performance:

host: 0.0.0.0

authorization {
  users = [
    {user: test, password: test, permissions: { publish: ">", subscribe: ">", allow_responses: { max: 1, expires: "10s" } }}
  ]
}

nats.conf with memory leak:

host: 0.0.0.0

authorization {
  users = [
    {user: test, password: test, permissions: { publish: ">", subscribe: ">", allow_responses: { max: 1 } }}
  ]
}

Using nats cli (nats) and nats server (nats-server) binaries in three terminals:

./nats-server -c nats.conf ./nats bench test --sub 1 --reply --user test --password test ./nats bench test --pub 1 --request --user test --password test

jack7803m commented 3 weeks ago

Spent some time tracing back to where the response permissions are actually used and as I expected, the replies are stored in a map.

https://github.com/nats-io/nats-server/blob/1ee2b8a11af89c8c2c2281105bee851e53250771/server/client.go#L3635C1-L3642C3

Looks like the major performance problem (tested this locally with the CPU profiler too) is coming from client.pruneReplyPerms() because it loops over the entire map and it gets called way too often. Maybe it should be called once every replyPermLimit messages instead of every message when the map length is bigger than replyPermLimit? Testing that change locally seemed to have a massive performance improvement, but not sure how it would affect memory usage on a system that's not getting messages all the time. Maybe it would be better to check based on a time interval?

I'm quite inexperienced with this repository, so if I could get a sanity check from any of the regular maintainers here that would be great. Planning on submitting a PR with a fix tomorrow, probably using a combination of the message count and time interval checking.

neilalexander commented 3 weeks ago

I think you're right in that there's a problem with how these entries are cleaned up.

If there's no expiry timestamp, then a lack of response can effectively mean the entries live forever. Likewise, if there's a max response count and we don't get that many responses, they too can live forever.

It is my feeling that any configuration with no expires should be invalid, so we should either maybe error in that case or set a default expires amount.

We might also want to schedule a goroutine to clean up on a regular schedule instead of on each call to deliverMsg, otherwise the work to clean up the map just compounds.

Will have a think about it a bit more.