Closed cjbottaro closed 7 months ago
Do you see similar results with 2.10.10, the latest?
What version were you running?
We just upgraded to 2.10.10 and it didn't change anything.
Our last few upgrades have been...
2.9.21
2.9.24
2.10.7
2.10.9
2.10.10
We think the slowness happened when going to 2.10. When we upgraded, we also added a few dozen ephemeral consumers using filter_subject (singular), and one durable consumer using about 10 filter_subjects (plural). But we have since deleted those.
Did anything change with the protocol? We use our own NATS library and are making v2 requests to the JetStream APIs.
It's kind of curious that we're seeing almost exactly 40 acks/s out of two pods with concurrency set to 20.
Is there anything that limits consumer_msg_next / consumer_msg_ack on a single connection? Our pods just make a single connection to the NATS server, then loop calling consumer_msg_next and handling the message on a thread. The thread then calls consumer_msg_ack. We limit our worker threads to 20.
Edit: We also tried noop'ing the handling of the message and just immediately ack'ing it to verify the slowness is not in our app code.
Thanks for the help.
I think separate consumers might be good for now; we know there is a performance hit with many filter subjects, and we are working on some improvements.
We deleted all the new consumers that we made and we're back to our original single stream and single consumer that we had before upgrading to 2.10, and the slowness persists. We could try downgrading to 2.9 again.
thanks for the updates @cjbottaro, any chance that it could be related to the client version (I think ruby in this case)?
I think, if possible, it might be beneficial to test those consumers with NATS CLI bench tool to be sure if it's a server issue or the client issue.
Ahh, that's a good idea.
Some more info: We definitely think it's a server issue because we've had two outages recently. A node in the cluster will just start timing out all requests sent to it, and we get NATS::IO::SocketTimeoutError (Ruby client). Restarting the node fixes the issue.
@cjbottaro are you a Synadia customer? We would want to jump on a call and try to triage the system directly. What you are experiencing is not common from our other users and customers.
We figured it out. We were abusing v1 requests with the consumer_msg_next endpoint in order to multiplex blocking-style requests on a single NATS connection. It seems that 2.10 is sensitive to that (maybe the constant creating/deleting of subscriptions), because it didn't seem to be an issue with 2.9.
It would be neat if consumer_msg_next added the reply_to in a header or similar, so that we could use it with v2-style requests, but I understand that doesn't jibe well with batch_size > 1.
Sorry for the noise!
are you a Synadia customer?
We're working on that! Hopefully in the next year or two, we can figure out our budget.
Which client are you using again?
Our own Elixir client, along with nats-pure for Ruby and the official Go client as well.
Might be good to sync with @Jarema and @piotrpio on the low level details of how to interact with the server for pulls.
@cjbottaro Definitely! Reach us on NATS Slack.
Observed behavior
After upgrading to 2.10, our consumer's throughput drastically decreased, from hundreds of messages per second to just tens of messages per second.
Our stream has over 1 million subjects. We use it for change events. A concrete subject might look like this:
change-stream.postgres.5.users.123
And the idea is that our consumers could use filters like:
change-stream.postgres.*.users.*
change-stream.postgres.3
Prior to upgrading to 2.10, we had just a single consumer with a singular filter_subject. After the upgrade, we added a new consumer with plural filter_subjects, and also added a few dozen ephemeral consumers with a singular filter_subject.
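(For readers unfamiliar with the two fields: filter_subject and filter_subjects are the actual NATS consumer-config field names, and multi-filter consumers were introduced in server 2.10. The fragment below is a hedged sketch of what such configs might look like; the durable names and the second subject are made up for illustration.)

```json
[
  {
    "durable_name": "cdc-single",
    "ack_policy": "explicit",
    "filter_subject": "change-stream.postgres.*.users.*"
  },
  {
    "durable_name": "cdc-multi",
    "ack_policy": "explicit",
    "filter_subjects": [
      "change-stream.postgres.*.users.*",
      "change-stream.postgres.*.orders.*"
    ]
  }
]
```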
We see the throughput drop in the NATS Prometheus exporter data, and we also see the drop in our application-level monitoring.
We thought it was due to the extra consumers with all the extra filtering, so we deleted them all except for the one consumer from prior to the deploy, but the low throughput still persists.
Kind of interesting: throughput does seem to scale with the number of application processes, which are just looping on consumer_msg_next and acking each message. But again, prior to the upgrade a single pod was processing 500 msg/s, and now it's down to around 20 msg/s.
We even tried noop'ing handle_msg(msg), but it does not increase throughput at all.
Maybe this is related? https://github.com/nats-io/nats-server/issues/4888
Expected behavior
A single consumer should be able to handle higher throughput.
Server and client version
2.10.9
Host environment
AWS, Linux
Steps to reproduce
No response