nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Slowness in 2.10 with stream with many subjects #5040

Closed cjbottaro closed 7 months ago

cjbottaro commented 7 months ago

Observed behavior

After upgrading to 2.10, our consumer's throughput drastically decreased from processing hundreds of messages per second, to just tens of messages per second.

Our stream has over 1 million subjects. We use it for change events. The subjects look like:

change-stream.<source>.<tenant>
change-stream.<source>.<tenant>.<table>.<pkey>

So a concrete subject might look like this:

change-stream.postgres.1           # entire transaction
change-stream.postgres.1.users.123 # single row change

And the idea is that our consumers can filter on whichever subset of those subjects they care about.

Prior to upgrading to 2.10, we had just a single consumer with a singular filter_subject:

change-stream.*.*

After the upgrade, we added a new consumer with plural filter_subjects:

change-stream.postgres.*.reviewer_groups.*
change-stream.postgres.*.reviewer_group_members.*
change-stream.postgres.*.reviewer_group_assignments.*
change-stream.postgres.*.qualification_groups.*
change-stream.postgres.*.qualifications.*
change-stream.postgres.*.opportunities
change-stream.postgres.*.applications.*
change-stream.postgres.*.categories.*
change-stream.direct.*.cassandra-applications.*

And we also added a few dozen ephemeral consumers, each with a singular filter_subject:

change-stream.postgres.0.clients.*
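
For reference, here is a rough sketch of what creating consumers like these can look like with the official Go client's jetstream package; the stream and consumer names are placeholders, and multiple filter subjects require server 2.10+:

package main

import (
	"context"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Durable consumer with plural filter_subjects (server 2.10+ only).
	_, err = js.CreateOrUpdateConsumer(ctx, "CHANGE_STREAM", jetstream.ConsumerConfig{
		Durable: "tables-consumer", // placeholder name
		FilterSubjects: []string{
			"change-stream.postgres.*.reviewer_groups.*",
			"change-stream.postgres.*.applications.*",
		},
		AckPolicy: jetstream.AckExplicitPolicy,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Ephemeral consumer with a singular filter_subject (works on 2.9 as well).
	_, err = js.CreateOrUpdateConsumer(ctx, "CHANGE_STREAM", jetstream.ConsumerConfig{
		FilterSubject: "change-stream.postgres.0.clients.*",
		AckPolicy:     jetstream.AckExplicitPolicy,
	})
	if err != nil {
		log.Fatal(err)
	}
}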

We see the throughput drop in the NATS Prometheus exporter data (screenshot: 2024-02-02-17-53-38).

And we also see the drop in application level monitoring.

We thought it was due to the extra consumers with all the extra filtering, so we deleted them all except for the one consumer we had prior to the deploy, but the low throughput still persists.

Kind of interesting: throughput does seem to scale with the number of application processes, which are just doing this:

while true
  msg = consumer_msg_next(batch: 1)
  handle_msg(msg)
  ack(msg)
end
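
For comparison, a rough Go equivalent of that loop using the official client's jetstream package (stream and consumer names are placeholders):

package main

import (
	"context"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	cons, err := js.Consumer(context.Background(), "CHANGE_STREAM", "change-stream-consumer")
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Pull one message at a time, mirroring the batch: 1 loop above.
		batch, err := cons.Fetch(1)
		if err != nil {
			log.Print(err)
			continue
		}
		for msg := range batch.Messages() {
			// ... handle msg.Data() ...
			_ = msg.Ack()
		}
	}
}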

But again, prior to the upgrade, a single pod was processing 500 msg/s, and now it's down to around 20 msg/s.

We even tried noop'ing handle_msg(msg), but it does not increase throughput at all.

Maybe this is related? https://github.com/nats-io/nats-server/issues/4888

Expected behavior

A single consumer should be able to handle higher throughput.

Server and client version

2.10.9

Host environment

AWS, Linux

Steps to reproduce

No response

derekcollison commented 7 months ago

Do you see similar results with 2.10.10, the latest?

What version were you running?

cjbottaro commented 7 months ago

We just upgraded to 2.10.10 and it didn't change anything.

Our last few upgrades have been...

2.9.21
2.9.24
2.10.7
2.10.9
2.10.10

We think the slowness happened when going to 2.10. When we upgraded, we also added a few dozen ephemeral consumers using filter_subject (singular), and one durable consumer using like 10 filter_subjects (plural). But we have since deleted those.

Did anything change with the protocol? We use our own NATS lib and are making v2 requests to the JetStream APIs.

It's kind of curious that we're seeing almost exactly 40 ack/s out of two pods with concurrency set to 20.

Is there anything that limits consumer_msg_next / consumer_msg_ack on a single connection? Our pods just make a single connection to the NATS server, then loop calling consumer_msg_next and handling each message on a thread. The thread then calls consumer_msg_ack. We limit our worker threads to 20.
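
For context, a single consumer_msg_next call boils down to roughly this request/reply exchange against the JetStream pull API (Go used only as a sketch; the stream/consumer names and the 5s expiry are illustrative, and expires is in nanoseconds):

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	subj := "$JS.API.CONSUMER.MSG.NEXT.CHANGE_STREAM.change-stream-consumer"
	req := []byte(`{"batch":1,"expires":5000000000}`) // ask for 1 message, wait up to 5s

	msg, err := nc.Request(subj, req, 6*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	if msg.Header.Get("Status") != "" {
		log.Printf("no message: status %s", msg.Header.Get("Status"))
		return
	}

	fmt.Printf("got %q on %s\n", msg.Data, msg.Subject)

	// Ack by replying to the delivered message's reply subject.
	if err := msg.Respond(nil); err != nil {
		log.Fatal(err)
	}
}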

Edit: We also tried noop'ing the handling of the message and just immediately ack'ing it to verify the slowness is not in our app code.

Thanks for the help.

derekcollison commented 7 months ago

I think separate consumers might be good for now; we know there is a performance hit with many filter subjects, and we are working on some improvements.

cjbottaro commented 7 months ago

We deleted all the new consumers that we made and we're back to our original single stream and single consumer that we had before upgrading to 2.10, and the slowness persists. We could try downgrading to 2.9 again.

wallyqs commented 7 months ago

Thanks for the updates @cjbottaro. Any chance it could be related to the client version (I think Ruby in this case)?

Jarema commented 7 months ago

I think, if possible, it might be beneficial to test those consumers with the NATS CLI bench tool, to be sure whether it's a server issue or a client issue.

cjbottaro commented 7 months ago

> I think, if possible, it might be beneficial to test those consumers with the NATS CLI bench tool, to be sure whether it's a server issue or a client issue.

Ahh, that's a good idea.

Some more info: We definitely think it's a server issue because we've had two outages recently. A node in the cluster will just start timing out all requests sent to it, and we get a NATS::IO::SocketTimeoutError (Ruby client). Restarting the node fixes the issue.

derekcollison commented 7 months ago

@cjbottaro are you a Synadia customer? We would want to jump on a call and try to triage the system directly. What you are experiencing is not common from our other users and customers.

cjbottaro commented 7 months ago

We figured it out. We were abusing v1 requests to the consumer_msg_next endpoint in order to multiplex blocking-style requests on a single NATS connection. It seems that 2.10 is sensitive to that (maybe the constant creating/deleting of subscriptions), because it didn't seem to be an issue with 2.9.

It would be neat if consumer_msg_next added the reply_to in a header or something so that we could use it with v2-style requests, but I understand that doesn't jibe well with batch_size > 1.
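
For anyone hitting the same thing, a minimal sketch of the multiplexed pull pattern the maintained clients use: subscribe once to a long-lived inbox and publish pull requests that name that inbox as the reply subject, instead of creating a fresh subscription per request. Stream/consumer names and the batch/expiry values below are illustrative.

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// One long-lived inbox subscription receives every pulled message.
	inbox := nats.NewInbox()
	sub, err := nc.SubscribeSync(inbox)
	if err != nil {
		log.Fatal(err)
	}

	next := "$JS.API.CONSUMER.MSG.NEXT.CHANGE_STREAM.change-stream-consumer"
	req := []byte(`{"batch":10,"expires":5000000000}`) // 10 messages or 5s, whichever comes first

	for {
		// Issue a pull request; replies are routed to the shared inbox.
		if err := nc.PublishRequest(next, inbox, req); err != nil {
			log.Fatal(err)
		}
		for i := 0; i < 10; i++ {
			msg, err := sub.NextMsg(6 * time.Second)
			if err != nil {
				break // timed out waiting; issue the next pull
			}
			if msg.Header.Get("Status") != "" {
				break // 404/408/409 status message: no messages or request expired
			}
			// ... handle msg.Data ...
			_ = msg.Respond(nil) // ack by replying to the delivered message
		}
	}
}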

Sorry for the noise!

> are you a Synadia customer?

We're working on that! Hopefully in the next year or two, we can figure out our budget.

derekcollison commented 7 months ago

Which client are you using again?

cjbottaro commented 7 months ago

Our own Elixir client, along with nats-pure for Ruby and the official Go client.

derekcollison commented 7 months ago

Might be good to sync with @Jarema and @piotrpio on the low level details of how to interact with the server for pulls.

Jarema commented 7 months ago

@cjbottaro Definitely! Reach us on NATS Slack.