nats-io / nats.go

Golang client for NATS, the cloud native messaging system.
https://nats.io
Apache License 2.0

Pull consumer with max bytes setting causes high CPU usage [v2.10.18] #1718

Open · atombender opened this issue 2 weeks ago

atombender commented 2 weeks ago

Observed behavior

In production, we noticed that NATS would periodically spike in CPU usage despite no sign of increased message volume or any other metric that seemed relevant.

[Screenshot, 2024-09-13 19:50: CPU usage graph showing the periodic spikes]

We were able to narrow it down to setting jetstream.PullMaxBytes() with the pull consumer. If a message entered the stream that exceeded this size, the client would get a 409 Message Size Exceeds MaxBytes error from the server and apparently retry. Removing the max bytes limit fixed our issue.

Also note that the error does not bubble up to the consumer's error handler callback. We were using jetstream.ConsumeErrHandler() to log errors. However, after a while this callback is called with the error nats: no heartbeat received.

Expected behavior

NATS should not use this much CPU.

Server and client version

Host environment

Linux, Kubernetes.

Steps to reproduce

Full reproduction here.

  1. Start a consumer with jetstream.PullMaxBytes() passed to consumer.Consume() using a low maximum size.
  2. Publish a message exceeding that size (sketched below).
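A minimal sketch of both steps (stream, consumer, and subject names are illustrative; the linked repro is the authoritative version):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Assumes a stream "my-stream" (subject "my-subject") and a durable
	// consumer "my-consumer" already exist.
	cons, err := js.Consumer(ctx, "my-stream", "my-consumer")
	if err != nil {
		log.Fatal(err)
	}

	// Step 1: consume with a low per-request byte limit.
	cc, err := cons.Consume(func(msg jetstream.Msg) {
		log.Printf("received %d bytes", len(msg.Data()))
		msg.Ack()
	},
		jetstream.PullMaxBytes(1000),
		jetstream.ConsumeErrHandler(func(_ jetstream.ConsumeContext, err error) {
			// Per the report, the 409 never shows up here; only
			// "nats: no heartbeat received" appears after a while.
			log.Printf("consume error: %v", err)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Stop()

	// Step 2: publish a message larger than the 1000-byte limit; CPU usage
	// on both client and server then climbs as the pull request is retried.
	if _, err := js.Publish(ctx, "my-subject", make([]byte, 10_000)); err != nil {
		log.Fatal(err)
	}

	// Keep the process alive long enough to observe the CPU spike.
	time.Sleep(time.Minute)
}
```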
atombender commented 2 weeks ago

Also reproduced with NATS 2.10.19 and nats.go 1.37.0.

atombender commented 2 weeks ago

To add: this might be a client issue, but I opened it here because it's unclear which side the problem is on.

wallyqs commented 2 weeks ago

To clarify: is the high CPU on the app or the nats-server?

atombender commented 2 weeks ago

Both. What appears to be happening is that the pull is retried repeatedly, which drives up CPU on both the client and the server.

ripienaar commented 2 weeks ago

With what frequency? Do you have any delay between retries? Maybe a backoff?

atombender commented 2 weeks ago

This is a single call to consumer.Consume(), so I am not in control of the loop, NATS is. You may want to look at the repro code. 🙂

wallyqs commented 2 weeks ago

Thanks for sharing the repro @atombender, we'll take a look.

MauriceVanVeen commented 2 weeks ago

Thanks for the repro, I could hear from my laptop's fan that it's working :sweat_smile:

Seems the server is spammed with MSG.NEXT requests that immediately fail due to the MaxBytes being set, so it's essentially looping forever:

[#464089] Received on "$JS.API.CONSUMER.MSG.NEXT.my-stream.my-consumer" with reply "_INBOX.GCjLIXdluJQzEFU7AYNgDy"
{"expires":300000000000,"batch":1000000,"max_bytes":1000,"idle_heartbeat":30000000000}

[#464090] Received on "_INBOX.GCjLIXdluJQzEFU7AYNgDy"
Nats-Pending-Messages: 1000000
Nats-Pending-Bytes: 1000
Status: 409
Description: Message Size Exceeds MaxBytes

nil body
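For reference, one way to watch these pull requests yourself is to subscribe to the JS API subject from another connection on the same account; a rough sketch (the URL and subject wildcard are the only assumptions here):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect on the same account as the consumer (URL is illustrative).
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Every pull request the consumer issues, including each retry in the
	// loop described above, is published on this subject.
	_, err = nc.Subscribe("$JS.API.CONSUMER.MSG.NEXT.>", func(m *nats.Msg) {
		log.Printf("pull request on %s: %s", m.Subject, string(m.Data))
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever
}
```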

The client should at least log the error being hit, and should probably wait between retries.

But this condition will possibly never clear, since the message just stays in the stream, stalling the consumer. What should we do in this case / what is the intended behaviour?

MauriceVanVeen commented 2 weeks ago

Possibly the server could also protect itself more by not immediately sending the error back, which would also slow down the client.

atombender commented 2 weeks ago

The documentation is not clear about whether the max bytes limit is a hard or soft limit. It is apparent from the observed behavior that it is a hard limit that permanently prevents the consumer from consuming the next message.

In other words, it's possible to write a consumer that just stops being able to consume (until the offending message is deleted/expired). That condition should at least be detectable at the consumer level.

It's debatable whether the client should retry, given that in a JetStream context it can't expect the next call to work — it's going to be stuck retrying until the blocking message is gone.
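For contrast, with the lower-level Fetch/FetchBytes API the application owns the pull loop and can apply its own backoff or give up; a rough sketch (names, limits, and whether the max-bytes 409 actually surfaces through batch.Error() are assumptions, not verified against this failure mode):

```go
package example

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// pullLoop is a hypothetical application-owned alternative to Consume():
// the app decides when to retry and how long to back off after errors.
func pullLoop(ctx context.Context, cons jetstream.Consumer) {
	for ctx.Err() == nil {
		batch, err := cons.FetchBytes(1000, jetstream.FetchMaxWait(5*time.Second))
		if err != nil {
			log.Printf("fetch failed: %v", err)
			time.Sleep(time.Second) // application-controlled backoff
			continue
		}
		for msg := range batch.Messages() {
			msg.Ack()
		}
		if err := batch.Error(); err != nil {
			// Whether the 409 "Message Size Exceeds MaxBytes" shows up here
			// is an assumption worth verifying on the client version in use.
			log.Printf("batch error: %v", err)
			time.Sleep(time.Second)
		}
	}
}
```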

jnmoyne commented 1 week ago

IMHO this should be moved to a nats.go issue rather than nats-server, the problem being with the Consume() client-side code retrying to get the message.

How it should behave instead is open for discussion, but I would say it should not retry and should instead signal the client app (though for Consume() it is not obvious how to do that) or log the error. In my opinion, when a message in the stream is larger than the NEXT request's max bytes, retrying will not do anything other than fail again (at least until the message in question gets removed from the stream). The issue is with the client application not specifying a max bytes value high enough (or, if it's the message that's 'too large', the issue is more with the admin not setting a max message size in the stream's config).
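A sketch of that last guard, capping message size on the stream itself so an oversized message is rejected at publish time rather than wedging a max-bytes pull later (names and sizes are illustrative; assumes a jetstream.JetStream handle and context as in the repro sketch above):

```go
// createCappedStream is a hypothetical helper; MaxMsgSize is a real
// jetstream.StreamConfig field, everything else is made up for the example.
func createCappedStream(ctx context.Context, js jetstream.JetStream) (jetstream.Stream, error) {
	return js.CreateStream(ctx, jetstream.StreamConfig{
		Name:       "my-stream",
		Subjects:   []string{"my-subject"},
		MaxMsgSize: 1000, // keep at or below the smallest PullMaxBytes used by consumers
	})
}
```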

atombender commented 1 week ago

Agreed.

Jarema commented 5 days ago

We're having some ongoing discussions about how to generally improve the max_bytes behaviour in the case of a single message exceeding the batch config, but I agree that this specific issue is client-related. Moving it to nats.go.