Open atombender opened 2 weeks ago
Also reproduced with NATS 2.10.19 and nats.go 1.37.0.
Also to add: It might be a client issue, but I opened it here because it's unclear.
To clarify: is the high CPU on the app or on the nats-server?
Both. What appears to be happening is that the pull is retried repeatedly, which increases the CPU both on the client and server sides.
With what frequency? Do you have any delay between retries? Maybe a backoff?
This is a single call to `consumer.Consume()`, so I am not in control of the loop, NATS is. You may want to look at the repro code. 🙂
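For readers without the repro handy, the call shape in question looks roughly like this; a minimal sketch with invented stream/consumer names, not the actual repro code:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Assumes a stream "my-stream" with a durable consumer "my-consumer"
	// already exists (names invented for this sketch).
	cons, err := js.Consumer(ctx, "my-stream", "my-consumer")
	if err != nil {
		log.Fatal(err)
	}

	// One Consume() call; the pull/retry loop is driven entirely by nats.go.
	// PullMaxBytes caps each pull request, so a larger message at the head
	// of the stream makes every request fail with the 409 shown below.
	cc, err := cons.Consume(func(msg jetstream.Msg) {
		msg.Ack()
	}, jetstream.PullMaxBytes(1000))
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Stop()

	time.Sleep(time.Minute) // watch client and server CPU climb
}
```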
thanks for sharing the repro @atombender we'll take a look
Thanks for the repro, could hear from my laptop's fan speed that it's working :sweat_smile:
Seems the server is spammed with `MSG.NEXT` requests that immediately fail due to the `MaxBytes` being set, so it's essentially looping forever:
```
[#464089] Received on "$JS.API.CONSUMER.MSG.NEXT.my-stream.my-consumer" with reply "_INBOX.GCjLIXdluJQzEFU7AYNgDy"
{"expires":300000000000,"batch":1000000,"max_bytes":1000,"idle_heartbeat":30000000000}

[#464090] Received on "_INBOX.GCjLIXdluJQzEFU7AYNgDy"
Nats-Pending-Messages: 1000000
Nats-Pending-Bytes: 1000
Status: 409
Description: Message Size Exceeds MaxBytes

nil body
```
The client should at least log the error being hit, and should probably wait between retries.
But possibly this condition will never clear, since the message just stays in the stream, making the consumer stall. What should we do in this case / what is the intended behaviour?
Possibly the server could also protect itself more by not immediately sending the error back, which would also slow down the client.
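As a stopgap on the application side, one could skip `Consume()`'s internal loop entirely and drive the pulls with `Fetch()`, pacing retries with a backoff. A sketch (imports as in the snippet above; `FetchMaxBytes` plays the fetch-side role of `PullMaxBytes`, and whether the 409 surfaces via `MessageBatch.Error()` here is an assumption):

```go
// Workaround sketch: pace the pulls manually instead of letting Consume()
// retry in a tight loop.
func pullLoop(ctx context.Context, cons jetstream.Consumer) error {
	backoff := 100 * time.Millisecond
	for ctx.Err() == nil {
		batch, err := cons.Fetch(100, jetstream.FetchMaxBytes(1000))
		if err != nil {
			return err
		}
		for msg := range batch.Messages() {
			msg.Ack()
		}
		if err := batch.Error(); err != nil {
			// Assumed: the 409 "Message Size Exceeds MaxBytes" lands here.
			log.Printf("fetch error, backing off %s: %v", backoff, err)
			select {
			case <-ctx.Done():
			case <-time.After(backoff):
			}
			if backoff < 5*time.Second {
				backoff *= 2
			}
			continue
		}
		backoff = 100 * time.Millisecond
	}
	return ctx.Err()
}
```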
The documentation is not clear about whether the max bytes limit is a hard or soft limit. It is apparent from the observed behavior that it is a hard limit that will prevent the consumer from consuming the next message forever.
In other words, it's possible to write a consumer that just stops being able to consume (until the offending message is deleted/expired). That condition should at least be detectable at the consumer level.
It's debatable whether the client should retry, given that in a JetStream context it can't expect the next call to work — it's going to be stuck retrying until the blocking message is gone.
IMHO this should be moved to a nats.go issue rather than nats-server, the problem being with the `Consume()` client-side code retrying to get the message.
How it should behave instead is open for discussion, but I would say it should not retry and should instead signal the client app (which for `Consume()` is not obvious how to do) or log the error. In my opinion, when a message in the stream is larger than the NEXT request's max bytes, retrying will not do anything other than fail again (at least until the message in question gets removed from the stream). The issue is with the client application not specifying a max bytes value high enough (or, if it's the message that's "too large", the issue is more with the admin not setting a max message size in the stream's config).
Agreed.
We're having some ongoing discussions on how to generally improve the `max_bytes` behaviour in case of a single message exceeding the batch config, but agree that this specific issue is client related. Moving it to nats.go.
#### Observed behavior
In production, we noticed that NATS would periodically spike in CPU usage despite no sign of increased message volume or any other metric that seemed relevant.
We were able to narrow it down to setting `jetstream.PullMaxBytes()` with the pull consumer. If a message entered the stream that exceeded this size, the client would get a `409 Message Size Exceeds MaxBytes` error from the server and apparently retry. Removing the max bytes limit fixed our issue.

Also note that the error does not bubble up to the consumer's error handler callback. We were using `jetstream.ConsumeErrHandler()` to log errors. However, after a while this callback is called with the error `nats: no heartbeat received`.

#### Expected behavior
NATS should not use this much CPU.
#### Server and client version
#### Host environment
Linux, Kubernetes.
#### Steps to reproduce
Full reproduction here.
`jetstream.PullMaxBytes()` passed to `consumer.Consume()` using a low maximum size.
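A condensed version of what such a reproduction boils down to, sketched with invented names and sizes (not the linked repro itself; imports as in the first sketch above):

```go
// Repro sketch: publish one message larger than the pull request's
// max_bytes, then Consume with PullMaxBytes below that size.
func reproduce(ctx context.Context, js jetstream.JetStream) error {
	if _, err := js.CreateStream(ctx, jetstream.StreamConfig{
		Name:     "my-stream",
		Subjects: []string{"my-subject"},
	}); err != nil {
		return err
	}
	// 10 KiB payload, well over the 1000-byte limit used below.
	if _, err := js.Publish(ctx, "my-subject", make([]byte, 10*1024)); err != nil {
		return err
	}
	cons, err := js.CreateOrUpdateConsumer(ctx, "my-stream", jetstream.ConsumerConfig{
		Durable: "my-consumer",
	})
	if err != nil {
		return err
	}
	cc, err := cons.Consume(
		func(msg jetstream.Msg) { msg.Ack() },
		jetstream.PullMaxBytes(1000),
	)
	if err != nil {
		return err
	}
	defer cc.Stop()
	<-ctx.Done() // CPU climbs while the client retries the failing pull
	return ctx.Err()
}
```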