Closed: kylemcc closed this issue 5 months ago.
Would it be possible to have a call with the Synadia Team?
If you create a backup of the stream and try to restore it somewhere else, does the issue re-occur? If so, would it be possible to get access to the backup?
Keeping this open so that we can hopefully figure out how the system got into that state.
Hey @derekcollison, thanks for the quick turnaround on this! Would be more than happy to hop on a call if it'd be helpful. I'd also be interested in discussing the occasional OOM issues we encounter and how we can troubleshoot and/or provide actionable info here.
In the meantime, I'll attempt a backup/restore to see if I can reliably reproduce the issue. If so, happy to share for troubleshooting purposes.
I would be interested in knowing about the OOMs also. It would be good to get a memory profile from the server before it gets OOM'd.
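For anyone hitting this, here is a minimal sketch of how heap profiles could be collected ahead of an OOM, assuming the server's profiling port is enabled (for example via `prof_port` in the server config). The endpoint address, output file names, and polling interval are placeholders to adapt:

```go
// heapsnap.go: a minimal sketch for periodically capturing heap profiles
// from a nats-server whose Go profiling endpoint is enabled. The URL and
// interval below are assumptions, not a recommended configuration.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Assumed pprof endpoint exposed by the server's profiling port.
	const heapURL = "http://127.0.0.1:65432/debug/pprof/heap"

	for {
		if err := snapshot(heapURL); err != nil {
			fmt.Fprintln(os.Stderr, "snapshot failed:", err)
		}
		time.Sleep(30 * time.Second)
	}
}

// snapshot downloads one heap profile and writes it to a timestamped file,
// so the last few files before an OOM can be kept and inspected.
func snapshot(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	name := fmt.Sprintf("heap-%s.pprof", time.Now().Format("20060102-150405"))
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}
```

A snapshot captured shortly before the crash can then be opened with `go tool pprof` to see which allocations dominate.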
This reminds me of an old problem in 2.9.3 (https://github.com/nats-io/nats-server/issues/3517). In the past there was a problem with slow disks: we used GP2 drives, and at the time I was told that NVMe disks were needed for JetStream to work properly. In our case, data was ending up in the $G stream, and syncing it with the actual stream caused out-of-memory events. We stopped using JetStream to avoid the issue.
The product has advanced quite a bit since 2.9.3, with the 2.9 series now at 2.9.25 and the current 2.10 series at 2.10.12.
Observed behavior
NATS server (running only JetStream) is panicking when attempting to allocate a buffer here, in `indexCacheBuf`, during compaction. Stack trace:
Expected behavior
No panic :)
Server and client version
$ nats-server --version
nats-server: v2.10.12
Host environment
Relevant NATS config:
- Cluster size: 5
- `GOMEMLIMIT` is set to 70% of the available RAM on the machine

This doesn't appear to be environment-specific, but sharing anyway:

Environment
- Cloud: AWS
- Instance type: m7i-flex.xlarge
- vCPUs: 4
- RAM: 16 GiB
- Disk: 1 TiB EBS gp3
- Container runtime: none
- OS: Amazon Linux 2
Steps to reproduce
Not 100% sure how to trigger this, but I observed this repeatedly today while troubleshooting nodes that were recovering after an OOM (and were logging messages such as `Stream state encountered internal inconsistency on recover`). It's possible it's related to our usage patterns.

In case it's helpful, here's what our current usage looks like: we have a few "large" streams (all R=3, with `MaxBytes` set to 100 GiB) and a bunch of much smaller streams. Our sustained throughput is typically around 8-9 MiB/s, or around 750 GiB/day, for the cluster. This doesn't feel like a lot to me, but we have been seeing other reliability issues (such as frequent OOMs which often lead to corrupted streams, long restart/recovery times, corrupted consumers, stalled leader elections, etc.).
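For reference, one of those "large" streams could be defined with the Go client roughly as follows; this is a sketch under assumed names (stream name, subjects, server URL), not our actual configuration:

```go
// streams.go: a minimal sketch of creating a file-backed stream with
// R=3 and MaxBytes = 100 GiB using the nats.go JetStream API.
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// File-backed stream capped at 100 GiB, replicated across 3 nodes.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "EVENTS",               // hypothetical name
		Subjects: []string{"events.>"},   // hypothetical subject space
		Storage:  nats.FileStorage,
		Replicas: 3,
		MaxBytes: 100 * 1024 * 1024 * 1024,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```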