nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

JetStream: panic: runtime error: makeslice: cap out of range #5236

Closed: kylemcc closed this issue 5 months ago

kylemcc commented 6 months ago

Observed behavior

NATS server (running only JetStream) is panicking when attempting to allocate a buffer in indexCacheBuf during compaction (filestore.go:5242 in the trace below).

Stacktrace:

panic: runtime error: makeslice: cap out of range
goroutine 202019 [running]:
github.com/nats-io/nats-server/v2/server.(*msgBlock).indexCacheBuf(0xc00616b040, {0xc01006e000, 0x5a5ca, 0x5a5ca})
github.com/nats-io/nats-server/v2/server/filestore.go:5242 +0x1d4
github.com/nats-io/nats-server/v2/server.(*msgBlock).loadMsgsWithLock(0xc00616b040)
github.com/nats-io/nats-server/v2/server/filestore.go:5685 +0x489
github.com/nats-io/nats-server/v2/server.(*msgBlock).compact(0xc00616b040)
github.com/nats-io/nats-server/v2/server/filestore.go:3865 +0x99
github.com/nats-io/nats-server/v2/server.(*fileStore).syncBlocks(0xc0036fd500)
github.com/nats-io/nats-server/v2/server/filestore.go:5103 +0x35c
created by time.goFunc
time/sleep.go:176 +0x2d
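
For illustration, here is a minimal Go sketch (not nats-server code) of the failure mode: make panics with exactly this message when the requested capacity is negative or exceeds the maximum allocation size, which can happen if the capacity is derived from a corrupted length or message-count field.

package main

import "fmt"

// Minimal sketch, not nats-server code: if a slice capacity is derived from a
// corrupted on-disk length/count field, it can come out negative (or absurdly
// large), and make() then panics with "makeslice: cap out of range".
func main() {
	defer func() {
		if r := recover(); r != nil {
			// Prints: recovered: runtime error: makeslice: cap out of range
			fmt.Println("recovered:", r)
		}
	}()

	corruptedCount := int32(-1)              // hypothetical corrupted field
	idx := make([]uint32, 0, corruptedCount) // panics before anything is written
	_ = idx
}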

Expected behavior

No panic :)

Server and client version

$ nats-server --version
nats-server: v2.10.12

Host environment

Relevant NATS Config

Cluster size: 5

GOMEMLIMIT is set to 70% of the available RAM on the machine
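
(As a worked example of that setting, and assuming the 16 GiB instance described under Environment below, here is a small sketch of the arithmetic behind the resulting GOMEMLIMIT value; the exact number and unit format are illustrative, not from the reporter's configuration.)

package main

import "fmt"

// Worked example only: the report above sets GOMEMLIMIT to 70% of available
// RAM, and the instance described below has 16 GiB. The resulting soft limit
// (about 11.2 GiB) would be exported to the nats-server process, e.g.
// GOMEMLIMIT=11468MiB.
func main() {
	const totalRAM = 16 << 30 // 16 GiB in bytes
	limit := totalRAM * 70 / 100
	fmt.Printf("GOMEMLIMIT=%dMiB\n", limit>>20) // GOMEMLIMIT=11468MiB
}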

This doesn't appear to be environment-specific, but sharing anyway:

Environment

Cloud: AWS
Instance type: m7i-flex.xlarge
vCPUs: 4
RAM: 16 GiB
Disk: 1 TiB EBS gp3
Container runtime: none

OS

Amazon Linux 2

# uname -a
Linux ip-172-16-30-97.us-west-2.compute.internal 4.14.334-252.552.amzn2.x86_64 #1 SMP Tue Jan 2 17:47:37 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
#  cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"

Steps to reproduce

Not 100% sure how to trigger this, but I observed this repeatedly today while troubleshooting nodes that were recovering after an OOM (and were logging messages such as Stream state encountered internal inconsistency on recover). It's possible it's related to our usage patterns.

In case it's helpful, here's what our current usage looks like: we have a few "large" streams (all R=3, with MaxBytes set to 100 GiB) and a bunch of much smaller streams. Our sustained throughput is typically around 8-9 MiB/s, or roughly 750 GiB/day for the cluster. This doesn't feel like a lot to me, but we have been seeing other reliability issues (frequent OOMs that often lead to corrupted streams, long restart/recovery times, corrupted consumers, stalled leader elections, etc.).
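
For illustration, a rough sketch (using the nats.go JetStream client) of a stream with that general shape; the connection URL, stream name, and subject below are hypothetical, not taken from the reporter's configuration.

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

// Sketch only: a file-backed stream roughly matching the description above
// (R=3, MaxBytes of 100 GiB). The URL, stream name, and subject are
// hypothetical placeholders.
func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "EVENTS",
		Subjects: []string{"events.>"},
		Storage:  nats.FileStorage,
		Replicas: 3,                        // R=3
		MaxBytes: 100 * 1024 * 1024 * 1024, // 100 GiB retention cap
	})
	if err != nil {
		log.Fatal(err)
	}
}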

derekcollison commented 6 months ago

Would it be possible to have a call with the Synadia Team?

If you create a backup of the stream and try to restore it somewhere else, does the issue re-occur? If so, would it be possible to get access to the backup?

derekcollison commented 6 months ago

Keeping this open so that we can hopefully figure out how the system got into that state.

kylemcc commented 5 months ago

Hey @derekcollison, thanks for the quick turnaround on this! Would be more than happy to hop on a call if it'd be helpful. I'd also be interested in discussing the occasional OOM issues we encounter and how we can troubleshoot and/or provide actionable info here.

In the meantime, I'll attempt a backup/restore to see if I can reliably reproduce the issue. If so, happy to share for troubleshooting purposes.

derekcollison commented 5 months ago

I would be interested in knowing about the OOMs also. It would be good to get a memory profile from the server before it gets OOM'd.
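
For illustration, assuming the server has profiling enabled (the prof_port setting in the server config), a small sketch that pulls a heap profile over HTTP; the host and port here are hypothetical. The resulting file can then be inspected with go tool pprof.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// Sketch only: fetch a heap profile from a nats-server with profiling
// enabled (e.g. prof_port: 65432 in its config). Host and port are
// hypothetical. Inspect the result with: go tool pprof heap.prof
func main() {
	resp, err := http.Get("http://127.0.0.1:65432/debug/pprof/heap")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch heap profile:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.prof")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create file:", err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "write profile:", err)
		os.Exit(1)
	}
	fmt.Println("wrote heap.prof")
}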

tvojacek commented 5 months ago

This reminds me of an old problem from 2.9.3: https://github.com/nats-io/nats-server/issues/3517. Back then the problem was slow disks. We used GP2 drives, and at the time I was told that proper operation required NVMe disks. In our case, data was ending up in the $G stream, and syncing it with the actual stream caused out-of-memory events. We stopped using JetStream to avoid the issue.

derekcollison commented 5 months ago

The product has advanced quite a bit since 2.9.3: the 2.9 series is now at 2.9.25, and the current 2.10 series is at 2.10.12.