nats-io / nats-server

High-performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Server panic on startup #4659

Closed: aldiesel closed this issue 10 months ago

aldiesel commented 1 year ago

What version were you using?

NATS 2.10.2

What environment was the server running in?

Kubernetes using NATS docker container

Is this defect reproducible?

Unsure how to reproduce.

Given the capability you are leveraging, describe your expectation?

No panic.

Given the expectation, what is the defect you are observing?

```
spinalcord [1] 2023/10/13 00:11:55.200334 [INF] Server is ready
spinalcord panic: runtime error: makeslice: cap out of range
spinalcord 
spinalcord goroutine 1 [running]:
spinalcord github.com/nats-io/nats-server/v2/server.(*msgBlock).indexCacheBuf(0xc001db51e0, {0x1079080, 0x0, 0x0})
spinalcord     github.com/nats-io/nats-server/v2/server/filestore.go:4843 +0x1aa
spinalcord github.com/nats-io/nats-server/v2/server.(*msgBlock).loadMsgsWithLock(0xc001db51e0)
spinalcord     github.com/nats-io/nats-server/v2/server/filestore.go:5225 +0x489
spinalcord github.com/nats-io/nats-server/v2/server.(*msgBlock).generatePerSubjectInfo(0xc001db51e0)
spinalcord     github.com/nats-io/nats-server/v2/server/filestore.go:6678 +0x74
spinalcord github.com/nats-io/nats-server/v2/server.(*msgBlock).ensurePerSubjectInfoLoaded(0xa9ef80?)
spinalcord     github.com/nats-io/nats-server/v2/server/filestore.go:6732 +0x85
spinalcord github.com/nats-io/nats-server/v2/server.(*fileStore).enforceMsgPerSubjectLimit(0xc000356300)
spinalcord     github.com/nats-io/nats-server/v2/server/filestore.go:3254 +0x76c
spinalcord github.com/nats-io/nats-server/v2/server.newFileStoreWithCreated({{0xc00022fbf0, 0x2b}, 0x800000, 0x12a05f200, 0x1bf08eb000, 0x0, 0x0, 0x0, 0x1, 0xc000170d80}, ...)
spinalcord     github.com/nats-io/nats-server/v2/server/filestore.go:489 +0x10d4
spinalcord github.com/nats-io/nats-server/v2/server.(*stream).setupStore(0xc0000df500, 0xc0002a68a0)
spinalcord     github.com/nats-io/nats-server/v2/server/stream.go:3712 +0x489
spinalcord github.com/nats-io/nats-server/v2/server.(*Account).addStreamWithAssignment(0xc00017bb80, 0xc000142a18, 0x0, 0x0)
spinalcord     github.com/nats-io/nats-server/v2/server/stream.go:618 +0x148a
spinalcord github.com/nats-io/nats-server/v2/server.(*Account).addStream(...)
spinalcord     github.com/nats-io/nats-server/v2/server/stream.go:375
spinalcord github.com/nats-io/nats-server/v2/server.(*Account).EnableJetStream(0xc00017bb80, 0xc0000233b0)
spinalcord     github.com/nats-io/nats-server/v2/server/jetstream.go:1307 +0x3dd7
spinalcord github.com/nats-io/nats-server/v2/server.(*Server).configJetStream(0xc000170d80, 0xc00017bb80)
spinalcord     github.com/nats-io/nats-server/v2/server/jetstream.go:707 +0xeb
spinalcord github.com/nats-io/nats-server/v2/server.(*Server).configAllJetStreamAccounts(0xc000170d80)
spinalcord     github.com/nats-io/nats-server/v2/server/jetstream.go:768 +0x2e6
spinalcord github.com/nats-io/nats-server/v2/server.(*Server).enableJetStreamAccounts(0xc00024a100?)
spinalcord     github.com/nats-io/nats-server/v2/server/jetstream.go:637 +0xc5
spinalcord github.com/nats-io/nats-server/v2/server.(*Server).enableJetStream(0xc000170d80, {0x100000000, 0x200000000, {0xc000228c60, 0x1a}, 0x1bf08eb000, 0x0, {0xc000180784, 0x3}, 0x1, ...})
spinalcord     github.com/nats-io/nats-server/v2/server/jetstream.go:447 +0xa12
spinalcord github.com/nats-io/nats-server/v2/server.(*Server).EnableJetStream(0xc000170d80, 0xc00020fe18)
spinalcord     github.com/nats-io/nats-server/v2/server/jetstream.go:221 +0x4a5
spinalcord github.com/nats-io/nats-server/v2/server.(*Server).Start(0xc000170d80)
spinalcord     github.com/nats-io/nats-server/v2/server/server.go:2236 +0x100d
spinalcord github.com/nats-io/nats-server/v2/server.Run(...)
spinalcord     github.com/nats-io/nats-server/v2/server/service.go:22
spinalcord main.main()
spinalcord     github.com/nats-io/nats-server/v2/main.go:127 +0x325
Stream closed EOF for redacted
```
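For context on the panic itself: "makeslice: cap out of range" is the Go runtime rejecting a slice allocation whose capacity is negative or implausibly large. A minimal illustration of how this class of panic can arise when a size is derived from corrupted or truncated on-disk data (plain Go, not nats-server code):

```go
package main

// Minimal illustration of the runtime error in the trace above. make()
// panics with "makeslice: cap out of range" when the requested capacity
// is negative or too large, e.g. when it was computed from a corrupted
// or truncated message-block file. This is not nats-server code.
func main() {
	var recordCount int64 = -1 // stand-in for a count read from a bad block file

	// Panics at runtime: "panic: runtime error: makeslice: cap out of range".
	idx := make([]uint32, 0, recordCount)
	_ = idx
}
```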
neilalexander commented 1 year ago

Did this happen when going from an earlier version to 2.10.2, or were you running 2.10.2 already before the restart which caused this problem?

Is it still a problem with 2.10.3?

derekcollison commented 1 year ago

This looks like the server may have been OOM-killed rather than hitting a server panic.

How big is the stream? Is it a KV?

How much memory is the container limited to in Docker? Do you set the soft memory limit environment variable (GOMEMLIMIT)?

aldiesel commented 1 year ago

Mix of streams and KV streams. The main stream in use had a 1 GB file limit and held about 100 MB. We did notice that this leaf node was on the same account, and the other stream was pumping 20k msg/s into its stream/domain.
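(For reference, a file-backed stream with that kind of size cap could be declared with the nats.go client roughly as below; the stream name and subjects are invented for illustration.)

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical stream matching the shape described above:
	// file storage capped at 1 GiB. Name and subjects are invented.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "EVENTS",
		Subjects: []string{"events.>"},
		Storage:  nats.FileStorage,
		MaxBytes: 1 << 30, // 1 GiB file limit
	}); err != nil {
		log.Fatal(err)
	}
}
```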

Yes, GOMEMLIMIT is set. The container was limited to 12 GB of memory, though GOMEMLIMIT may have been 3 GB. Good point.

We had to delete the whole file and restart the server fresh.
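(As an aside on the soft memory limit discussed above: GOMEMLIMIT, e.g. GOMEMLIMIT=3GiB, sets the Go runtime's soft memory limit, and the same value can be queried or set from code via runtime/debug. A minimal sketch:)

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

func main() {
	// The soft limit is usually configured via the environment, e.g.
	// GOMEMLIMIT=3GiB, commonly set below the container's hard limit.
	fmt.Println("GOMEMLIMIT env:", os.Getenv("GOMEMLIMIT"))

	// A negative argument leaves the limit unchanged and returns the
	// current value (math.MaxInt64 when no limit is set).
	fmt.Println("current soft limit (bytes):", debug.SetMemoryLimit(-1))

	// Equivalent to GOMEMLIMIT=3GiB, set programmatically.
	debug.SetMemoryLimit(3 << 30)
}
```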

aldiesel commented 1 year ago

How much memory should we be dedicating at a minimum? Some rough rules would be great. Is it linearly proportional to stream size, or is there some minimum that must be allocated? If a 1 GB stream can crash from needing 3 GB of memory, what would we need for a 1 TB stream, for example?

Thank you for the quick response

aldiesel commented 1 year ago

Also, I just double-checked: I believe our memory limit in this case was 6.4 GB (80% of 8 GB).

derekcollison commented 1 year ago

I apologize; looking at the stack again, it is indeed a panic, and one I recognize that has been fixed. Try 2.10.3.

For memory usage, we are constantly looking to improve the memory footprint of JetStream-enabled servers. Currently we have a block architecture and a caching mechanism on top of it, such that blocks are dropped when no longer active. This works well for streams where consumers usually traverse messages in FIFO order. For random access patterns like KVs, more blocks can be loaded at once, and hence we make those blocks smaller.
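(To illustrate the caching idea described here, a deliberately simplified sketch, not the actual filestore code: decoded blocks are kept in a small in-memory cache, and the least recently used one is dropped when the cache is full. FIFO consumers keep one hot block, while random KV access touches many, which is why smaller blocks help there.)

```go
package main

import "container/list"

// Hypothetical, simplified sketch of a message-block cache; the real
// filestore logic in nats-server is considerably more involved.
type block struct {
	id   int
	data []byte // decoded block contents
}

type blockCache struct {
	max   int                   // max blocks held in memory
	order *list.List            // front = most recently used
	items map[int]*list.Element // block id -> list element
}

func newBlockCache(max int) *blockCache {
	return &blockCache{max: max, order: list.New(), items: make(map[int]*list.Element)}
}

// get returns a cached block, loading it (and evicting the least
// recently used block) when necessary.
func (c *blockCache) get(id int, load func(int) *block) *block {
	if el, ok := c.items[id]; ok {
		c.order.MoveToFront(el) // mark as recently used
		return el.Value.(*block)
	}
	if c.order.Len() >= c.max {
		// Drop the block that has been inactive the longest.
		old := c.order.Remove(c.order.Back()).(*block)
		delete(c.items, old.id)
	}
	b := load(id)
	c.items[id] = c.order.PushFront(b)
	return b
}

func main() {
	c := newBlockCache(2)
	load := func(id int) *block { return &block{id: id, data: make([]byte, 8)} }
	c.get(1, load)
	c.get(2, load)
	c.get(3, load) // evicts block 1, the least recently used
}
```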

We are looking into ways to improve this such that the system could have a fixed allocation strategy, or a high bar for memory usage that we will try not to exceed.