Closed ajlane closed 2 years ago
Thanks, will take a look and make sure it's either already fixed or will be in the next release.
What hardware were you running on? Does it crash every time you start the server? Would you be willing to share your storage directory so we can reproduce here?
I have an idea of what could be happening, but I'm curious which of two theories is correct for your use case.
Thanks.
The node was an n1-standard-8. It was in a crash loop that never recovered.
I don't have the original storage directory, because I've since re-installed and restarted everything - I'll try to reproduce it.
One thing I failed to mention is that I am also updating a small number of values in a kvstore for each message, in case that's relevant.
OK, thanks. It might be good to upgrade to the latest. We are stabilizing quickly, but the use cases for the tech are so diverse and the usage so high that we are going as fast as we can to address any issues.
I reproduced the error. This is the contents of the data folder: jetstream.zip
That's super helpful. Thank you. Will jump on this tomorrow and keep you posted.
The index file seems corrupt. What steps did you take to reproduce? And just double-checking: this is server version 2.8.2, yes?
Also, for business continuity: if you remove the index file for blk 63, so 63.idx, the system should repair itself. I am trying to hunt down where we corrupt the idx file. It's not a filesystem thing; I believe we wrap negative for msgs and bytes.
Do you use erase vs delete for removing messages?
Also do you know if you call Truncate()?
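For anyone else hitting this, the repair step above can be sketched as follows. This uses a scratch directory to stand in for the real message-block directory, since the actual path depends on your store directory, account, and stream name (something like <store_dir>/jetstream/$G/streams/<stream>/msgs for the global account):

```shell
# Sketch of the suggested repair. Stop the server first; then remove only the
# index file for the corrupt block. The paths below are stand-ins created in a
# scratch directory, not the real store layout.
msgs_dir=$(mktemp -d)
touch "$msgs_dir/63.blk" "$msgs_dir/63.idx"   # stand-ins for the real block files

# Remove only the index file; the server should rebuild the index from 63.blk
# on the next start.
rm "$msgs_dir/63.idx"

ls "$msgs_dir"   # only 63.blk remains
```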
Interesting.
The reproduced error was also on 2.8.0, because that is the latest version installed by helm. I will attempt to replicate with 2.8.2.
I don't remove any messages, as far as I know. Maybe nats.py does something surprising. I subscribe to a handful of queues from a swarm of workers. Some messages take quite some time to be ack'd, but otherwise I don't think there's anything complicated happening.
To reproduce, I just re-installed from scratch and deployed all of my consumers. After a couple of hours, I shut down the node to force it to move to a new node. I suspect this might be a red herring, though; the server could have been in a bad state before I shut down the node. I'll try to be more observant the next time around.
I'll also look at trying to produce a cut-down version of my nats.py code to share.
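In the meantime, a hedged, cut-down sketch of that workload might look like the following: queue-subscribed workers that process slowly, write a few KV entries per message, and ack manually. Subject, queue, and bucket names here are invented for illustration, and this assumes the nats.py 2.x API:

```python
# Hypothetical reduction of the workload described above, not the user's actual
# code. Names ("jobs.>", "workers", "worker-state") are made up.
import asyncio

async def run_worker(server_url: str = "nats://127.0.0.1:4222") -> None:
    import nats  # nats.py client; imported here so the sketch stays self-contained

    nc = await nats.connect(server_url)
    js = nc.jetstream()
    kv = await js.create_key_value(bucket="worker-state")  # hypothetical bucket

    async def handle(msg):
        await asyncio.sleep(30)                 # stand-in for slow processing
        await kv.put("last_subject", msg.subject.encode())
        await msg.ack()                         # ack only once the work is done

    # Push consumer shared across the worker swarm via a queue group.
    await js.subscribe("jobs.>", queue="workers", cb=handle, manual_ack=True)
    await asyncio.Event().wait()                # keep the worker alive

# In a real worker this would be started with: asyncio.run(run_worker())
```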
No worries, I think I have a clue. Working on a test case now.
Unfortunately, I can confirm that the error persists in 2.8.2, even if I leave the node alone.
So a restart is not needed? Or is your system still doing automated server shutdowns on its own?
I've just had a chance to look through the logs to see what happened.
The server was under-resourced. I wasn't setting resources.requests and resources.limits with helm, so it was receiving only token amounts of CPU and memory despite there being plenty of available resources on the node. Because the server is so light and efficient, this wasn't immediately obvious.
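For anyone else who hits this, the fix is to give the pod explicit requests and limits in the Helm values. A minimal sketch, with the caveat that the exact key path for container resources differs between versions of the NATS Helm chart, so the names below are assumptions to verify against your chart's values.yaml:

```yaml
# Hypothetical values.yaml fragment for the NATS Helm chart; the numbers are
# placeholders, and the "nats.resources" key path may differ in your chart.
nats:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 2Gi
```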
The process was suddenly terminating and restarting periodically, with no particular errors being logged. It successfully restored itself many times before we began to see the panic. Once that happened, the server consistently panicked on startup.
In case it's relevant, there are also a lot of these messages being logged:
[7] 2022/05/12 07:38:47.720579 [INF] 10.56.137.7:45998 - cid:993 - "v2.1.0:python3" - Slow Consumer Detected: WriteDeadline of 10s exceeded with 1 chunks of 2533 total bytes.
These are not particularly surprising to me, I know I have some messages that take a long time to process.
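As a side note, those warnings come from the server's client write deadline (the 10s in the message is the configured value), which measures how long the server will block trying to flush data to a slow-reading client socket, not how long a message takes to be acked. If slow clients are expected, the deadline can be raised in the server config; a minimal fragment:

```
# nats-server config fragment: raise the client write deadline so slow-reading
# clients get more time before being flagged (and eventually disconnected).
write_deadline: "30s"
```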
I am still looking for the edge condition which triggers the negative roll-around in accounting. I haven't found it yet, but I will; I also already have code to detect it on restart and fix it on the fly once detected. Thanks again for your patience.
Would it be possible to run a nightly build when it is staged to see if that resolves it?
If you could give the nightly Docker image, synadia/nats-server:nightly, a shot and let us know if the problem still repeats. Thanks again for the info.
After four hours of running the nightly image, the server has not restarted on its own at all. Manual restarts recover fine.
Thanks for the feedback, much appreciated. And thanks for your patience. We will most likely do a 2.8.3 release early next week.
Facing the same problem:
When I start nats-server in cluster mode with the command:
nats-server -js -c node1.config
output:
[357] 2024/05/07 16:37:33.792265 [INF] Starting nats-server
[357] 2024/05/07 16:37:33.792649 [INF] Version: 2.10.0
[357] 2024/05/07 16:37:33.792660 [INF] Git: [77e1983]
[357] 2024/05/07 16:37:33.792670 [INF] Cluster: nats-cluster
[357] 2024/05/07 16:37:33.792676 [INF] Name: payhub-nats-cluster1-node1
[357] 2024/05/07 16:37:33.792686 [INF] Node: D72AFbUz
[357] 2024/05/07 16:37:33.792692 [INF] ID: NDFJBQANKO32UMWKKVZWQDNQXJVGRRTS234ZDAAGOBMP4CE75JFM6Y55
[357] 2024/05/07 16:37:33.792811 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[357] 2024/05/07 16:37:33.792841 [INF] Using configuration file: node1.config
[357] 2024/05/07 16:37:33.793512 [INF] Starting JetStream
[357] 2024/05/07 16:37:33.794490 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __
[357] 2024/05/07 16:37:33.794506 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[357] 2024/05/07 16:37:33.794511 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[357] 2024/05/07 16:37:33.794519 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[357] 2024/05/07 16:37:33.794529 [INF]
[357] 2024/05/07 16:37:33.794538 [INF] https://docs.nats.io/jetstream
[357] 2024/05/07 16:37:33.794544 [INF]
[357] 2024/05/07 16:37:33.794549 [INF] ---------------- JETSTREAM ----------------
[357] 2024/05/07 16:37:33.794559 [INF] Max Memory: 11.91 GB
[357] 2024/05/07 16:37:33.794567 [INF] Max Storage: 82.85 MB
[357] 2024/05/07 16:37:33.794573 [INF] Store Directory: "/tmp/nats/jetstream"
[357] 2024/05/07 16:37:33.794583 [INF] -------------------------------------------
[357] 2024/05/07 16:37:33.796113 [INF] Starting restore for stream 'USERS > 103'
[357] 2024/05/07 16:37:33.798827 [INF] Restored 0 messages for stream 'USERS > 103' in 3ms
[357] 2024/05/07 16:37:33.799270 [INF] Starting restore for stream 'USERS > OTHER_TRANS'
[357] 2024/05/07 16:37:33.801027 [INF] Restored 1 messages for stream 'USERS > OTHER_TRANS' in 2ms
[357] 2024/05/07 16:37:33.801230 [INF] Recovering 1 consumers for stream - 'USERS > 103'
[357] 2024/05/07 16:37:33.802129 [INF] Recovering 1 consumers for stream - 'USERS > OTHER_TRANS'
[357] 2024/05/07 16:37:33.802821 [INF] Starting JetStream cluster
[357] 2024/05/07 16:37:33.802835 [INF] Creating JetStream metadata controller
[357] 2024/05/07 16:37:33.805499 [INF] JetStream cluster recovering state
[357] 2024/05/07 16:37:33.807114 [INF] Listening for client connections on 127.0.0.1:4221
[357] 2024/05/07 16:37:33.807157 [INF] Server is ready
[357] 2024/05/07 16:37:33.807224 [INF] Cluster name is nats-cluster
[357] 2024/05/07 16:37:33.807416 [INF] Listening for route connections on 127.0.0.1:6221
[357] 2024/05/07 16:37:33.808410 [ERR] Error trying to connect to route (attempt 1): dial tcp 127.0.0.1:6222: connect: connection refused
[357] 2024/05/07 16:37:33.808428 [ERR] Error trying to connect to route (attempt 1): dial tcp 127.0.0.1:6222: connect: connection refused
[357] 2024/05/07 16:37:33.808487 [ERR] Error trying to connect to route (attempt 1): dial tcp 127.0.0.1:6223: connect: connection refused
[357] 2024/05/07 16:37:33.808596 [ERR] Error trying to connect to route (attempt 1): dial tcp 127.0.0.1:6223: connect: connection refused
panic: runtime error: makeslice: cap out of range
goroutine 8 [running]:
github.com/nats-io/nats-server/v2/server.(*msgBlock).indexCacheBuf(0xc0003e0340, {0xc000604000, 0x310, 0x310})
github.com/nats-io/nats-server/v2/server/filestore.go:4821 +0x1aa
github.com/nats-io/nats-server/v2/server.(*msgBlock).loadMsgsWithLock(0xc0003e0340)
github.com/nats-io/nats-server/v2/server/filestore.go:5203 +0x489
github.com/nats-io/nats-server/v2/server.(*msgBlock).fetchMsg(0xc0003e0340, 0xe, 0xb04040?)
github.com/nats-io/nats-server/v2/server/filestore.go:5233 +0xcf
github.com/nats-io/nats-server/v2/server.(*fileStore).msgForSeq(0xc0003ea000, 0xb04040?, 0x7f38dc762c01?)
github.com/nats-io/nats-server/v2/server/filestore.go:5389 +0x145
github.com/nats-io/nats-server/v2/server.(*fileStore).LoadMsg(0xc000080000?, 0xc000602090?, 0x0?)
github.com/nats-io/nats-server/v2/server/filestore.go:5528 +0x13
github.com/nats-io/nats-server/v2/server.(*raft).loadEntry(0xc0003ea300, 0xb23240?)
github.com/nats-io/nats-server/v2/server/raft.go:2535 +0x46
github.com/nats-io/nats-server/v2/server.(*raft).loadFirstEntry(0xc0003ea300)
github.com/nats-io/nats-server/v2/server/raft.go:2336 +0x55
github.com/nats-io/nats-server/v2/server.(*Server).startRaftNode(0xc000186d80, {0xc0001ffdc8, 0x5}, 0xc00050c368, 0xc00050c900)
github.com/nats-io/nats-server/v2/server/raft.go:441 +0xdd6
github.com/nats-io/nats-server/v2/server.(*jetStream).createRaftGroup(0xc000252000, {0xc0001ffdc8, 0x5}, 0xc0000969c0, 0x16, 0x1?)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:2059 +0xa2b
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateConsumer(0xc000252000, 0xc00019d3b0, 0x0, 0x1)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:4123 +0x67b
github.com/nats-io/nats-server/v2/server.(*jetStream).processConsumerAssignment(0xc000252000, 0xc00019d3b0)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:3977 +0x8c5
github.com/nats-io/nats-server/v2/server.(*jetStream).applyMetaSnapshot(0xc000252000, {0xc0002ca036, 0x49e, 0x4a7}, 0xc00050d998, 0x1)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:1593 +0x1087
github.com/nats-io/nats-server/v2/server.(*jetStream).applyMetaEntries(0xc000252000, {0xc000280038, 0x1, 0x0?}, 0xc00050d998)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:1853 +0xb13
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorCluster(0xc000252000)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:1344 +0xb5c
github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine.func1()
github.com/nats-io/nats-server/v2/server/server.go:3609 +0x17f
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine in goroutine 1
github.com/nats-io/nats-server/v2/server/server.go:3605 +0x249
My config file is:
server_name: payhub-nats-cluster1-node1
listen: 127.0.0.1:4221
accounts {
  SYS {
    users = [ {user: "avanzasys", pass: "avanza123"} ]
  },
  USERS {
    jetstream: enabled,
    users = [ {user: "bbkavanza", pass: "$2a$11$cHEH6ZBt9627ppEHBsBIE.TiI51Ng7PjA9n/gWEMvZkR4ntMAiM2O"} ]
  }
}
system_account: SYS
jetstream {
  max_file_store: 86870912
}
cluster {
  name: nats-cluster
  listen: 127.0.0.1:6221
  routes: [
    nats://127.0.0.1:6222
    nats://127.0.0.1:6223
  ]
}
But when I comment out the cluster portion of the config file, it works fine.
Defect
Make sure that these boxes are checked before submitting your issue -- thank you!
nats-server -DV output:
Versions of nats-server and affected client libraries used: Messages were sent and consumed using nats.py 2.1.0.
OS/Container environment: GKE
Steps or code to reproduce the issue:
Expected result: Not to panic, to recover gracefully, or at least to provide a more instructive error message.
Actual result: