Closed ajlane closed 2 years ago
Thanks, will take a look and make sure it's either already fixed or will be in the next release.
What hardware were you running on? Does it crash every time you start the server? Would you be willing to share your storage directory so we can reproduce here?
I have an idea of what could be happening, but I'm curious which of two theories is correct for your use case.
Thanks.
The node was an n1-standard-8. It was in a crash loop that never recovered.
I don't have the original storage directory, because I've since re-installed and restarted everything - I'll try to reproduce it.
One thing I failed to mention is that I am also updating a small number of values in a kvstore for each message, in case that's relevant.
OK, thanks. It might be good to upgrade to the latest. We are stabilizing quickly, but the use cases for the tech are so diverse and the usage so high that we are going as fast as we can to address any issues.
I reproduced the error. This is the contents of the data folder: jetstream.zip
That's super helpful. Thank you. Will jump on this tomorrow and keep you posted.
The index file seems corrupt. What steps did you take to reproduce? And just double-checking: this is server version 2.8.2, yes?
Also, for business continuity: if you remove the index file for blk 63, so 63.idx, the system should repair itself. I am trying to hunt down where we corrupt the idx file. It's not a filesystem thing; I believe we wrap negative for msgs and bytes.
Do you use erase vs delete for removing messages?
Also do you know if you call Truncate()?
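For anyone else hitting this, the repair step above can be sketched as follows. This uses a scratch directory to stand in for the real message-block directory, since the actual path depends on your store directory, account, and stream name (something like <store_dir>/jetstream/$G/streams/<stream>/msgs for the global account):

```shell
# Sketch of the suggested repair. Stop the server first; then remove only the
# index file for the corrupt block. The paths below are stand-ins created in a
# scratch directory, not the real store layout.
msgs_dir=$(mktemp -d)
touch "$msgs_dir/63.blk" "$msgs_dir/63.idx"   # stand-ins for the real block files

# Remove only the index file; the server should rebuild the index from 63.blk
# on the next start.
rm "$msgs_dir/63.idx"

ls "$msgs_dir"   # only 63.blk remains
```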
Interesting.
The reproduced error was also on 2.8.0, because that is the latest version installed by helm. I will attempt to replicate with 2.8.2.
I don't remove any messages, as far as I know. Maybe nats.py does something surprising. I subscribe to a handful of queues from a swarm of workers. Some messages take quite some time to be ack'd, but otherwise I don't think there's anything complicated happening.
To reproduce, I just re-installed from scratch and deployed all of my consumers. After a couple of hours, I shut down the node to force it to move to a new node. I suspect this might be a red herring, though; the server could have been in a bad state before I shut down the node. I'll try to be more observant the next time around.
I'll also look at trying to produce a cut-down version of my nats.py code to share.
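In the meantime, a hedged, cut-down sketch of that workload might look like the following: queue-subscribed workers that process slowly, write a few KV entries per message, and ack manually. Subject, queue, and bucket names here are invented for illustration, and this assumes the nats.py 2.x API:

```python
# Hypothetical reduction of the workload described above, not the user's actual
# code. Names ("jobs.>", "workers", "worker-state") are made up.
import asyncio

async def run_worker(server_url: str = "nats://127.0.0.1:4222") -> None:
    import nats  # nats.py client; imported here so the sketch stays self-contained

    nc = await nats.connect(server_url)
    js = nc.jetstream()
    kv = await js.create_key_value(bucket="worker-state")  # hypothetical bucket

    async def handle(msg):
        await asyncio.sleep(30)                 # stand-in for slow processing
        await kv.put("last_subject", msg.subject.encode())
        await msg.ack()                         # ack only once the work is done

    # Push consumer shared across the worker swarm via a queue group.
    await js.subscribe("jobs.>", queue="workers", cb=handle, manual_ack=True)
    await asyncio.Event().wait()                # keep the worker alive

# In a real worker this would be started with: asyncio.run(run_worker())
```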
No worries, I think I have a clue. Working on a test case now.
Unfortunately, I can confirm that the error persists in 2.8.2, even if I leave the node alone.
So a restart is not needed? Or is your system still doing automated server shutdowns on its own?
I've just had a chance to look through the logs to see what happened.
The server was under-resourced. I wasn't setting resources.requests and resources.limits with helm, so it was receiving only token amounts of CPU and memory despite there being plenty of available resources on the node. Because the server is so light and efficient, this wasn't immediately obvious.
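For anyone else who hits this, the fix is to give the pod explicit requests and limits in the Helm values. A minimal sketch, with the caveat that the exact key path for container resources differs between versions of the NATS Helm chart, so the names below are assumptions to verify against your chart's values.yaml:

```yaml
# Hypothetical values.yaml fragment for the NATS Helm chart; the numbers are
# placeholders, and the "nats.resources" key path may differ in your chart.
nats:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 2Gi
```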
The process was suddenly terminating and restarting periodically, with no particular errors being logged. It successfully restored itself many times before we began to see the panic. Once that happened, the server consistently panicked on startup.
In case it's relevant, there are also a lot of these messages being logged:
[7] 2022/05/12 07:38:47.720579 [INF] 10.56.137.7:45998 - cid:993 - "v2.1.0:python3" - Slow Consumer Detected: WriteDeadline of 10s exceeded with 1 chunks of 2533 total bytes.
These are not particularly surprising to me, I know I have some messages that take a long time to process.
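As a side note, those warnings come from the server's client write deadline (the 10s in the message is the configured value), which measures how long the server will block trying to flush data to a slow-reading client socket, not how long a message takes to be acked. If slow clients are expected, the deadline can be raised in the server config; a minimal fragment:

```
# nats-server config fragment: raise the client write deadline so slow-reading
# clients get more time before being flagged (and eventually disconnected).
write_deadline: "30s"
```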
I am still looking for the edge condition which triggers the negative roll-around in accounting. I haven't found it yet, but I will; I also already have code to detect it on restart and fix it on the fly once detected. Thanks again for your patience.
Would it be possible to run a nightly build when it is staged to see if that resolves it?
If you could give the nightly Docker image, synadia/nats-server:nightly, a shot and let us know if the problem still repeats. Thanks again for the info.
After four hours of running the nightly image, the server has not restarted on its own at all. Manual restarts recover fine.
Thanks for the feedback, much appreciated. And thanks for your patience. We will most likely do a 2.8.3 release early next week.
Facing the same problem:
When I start nats-server in cluster mode with the command:
nats-server -js -c node1.config
output:
[357] 2024/05/07 16:37:33.792265 [INF] Starting nats-server
[357] 2024/05/07 16:37:33.792649 [INF] Version: 2.10.0
[357] 2024/05/07 16:37:33.792660 [INF] Git: [77e1983]
[357] 2024/05/07 16:37:33.792670 [INF] Cluster: nats-cluster
[357] 2024/05/07 16:37:33.792676 [INF] Name: payhub-nats-cluster1-node1
[357] 2024/05/07 16:37:33.792686 [INF] Node: D72AFbUz
[357] 2024/05/07 16:37:33.792692 [INF] ID: NDFJBQANKO32UMWKKVZWQDNQXJVGRRTS234ZDAAGOBMP4CE75JFM6Y55
[357] 2024/05/07 16:37:33.792811 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[357] 2024/05/07 16:37:33.792841 [INF] Using configuration file: node1.config
[357] 2024/05/07 16:37:33.793512 [INF] Starting JetStream
[357] 2024/05/07 16:37:33.794490 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __
[357] 2024/05/07 16:37:33.794506 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[357] 2024/05/07 16:37:33.794511 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[357] 2024/05/07 16:37:33.794519 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[357] 2024/05/07 16:37:33.794529 [INF]
[357] 2024/05/07 16:37:33.794538 [INF] https://docs.nats.io/jetstream
[357] 2024/05/07 16:37:33.794544 [INF]
[357] 2024/05/07 16:37:33.794549 [INF] ---------------- JETSTREAM ----------------
[357] 2024/05/07 16:37:33.794559 [INF] Max Memory: 11.91 GB
[357] 2024/05/07 16:37:33.794567 [INF] Max Storage: 82.85 MB
[357] 2024/05/07 16:37:33.794573 [INF] Store Directory: "/tmp/nats/jetstream"
[357] 2024/05/07 16:37:33.794583 [INF] -------------------------------------------
[357] 2024/05/07 16:37:33.796113 [INF] Starting restore for stream 'USERS > 103'
[357] 2024/05/07 16:37:33.798827 [INF] Restored 0 messages for stream 'USERS > 103' in 3ms
[357] 2024/05/07 16:37:33.799270 [INF] Starting restore for stream 'USERS > OTHER_TRANS'
[357] 2024/05/07 16:37:33.801027 [INF] Restored 1 messages for stream 'USERS > OTHER_TRANS' in 2ms
[357] 2024/05/07 16:37:33.801230 [INF] Recovering 1 consumers for stream - 'USERS > 103'
[357] 2024/05/07 16:37:33.802129 [INF] Recovering 1 consumers for stream - 'USERS > OTHER_TRANS'
[357] 2024/05/07 16:37:33.802821 [INF] Starting JetStream cluster
[357] 2024/05/07 16:37:33.802835 [INF] Creating JetStream metadata controller
[357] 2024/05/07 16:37:33.805499 [INF] JetStream cluster recovering state
[357] 2024/05/07 16:37:33.807114 [INF] Listening for client connections on 127.0.0.1:4221
[357] 2024/05/07 16:37:33.807157 [INF] Server is ready
[357] 2024/05/07 16:37:33.807224 [INF] Cluster name is nats-cluster
[357] 2024/05/07 16:37:33.807416 [INF] Listening for route connections on 127.0.0.1:6221
[357] 2024/05/07 16:37:33.808410 [ERR] Error trying to connect to route (attempt 1): dial tcp 127.0.0.1:6222: connect: connection refused
[357] 2024/05/07 16:37:33.808428 [ERR] Error trying to connect to route (attempt 1): dial tcp 127.0.0.1:6222: connect: connection refused
[357] 2024/05/07 16:37:33.808487 [ERR] Error trying to connect to route (attempt 1): dial tcp 127.0.0.1:6223: connect: connection refused
[357] 2024/05/07 16:37:33.808596 [ERR] Error trying to connect to route (attempt 1): dial tcp 127.0.0.1:6223: connect: connection refused
panic: runtime error: makeslice: cap out of range
goroutine 8 [running]:
github.com/nats-io/nats-server/v2/server.(*msgBlock).indexCacheBuf(0xc0003e0340, {0xc000604000, 0x310, 0x310})
github.com/nats-io/nats-server/v2/server/filestore.go:4821 +0x1aa
github.com/nats-io/nats-server/v2/server.(*msgBlock).loadMsgsWithLock(0xc0003e0340)
github.com/nats-io/nats-server/v2/server/filestore.go:5203 +0x489
github.com/nats-io/nats-server/v2/server.(*msgBlock).fetchMsg(0xc0003e0340, 0xe, 0xb04040?)
github.com/nats-io/nats-server/v2/server/filestore.go:5233 +0xcf
github.com/nats-io/nats-server/v2/server.(*fileStore).msgForSeq(0xc0003ea000, 0xb04040?, 0x7f38dc762c01?)
github.com/nats-io/nats-server/v2/server/filestore.go:5389 +0x145
github.com/nats-io/nats-server/v2/server.(*fileStore).LoadMsg(0xc000080000?, 0xc000602090?, 0x0?)
github.com/nats-io/nats-server/v2/server/filestore.go:5528 +0x13
github.com/nats-io/nats-server/v2/server.(*raft).loadEntry(0xc0003ea300, 0xb23240?)
github.com/nats-io/nats-server/v2/server/raft.go:2535 +0x46
github.com/nats-io/nats-server/v2/server.(*raft).loadFirstEntry(0xc0003ea300)
github.com/nats-io/nats-server/v2/server/raft.go:2336 +0x55
github.com/nats-io/nats-server/v2/server.(*Server).startRaftNode(0xc000186d80, {0xc0001ffdc8, 0x5}, 0xc00050c368, 0xc00050c900)
github.com/nats-io/nats-server/v2/server/raft.go:441 +0xdd6
github.com/nats-io/nats-server/v2/server.(*jetStream).createRaftGroup(0xc000252000, {0xc0001ffdc8, 0x5}, 0xc0000969c0, 0x16, 0x1?)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:2059 +0xa2b
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateConsumer(0xc000252000, 0xc00019d3b0, 0x0, 0x1)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:4123 +0x67b
github.com/nats-io/nats-server/v2/server.(*jetStream).processConsumerAssignment(0xc000252000, 0xc00019d3b0)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:3977 +0x8c5
github.com/nats-io/nats-server/v2/server.(*jetStream).applyMetaSnapshot(0xc000252000, {0xc0002ca036, 0x49e, 0x4a7}, 0xc00050d998, 0x1)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:1593 +0x1087
github.com/nats-io/nats-server/v2/server.(*jetStream).applyMetaEntries(0xc000252000, {0xc000280038, 0x1, 0x0?}, 0xc00050d998)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:1853 +0xb13
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorCluster(0xc000252000)
github.com/nats-io/nats-server/v2/server/jetstream_cluster.go:1344 +0xb5c
github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine.func1()
github.com/nats-io/nats-server/v2/server/server.go:3609 +0x17f
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine in goroutine 1
github.com/nats-io/nats-server/v2/server/server.go:3605 +0x249
My config file is:
server_name: payhub-nats-cluster1-node1
listen: 127.0.0.1:4221
accounts {
  SYS {
    users = [ {user: "avanzasys", pass: "avanza123"} ]
  },
  USERS {
    jetstream: enabled,
    users = [ {user: "bbkavanza", pass: "$2a$11$cHEH6ZBt9627ppEHBsBIE.TiI51Ng7PjA9n/gWEMvZkR4ntMAiM2O"} ]
  }
}
system_account: SYS
jetstream {
  max_file_store: 86870912
}
cluster {
  name: nats-cluster
  listen: 127.0.0.1:6221
  routes: [
    nats://127.0.0.1:6222
    nats://127.0.0.1:6223
  ]
}
But when I comment out the cluster portion of the config file, it works fine.
Defect
Make sure that these boxes are checked before submitting your issue -- thank you!
nats-server -DV output:
Versions of nats-server and affected client libraries used: Messages were sent and consumed using nats.py 2.1.0.
OS/Container environment: GKE
Steps or code to reproduce the issue:
Expected result: Not to panic, to recover gracefully, or at least to provide a more instructive error message.
Actual result: