Open kamilhalat opened 11 months ago
Might suggest upgrading to 2.9.21.
Hi, sadly it happened again on version 2.9.21. I don't have the full log at the moment, only the last 50 lines from each node:
node-0:
[7] 2023/09/14 16:32:49.460496 [INF] Starting nats-server
[7] 2023/09/14 16:32:49.460528 [INF] Version: 2.9.21
[7] 2023/09/14 16:32:49.460530 [INF] Git: [b2e7725]
[7] 2023/09/14 16:32:49.460532 [DBG] Go build: go1.19.12
[7] 2023/09/14 16:32:49.460533 [INF] Cluster: nats
[7] 2023/09/14 16:32:49.460535 [INF] Name: nats-0
[7] 2023/09/14 16:32:49.460539 [INF] Node: S1Nunr6R
[7] 2023/09/14 16:32:49.460541 [INF] ID: NDOBT43MQZYHTSUG7US2XNLTJCNELF3C3YTVWIVQ6G76D6JIQ27JRRO7
[7] 2023/09/14 16:32:49.460873 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/09/14 16:32:49.461533 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/09/14 16:32:49.461611 [INF] Starting JetStream
[7] 2023/09/14 16:32:49.467808 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __
[7] 2023/09/14 16:32:49.467821 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[7] 2023/09/14 16:32:49.467823 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[7] 2023/09/14 16:32:49.467825 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[7] 2023/09/14 16:32:49.467826 [INF]
[7] 2023/09/14 16:32:49.467828 [INF] https://docs.nats.io/jetstream
[7] 2023/09/14 16:32:49.467829 [INF]
[7] 2023/09/14 16:32:49.467831 [INF] ---------------- JETSTREAM ----------------
[7] 2023/09/14 16:32:49.467915 [INF] Max Memory: 1.00 GB
[7] 2023/09/14 16:32:49.467924 [INF] Max Storage: 8.00 GB
[7] 2023/09/14 16:32:49.467926 [INF] Store Directory: "/data/jetstream"
[7] 2023/09/14 16:32:49.467928 [INF] -------------------------------------------
[7] 2023/09/14 16:32:49.468086 [DBG] Exports:
[7] 2023/09/14 16:32:49.468093 [DBG] $JS.API.>
[7] 2023/09/14 16:32:49.468286 [DBG] Enabled JetStream for account "test"
[7] 2023/09/14 16:32:49.468296 [DBG] Max Memory: -1 B
[7] 2023/09/14 16:32:49.468298 [DBG] Max Storage: -1 B
[7] 2023/09/14 16:32:49.468313 [DBG] Recovering JetStream state for account "test"
[7] 2023/09/14 16:32:49.474190 [INF] Starting restore for stream 'test > STREAM1'
[7] 2023/09/14 16:32:49.479714 [INF] Restored 0 messages for stream 'test > STREAM1'
[7] 2023/09/14 16:32:49.480039 [INF] Starting restore for stream 'test > STREAM2'
[7] 2023/09/14 16:32:49.480473 [INF] Restored 0 messages for stream 'test > STREAM2'
[7] 2023/09/14 16:32:49.480684 [DBG] JetStream state for account "test" recovered
[7] 2023/09/14 16:32:49.480708 [INF] Starting JetStream cluster
[7] 2023/09/14 16:32:49.480715 [DBG] JetStream cluster checking for stable cluster name and peers
[7] 2023/09/14 16:32:49.480731 [INF] Creating JetStream metadata controller
[7] 2023/09/14 16:32:49.490203 [INF] JetStream cluster recovering state
[7] 2023/09/14 16:32:49.495625 [WRN] RAFT [S1Nunr6R - _meta_] Snapshot corrupt, checksums did not match
[7] 2023/09/14 16:32:49.495691 [DBG] RAFT [S1Nunr6R - _meta_] Started
[7] 2023/09/14 16:32:49.495755 [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] 2023/09/14 16:32:49.495760 [DBG] Get non local IPs for "0.0.0.0"
[7] 2023/09/14 16:32:49.495885 [DBG] ip=x.x.x.x
[7] 2023/09/14 16:32:49.496029 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2023/09/14 16:32:49.496033 [DBG] Get non local IPs for "0.0.0.0"
[7] 2023/09/14 16:32:49.496129 [DBG] ip=x.x.x.x
[7] 2023/09/14 16:32:49.496135 [INF] Server is ready
[7] 2023/09/14 16:32:49.496327 [DBG] maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
[7] 2023/09/14 16:32:49.496418 [DBG] Starting metadata monitor
[7] 2023/09/14 16:32:49.496445 [DBG] Recovered JetStream cluster metadata
node-1:
[7] 2023/09/14 16:32:51.314526 [INF] Starting nats-server
[7] 2023/09/14 16:32:51.314569 [INF] Version: 2.9.21
[7] 2023/09/14 16:32:51.314571 [INF] Git: [b2e7725]
[7] 2023/09/14 16:32:51.314574 [DBG] Go build: go1.19.12
[7] 2023/09/14 16:32:51.314575 [INF] Cluster: nats
[7] 2023/09/14 16:32:51.314577 [INF] Name: nats-1
[7] 2023/09/14 16:32:51.314580 [INF] Node: yrzKKRBu
[7] 2023/09/14 16:32:51.314582 [INF] ID: NA2IKGH4ZTUDY3QT5LHB3ZJ35RZEVR6DIZ4MD5M2XCTCHW6O6ZE6L6G4
[7] 2023/09/14 16:32:51.314618 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/09/14 16:32:51.315188 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/09/14 16:32:51.315218 [INF] Starting JetStream
[7] 2023/09/14 16:32:51.329212 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __
[7] 2023/09/14 16:32:51.329221 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[7] 2023/09/14 16:32:51.329226 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[7] 2023/09/14 16:32:51.329228 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[7] 2023/09/14 16:32:51.329230 [INF]
[7] 2023/09/14 16:32:51.329232 [INF] https://docs.nats.io/jetstream
[7] 2023/09/14 16:32:51.329233 [INF]
[7] 2023/09/14 16:32:51.329235 [INF] ---------------- JETSTREAM ----------------
[7] 2023/09/14 16:32:51.329242 [INF] Max Memory: 1.00 GB
[7] 2023/09/14 16:32:51.329245 [INF] Max Storage: 8.00 GB
[7] 2023/09/14 16:32:51.329247 [INF] Store Directory: "/data/jetstream"
[7] 2023/09/14 16:32:51.329248 [INF] -------------------------------------------
[7] 2023/09/14 16:32:51.329343 [DBG] Exports:
[7] 2023/09/14 16:32:51.329346 [DBG] $JS.API.>
[7] 2023/09/14 16:32:51.339013 [DBG] Enabled JetStream for account "test"
[7] 2023/09/14 16:32:51.339024 [DBG] Max Memory: -1 B
[7] 2023/09/14 16:32:51.339027 [DBG] Max Storage: -1 B
[7] 2023/09/14 16:32:51.339042 [DBG] Recovering JetStream state for account "test"
[7] 2023/09/14 16:32:51.350189 [INF] Starting restore for stream 'test > STREAM3'
[7] 2023/09/14 16:32:52.008518 [INF] Restored 1,000,000 messages for stream 'test > STREAM3'
[7] 2023/09/14 16:32:52.020624 [INF] Starting restore for stream 'test > STREAM5'
[7] 2023/09/14 16:32:52.356902 [INF] Restored 367,730 messages for stream 'test > STREAM5'
[7] 2023/09/14 16:32:52.374432 [INF] Starting restore for stream 'test > STREAM4'
[7] 2023/09/14 16:32:52.546136 [INF] Restored 158,887 messages for stream 'test > STREAM4'
[7] 2023/09/14 16:32:52.550342 [INF] Starting restore for stream 'test > PING'
[7] 2023/09/14 16:32:52.554642 [INF] Restored 0 messages for stream 'test > PING'
[7] 2023/09/14 16:32:52.558290 [INF] Recovering 4 consumers for stream - 'test > STREAM3'
[7] 2023/09/14 16:32:52.599885 [INF] Recovering 8 consumers for stream - 'test > STREAM5'
[7] 2023/09/14 16:32:52.642655 [INF] Recovering 4 consumers for stream - 'test > STREAM4'
[7] 2023/09/14 16:32:52.650982 [INF] Recovering 1 consumers for stream - 'test > PING'
[7] 2023/09/14 16:32:52.654863 [DBG] JetStream state for account "test" recovered
[7] 2023/09/14 16:32:52.654950 [INF] Starting JetStream cluster
[7] 2023/09/14 16:32:52.654953 [DBG] JetStream cluster checking for stable cluster name and peers
[7] 2023/09/14 16:32:52.654956 [INF] Creating JetStream metadata controller
[7] 2023/09/14 16:32:52.668319 [INF] JetStream cluster recovering state
[7] 2023/09/14 16:32:52.672809 [DBG] RAFT [yrzKKRBu - _meta_] Started
[7] 2023/09/14 16:32:52.673016 [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] 2023/09/14 16:32:52.673051 [DBG] Starting metadata monitor
[7] 2023/09/14 16:32:52.673999 [DBG] Get non local IPs for "0.0.0.0"
node-2:
[7] 2023/09/14 16:32:50.377838 [INF] Starting nats-server
[7] 2023/09/14 16:32:50.377886 [INF] Version: 2.9.21
[7] 2023/09/14 16:32:50.377899 [INF] Git: [b2e7725]
[7] 2023/09/14 16:32:50.377911 [DBG] Go build: go1.19.12
[7] 2023/09/14 16:32:50.377922 [INF] Cluster: nats
[7] 2023/09/14 16:32:50.377934 [INF] Name: nats-2
[7] 2023/09/14 16:32:50.377947 [INF] Node: cnrtt3eg
[7] 2023/09/14 16:32:50.377958 [INF] ID: NDXDU7MK6I7GMZR5GVBM422IXA2CJA4E5MPOBKJOHT3KIUILKPPAVZHM
[7] 2023/09/14 16:32:50.377994 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/09/14 16:32:50.378512 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/09/14 16:32:50.378582 [INF] Starting JetStream
[7] 2023/09/14 16:32:50.391948 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __
[7] 2023/09/14 16:32:50.391996 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[7] 2023/09/14 16:32:50.392059 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[7] 2023/09/14 16:32:50.392094 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[7] 2023/09/14 16:32:50.392108 [INF]
[7] 2023/09/14 16:32:50.392122 [INF] https://docs.nats.io/jetstream
[7] 2023/09/14 16:32:50.392133 [INF]
[7] 2023/09/14 16:32:50.392147 [INF] ---------------- JETSTREAM ----------------
[7] 2023/09/14 16:32:50.392177 [INF] Max Memory: 1.00 GB
[7] 2023/09/14 16:32:50.392201 [INF] Max Storage: 8.00 GB
[7] 2023/09/14 16:32:50.392221 [INF] Store Directory: "/data/jetstream"
[7] 2023/09/14 16:32:50.392233 [INF] -------------------------------------------
[7] 2023/09/14 16:32:50.392348 [DBG] Exports:
[7] 2023/09/14 16:32:50.392379 [DBG] $JS.API.>
[7] 2023/09/14 16:32:50.400506 [DBG] Enabled JetStream for account "test"
[7] 2023/09/14 16:32:50.400554 [DBG] Max Memory: -1 B
[7] 2023/09/14 16:32:50.400571 [DBG] Max Storage: -1 B
[7] 2023/09/14 16:32:50.400605 [DBG] Recovering JetStream state for account "test"
[7] 2023/09/14 16:32:50.412566 [INF] Starting restore for stream 'test > STREAM6'
[7] 2023/09/14 16:32:50.551325 [INF] Restored 21 messages for stream 'test > STREAM6'
[7] 2023/09/14 16:32:50.555737 [INF] Recovering 1 consumers for stream - 'test > STREAM6'
[7] 2023/09/14 16:32:50.565477 [DBG] JetStream state for account "test" recovered
[7] 2023/09/14 16:32:50.565551 [INF] Starting JetStream cluster
[7] 2023/09/14 16:32:50.565559 [DBG] JetStream cluster checking for stable cluster name and peers
[7] 2023/09/14 16:32:50.565561 [INF] Creating JetStream metadata controller
[7] 2023/09/14 16:32:50.572642 [INF] JetStream cluster recovering state
[7] 2023/09/14 16:32:50.573388 [WRN] RAFT [cnrtt3eg - _meta_] Snapshot corrupt, too short
[7] 2023/09/14 16:32:50.573511 [DBG] RAFT [cnrtt3eg - _meta_] Started
[7] 2023/09/14 16:32:50.573624 [DBG] Starting metadata monitor
[7] 2023/09/14 16:32:50.573694 [DBG] Recovered JetStream cluster metadata
[7] 2023/09/14 16:32:50.573696 [DBG] JetStream cluster checking for orphans
[7] 2023/09/14 16:32:50.573715 [WRN] Detected orphaned stream 'test > STREAM6', will cleanup
[7] 2023/09/14 16:32:50.573631 [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] 2023/09/14 16:32:50.573859 [DBG] Get non local IPs for "0.0.0.0"
[7] 2023/09/14 16:32:50.573985 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2023/09/14 16:32:50.574062 [DBG] ip=y.y.y.y
[7] 2023/09/14 16:32:50.574295 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2023/09/14 16:32:50.574304 [DBG] Get non local IPs for "0.0.0.0"
[7] 2023/09/14 16:32:50.574493 [DBG] ip=y.y.y.y
I tried to reproduce it by sending lots of messages (10,000,000) and restarting AKS at the same time, but it didn't happen. Do you have any other ideas on how to reproduce it?
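For what it's worth, a reproduction attempt along those lines could be sketched with the nats CLI and kubectl (the subject name, publisher count, and statefulset name below are hypothetical, not from this report):

```shell
# Flood a JetStream stream with ~10M messages in the background...
nats bench repro.subject --js --pub 4 --msgs 10000000 &

# ...and restart the NATS pods mid-publish to simulate the AKS restart.
kubectl rollout restart statefulset nats
```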
Hi all, I wanted to add an update to this. With server version 2.10.12, this issue is much harder to reproduce, but it is still present.
I've attached server logs with debug tracing.
We are still seeing "[WRN] RAFT [cnrtt3eg - _meta_] Snapshot corrupt, too short" when this happens:
[7] [INF] Creating JetStream metadata controller
[7] [WRN] Filestore [_meta_] Stream state too short (0 bytes)
[7] [INF] JetStream cluster recovering state
[7] [WRN] RAFT [cnrtt3eg - _meta_] Snapshot corrupt, too short
[7] [DBG] RAFT [cnrtt3eg - _meta_] Started
[7] [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] [DBG] Get non local IPs for "0.0.0.0"
[7] [DBG] ip=172.16.0.43
[7] [INF] Listening for MQTT clients on mqtt://0.0.0.0:1883
[7] [INF] Listening for client connections on 0.0.0.0:4222
In some cases, the corruption warning is seen even when no streams are configured on the server.
Note that there are a very large number of leaf nodes in this setup.
The backing file system is Azure storage (the AKS PFC), so I suspect the file system itself could be the cause. Setting the sync_interval value in the configuration to force writes more aggressively, using a lower value (e.g. 1s) or even "always", might remedy this, but it comes at a non-trivial performance cost. Any thoughts there would be appreciated.
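For reference, a minimal sketch of what that configuration change could look like in nats.conf (assuming the JetStream sync_interval option; the store limits are taken from the logs above, the interval values are illustrative):

```conf
jetstream {
  store_dir: "/data/jetstream"
  max_memory_store: 1GB
  max_file_store: 8GB

  # Force fsync of stream state more aggressively than the default.
  # "always" syncs on every write: safest, but slowest.
  sync_interval: "1s"
  # sync_interval: "always"
}
```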
Is there any additional information you'd like to help debug this? Happy to hop on a call to discuss further; we can touch base over Slack.
2/4/2024 UPDATE: MQTT is enabled, so I'm thinking this may at times be occurring with the internal MQTT streams.
Do we have a reproducible test case or is this only seen in the user's environment?
This is only seen in the user's environment at the moment; it takes many iterations of stopping/restarting the server to reproduce.
Do they have the option of using storage local to the node instead?
I'll check and let you know. The user's policy is to shut down the entire cluster nightly; we can look into local storage and into incorporating snapshots/restores into their procedures.
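As a sketch, per-stream snapshots could be scripted with the nats CLI around the nightly shutdown (the stream and directory names here are illustrative, not from this report):

```shell
# Snapshot a stream (config + data) to a local directory before shutdown.
nats stream backup STREAM1 /backups/STREAM1

# After a restart that lost state, restore the stream from that backup.
nats stream restore /backups/STREAM1
```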
OK. A short read means either the storage system did not write properly and truncated the file, or it truncated a read.
Unfortunately, using local storage in the container won't work for this case. We did verify that the best available storage class is being used in AKS. We'll experiment with sync_interval, and if worse comes to worst, deploy a NATS cluster outside of k8s on VMs. Additional suggestions would be appreciated.
Defect
On AKS restarts (not every time), on startup all streams are getting flagged as orphaned and are deleted:
nats-0 logs:
nats-1 logs:
nats-2 logs:
IPs x.x.x.x, y.y.y.y and z.z.z.z were changed for anonymity.
nats-server -DV output: sadly, still waiting for it to happen on a NATS server with trace and debug enabled.
Versions of nats-server and affected client libraries used:
Latest tested version: 2.9.20-alpine
First encountered: 2.9.16-alpine
OS/Container environment:
AKS, helm values:
Steps or code to reproduce the issue:
Expected result:
Actual result: