stitchcula opened 5 months ago
After cutting the power, could you snapshot the store directory and share with us privately after reboot but before you restart the nats-server?
Also, what size payloads have you set the system to use? I see it is complaining they are over 8MB, which we strongly discourage.
Or provide more information on the number of messages inbound into the WQ, how big they are, and how the consumers are operating. Does the stream usually have messages, or do the consumers usually consume messages quickly and keep the number of messages very low or at zero?
Same problem on KV tests when we reboot the container by draining the Docker node or rebooting the VM: https://github.com/nats-io/nats-server/issues/5205
I have encountered this problem. The consumer's sequence exceeds the stream's, so I cannot get new messages from the stream until the stream's sequence catches up to the consumer's.
We are trying to reproduce, so any and all information on how you are interacting with the stream is helpful.
How fast are messages being sent into the system? How fast are they being consumed?
In general, is the stream near or at empty most of the time?
In our case, 1 msg per second was sent; the cluster has 5 nodes.
We drained the nodes hosting the meta leader and the stream leader, waited 2 minutes, and activated the nodes again. Afterwards, the consumer didn't see all messages (some messages were still being pulled, but jetstream_consumer_num_pending kept increasing).
We also have a problem when we drain a non-leader node: afterwards, jetstream_stream_last_seq on the drained node is lower than on the non-drained nodes, and the difference grows over time.
Thanks everyone for sharing the symptoms of this issue; we're currently looking for a reproducible example that trips this condition.
@stitchcula I see in the logs that you have a warning of Maximum payloads over 8.00 MB; do you have an idea of the size of the messages you are sending, or whether persisted messages are close to 8MB?
We have standard 1 MB payloads. We test on versions from master back to 2.10.5, and every time we restart the Docker server the problems with consumers and with the last sequence reproduce.
In my testing, the payload size is around 50KB, with a consumption rate of 100 messages per second. The consumers usually consume messages quickly and keep the number of messages very low.
To avoid incidents in production, I temporarily forked a branch and made an aggressive fix: if consumer initialization finds the ack floor higher than the stream's lastSeq, it forcibly resets it to lastSeq.
This allows the consumer to consume normally, but data loss occurs. I believe the fundamental fix is improving the stream's persistence mechanism so that a power outage does not lose the cursor.
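For reference, a minimal sketch of that clamping idea; the types and names below are purely illustrative, not the actual fork or nats-server internals:

package main

import "fmt"

// consumerState and streamState are stand-in types for illustration only.
type consumerState struct{ AckFloor uint64 }
type streamState struct{ LastSeq uint64 }

// clampAckFloor forcibly resets the consumer's ack floor to the stream's
// last sequence when it is found to be ahead of it. The consumer can then
// resume delivery, at the cost of silently skipping whatever messages the
// stream no longer knows about (i.e. accepted data loss).
func clampAckFloor(cs *consumerState, ss *streamState) bool {
	if cs.AckFloor > ss.LastSeq {
		cs.AckFloor = ss.LastSeq
		return true
	}
	return false
}

func main() {
	cs := &consumerState{AckFloor: 247}
	ss := &streamState{LastSeq: 26}
	if clampAckFloor(cs, ss) {
		fmt.Printf("ack floor reset from 247 to %d\n", cs.AckFloor)
	}
}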
snapshot the store directory and share with us privately after reboot but before you restart the nats-server
I have already taken a snapshot of the store directory after rebooting but before restarting the nats-server. How can I share it with you?
Thanks for the hard work!
@stitchcula you can send to wally@nats.io if possible?
We believe this is now fixed, and will be part of 2.10.17 release.
I hit the exact same issue when I tested the behaviour on power outage with 2.10.21.
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.627588 [INF] Starting restore for stream '$G > $MQTT_msgs'
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.633985 [INF] Restored 0 messages for stream '$G > $MQTT_msgs' in 6ms
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.634478 [INF] Starting restore for stream '$G > $MQTT_out'
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.635200 [INF] Restored 0 messages for stream '$G > $MQTT_out' in 1ms
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.635734 [INF] Starting restore for stream '$G > $MQTT_qos2in'
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.636200 [INF] Restored 0 messages for stream '$G > $MQTT_qos2in' in 0s
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.636755 [INF] Starting restore for stream '$G > $MQTT_rmsgs'
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.637255 [INF] Restored 0 messages for stream '$G > $MQTT_rmsgs' in 0s
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.638113 [INF] Starting restore for stream '$G > $MQTT_sess'
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.638857 [INF] Restored 72 messages for stream '$G > $MQTT_sess' in 1ms
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.639358 [INF] Starting restore for stream '$G > metrics_cloud'
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.640232 [INF] Restored 26 messages for stream '$G > metrics_cloud' in 1ms
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.641628 [INF] Starting restore for stream '$G > metrics_latest'
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.643215 [INF] Restored 1 messages for stream '$G > metrics_latest' in 2ms
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.644541 [INF] Recovering 4 consumers for stream - '$G > $MQTT_rmsgs'
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.644978 [WRN] Error unmarshalling consumer metafile "/var/lib/nats-server/data/jetstream/$G/streams/$MQTT_rmsgs/obs/$MQTT_rmsgs_NC7S88QeqzzL8xGevQSyfD/meta.inf": unexpected end of JSON input
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.648010 [INF] Recovering 1 consumers for stream - '$G > metrics_cloud'
Oct 8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.648637 [WRN] Detected consumer '$G > metrics_cloud > X_supervision' ack floor 247 is ahead of stream's last sequence 26
I will investigate in the coming days, but the issue seems to come from the fact that writes of the stream state are interrupted midway, leaving a corrupted file/state. Is the new data written over the old one? (It seems to be the case, but I'm not sure.) If so, maybe writing the new meta/state to a new file and then os.Rename-ing it over the old one would at least guarantee that the data is not corrupted.
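A minimal Go sketch of that write-then-rename idea; writeFileAtomic is a hypothetical helper, not an existing nats-server function:

package main

import (
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file in the same directory,
// fsyncs it, and then renames it over the destination path. Since the
// rename is atomic on POSIX filesystems, a crash leaves either the old
// file or the new one, never a half-written mix of both.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmp.Name(), perm); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	_ = writeFileAtomic("meta.inf", []byte(`{"name":"example"}`), 0600)
}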
Do you do lots of stream updates?
What does that file (meta.inf) look like, could you share with us?
Not really. I have 1 stream in memory receiving 1 msg per second, and a copy in another stream on disk receiving 1 message every 10 seconds. Less than 1 KB per message. The server is at the "edge" with no power backup.
I will try to investigate and create an issue/PR according to my findings, and try to share the whole JetStream store with you when the issue arises.
I managed to reproduce it on my computer: download the Bitnami VirtualBox appliance (which makes it possible to simulate a power outage): https://bitnami.com/stack/nats/virtual-machine. Open the port and enable JetStream in /opt/bitnami/nats:
Create a stream: ./nats -s nats://192.168.YY.XX stream create --config 0001_latest.json
{
"config": {
"name": "metrics_latest",
"subjects": [
"metrics.>"
],
"retention": "limits",
"max_consumers": -1,
"max_msgs_per_subject": 1,
"max_msgs": -1,
"max_bytes": -1,
"max_age": 120000000000,
"max_msg_size": -1,
"storage": "file",
"discard": "old",
"num_replicas": 1,
"duplicate_window": 120000000000,
"sealed": false,
"deny_delete": false,
"deny_purge": false,
"allow_rollup_hdrs": false,
"allow_direct": true,
"mirror_direct": false,
"consumer_limits": {}
}
}
Add some data: watch -n 0.1 ./nats -s nats://192.168.YY.XX pub metrics.TEST "AAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQ"
Then try restarting the VM multiple times.
After a few tries:
[352] 2024/10/13 10:21:06.338321 [INF] Starting nats-server
[352] 2024/10/13 10:21:06.342972 [INF] Version: 2.10.21
[352] 2024/10/13 10:21:06.342977 [INF] Git: [d3a8868]
[352] 2024/10/13 10:21:06.342980 [INF] Name: NDTMWG6VXQF66HSNFB5BO6HCP2VX5WOJ6ISSPITYXNFVFYT6NCHCIJYI
[352] 2024/10/13 10:21:06.342984 [INF] Node: cO2FmMr9
[352] 2024/10/13 10:21:06.342986 [INF] ID: NDTMWG6VXQF66HSNFB5BO6HCP2VX5WOJ6ISSPITYXNFVFYT6NCHCIJYI
[352] 2024/10/13 10:21:06.342992 [INF] Using configuration file: /opt/bitnami/nats/conf/nats-server.conf
[352] 2024/10/13 10:21:06.344746 [INF] Starting http monitor on 0.0.0.0:8222
[352] 2024/10/13 10:21:06.345571 [INF] Starting JetStream
[352] 2024/10/13 10:21:06.345638 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __
[352] 2024/10/13 10:21:06.345641 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[352] 2024/10/13 10:21:06.345643 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[352] 2024/10/13 10:21:06.345645 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[352] 2024/10/13 10:21:06.345646 [INF]
[352] 2024/10/13 10:21:06.345648 [INF] https://docs.nats.io/jetstream
[352] 2024/10/13 10:21:06.345649 [INF]
[352] 2024/10/13 10:21:06.345651 [INF] ---------------- JETSTREAM ----------------
[352] 2024/10/13 10:21:06.345655 [INF] Max Memory: 953.67 MB
[352] 2024/10/13 10:21:06.345657 [INF] Max Storage: 14.90 GB
[352] 2024/10/13 10:21:06.345659 [INF] Store Directory: "/opt/bitnami/nats/jetstream/jetstream"
[352] 2024/10/13 10:21:06.345660 [INF] -------------------------------------------
[352] 2024/10/13 10:21:06.348814 [INF] Starting restore for stream '$G > metrics_latest'
[352] 2024/10/13 10:21:06.350048 [WRN] Filestore [metrics_latest] Stream state outdated, last block has additional entries, will rebuild
[352] 2024/10/13 10:21:06.350054 [WRN] Filestore [metrics_latest] Recovering stream state from index errored: prior state file
[352] 2024/10/13 10:21:06.356364 [INF] Restored 0 messages for stream '$G > metrics_latest' in 8ms
[352] 2024/10/13 10:21:06.356426 [INF] Listening for client connections on 0.0.0.0:4222
[352] 2024/10/13 10:21:06.356498 [ERR] Address "0.0.0.0" can not be resolved properly
[352] 2024/10/13 10:21:06.356504 [INF] Server is ready
A backup of the JetStream store just before restarting it: https://file.io/fSzkRNDrDXqM
With multiple streams and consumers there are other consequences:
[4602] 2024/10/13 12:00:11.931818 [INF] Starting JetStream
[4602] 2024/10/13 12:00:11.931949 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __
[4602] 2024/10/13 12:00:11.931971 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[4602] 2024/10/13 12:00:11.931974 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[4602] 2024/10/13 12:00:11.931975 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[4602] 2024/10/13 12:00:11.931977 [INF]
[4602] 2024/10/13 12:00:11.931978 [INF] https://docs.nats.io/jetstream
[4602] 2024/10/13 12:00:11.931980 [INF]
[4602] 2024/10/13 12:00:11.932000 [INF] ---------------- JETSTREAM ----------------
[4602] 2024/10/13 12:00:11.932023 [INF] Max Memory: 953.67 MB
[4602] 2024/10/13 12:00:11.932028 [INF] Max Storage: 14.90 GB
[4602] 2024/10/13 12:00:11.932030 [INF] Store Directory: "/tmp/nats-test/jetstream"
[4602] 2024/10/13 12:00:11.932032 [INF] Domain: XX-system
[4602] 2024/10/13 12:00:11.932034 [INF] API Level: 1
[4602] 2024/10/13 12:00:11.932056 [INF] -------------------------------------------
[4602] 2024/10/13 12:00:11.932413 [INF] Starting restore for stream '$G > $MQTT_msgs'
[4602] 2024/10/13 12:00:11.932668 [INF] Restored 0 messages for stream '$G > $MQTT_msgs' in 0s
[4602] 2024/10/13 12:00:11.932752 [INF] Starting restore for stream '$G > $MQTT_out'
[4602] 2024/10/13 12:00:11.932948 [INF] Restored 0 messages for stream '$G > $MQTT_out' in 0s
[4602] 2024/10/13 12:00:11.933021 [INF] Starting restore for stream '$G > $MQTT_qos2in'
[4602] 2024/10/13 12:00:11.934186 [INF] Restored 0 messages for stream '$G > $MQTT_qos2in' in 1ms
[4602] 2024/10/13 12:00:11.934259 [INF] Starting restore for stream '$G > $MQTT_rmsgs'
[4602] 2024/10/13 12:00:11.934429 [INF] Restored 0 messages for stream '$G > $MQTT_rmsgs' in 0s
[4602] 2024/10/13 12:00:11.934498 [INF] Starting restore for stream '$G > $MQTT_sess'
[4602] 2024/10/13 12:00:11.934698 [INF] Restored 74 messages for stream '$G > $MQTT_sess' in 0s
[4602] 2024/10/13 12:00:11.934771 [INF] Starting restore for stream '$G > metrics_cloud'
[4602] 2024/10/13 12:00:11.934898 [WRN] Filestore [metrics_cloud] Stream state outdated, last block has additional entries, will rebuild
[4602] 2024/10/13 12:00:11.934920 [WRN] Filestore [metrics_cloud] Recovering stream state from index errored: prior state file
[4602] 2024/10/13 12:00:11.938204 [INF] Restored 0 messages for stream '$G > metrics_cloud' in 3ms
[4602] 2024/10/13 12:00:11.938283 [INF] Starting restore for stream '$G > metrics_latest'
[4602] 2024/10/13 12:00:11.938416 [WRN] Filestore [metrics_latest] Stream state outdated, last block has additional entries, will rebuild
[4602] 2024/10/13 12:00:11.938438 [WRN] Filestore [metrics_latest] Recovering stream state from index errored: prior state file
[4602] 2024/10/13 12:00:11.948774 [INF] Restored 0 messages for stream '$G > metrics_latest' in 10ms
[4602] 2024/10/13 12:00:11.948856 [INF] Recovering 2 consumers for stream - '$G > $MQTT_rmsgs'
[4602] 2024/10/13 12:00:11.948906 [WRN] Error unmarshalling consumer metafile "/tmp/nats-test/jetstream/$G/streams/$MQTT_rmsgs/obs/$MQTT_rmsgs_NC7S88QeqzzL8xGevQSyfD/meta.inf": unexpected end of JSON input
[4602] 2024/10/13 12:00:11.949450 [INF] Recovering 1 consumers for stream - '$G > metrics_cloud'
[4602] 2024/10/13 12:00:11.949608 [WRN] Detected consumer '$G > metrics_cloud > XX_supervision' ack floor 16196 is ahead of stream's last sequence 0
[4602] 2024/10/13 12:00:11.949832 [INF] Listening for websocket clients on ws://0.0.0.0:1443
About the last log line: the issue seems to come from the fact that metrics_cloud is s2-compressed. I replaced the rebuildState call with the following:
- ld, tombs, _ := mb.rebuildState()
+ ld, tombs, err := mb.rebuildState()
+ if err != nil {
+ return mb, err
+ }
Then the error is this one:
Error recreating stream "metrics_cloud": error reading compression reader: s2: unsupported input (10077)
I'm not sure how tolerant the "uncompressed" format is to partial write failures, but it doesn't seem to be the case for s2.
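For illustration only (not nats-server code), here is roughly why a partially written s2-compressed block cannot be recovered, using the github.com/klauspost/compress/s2 package; an uncompressed block would only lose its tail records:

package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/compress/s2"
)

func main() {
	payload := bytes.Repeat([]byte("AAAQ"), 4096)
	compressed := s2.Encode(nil, payload)

	// Simulate a power cut in the middle of the write: only a prefix of
	// the compressed block made it to disk.
	truncated := compressed[:len(compressed)/2]

	if _, err := s2.Decode(nil, truncated); err != nil {
		// The whole block is unreadable, not just the missing tail.
		fmt.Println("decode failed:", err)
	}
}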
I continued the investigation. The stream has one consumer, which is pulled continuously. What happens:
The problem with "ZZZZ is ahead of stream's last sequence 0" is due to that behaviour. The consumer state "o.dat" is flushed very quickly (open/write/close), which is not the case for the message block, where some parts are written but not flushed: the close happens only when the file is about to be deleted. If the power outage happens between two messages, the stream data may be lost, but the consumer state is much less likely to be.
@diorcety Can you please confirm whether you are also testing this with sync: always, as in the original post? If not, can you please set it and see if you can reproduce?
jetstream {
sync: always
}
I have a feeling I know what's going on but this should confirm it either way.
@neilalexander Indeed, that "fixes" my issue. Nevertheless, the behaviour without "sync: always" seems incorrect. The sync timer should flush/sync all the states (streams and consumers) simultaneously, with no other syncs outside that timer, no?
Edit: if losing a few messages is an issue, then sync: always seems to be the way to minimize it. But if that is acceptable, recovering to a good state that is merely missing a few messages should be the behaviour.
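A rough sketch, with purely illustrative types, of the behaviour suggested here: a single periodic timer that flushes both stream blocks and consumer state files in the same pass, with no ad-hoc syncs in between:

package main

import (
	"os"
	"time"
)

// store groups the open files for stream blocks and consumer state;
// the field names are illustrative, not nats-server internals.
type store struct {
	streamFiles   []*os.File
	consumerFiles []*os.File
}

// syncAll flushes stream and consumer files together so that, after a
// crash, both sides reflect the same point in time.
func (s *store) syncAll() {
	for _, f := range s.streamFiles {
		_ = f.Sync() // real code would surface these errors
	}
	for _, f := range s.consumerFiles {
		_ = f.Sync()
	}
}

func (s *store) runSyncLoop(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			s.syncAll()
		case <-stop:
			s.syncAll() // final flush on shutdown
			return
		}
	}
}

func main() {
	s := &store{}
	stop := make(chan struct{})
	go s.runSyncLoop(2*time.Minute, stop)
	time.Sleep(10 * time.Millisecond)
	close(stop)
}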
I made further investigations. The issue mainly appears when the stream is empty.
Status: the stream index is outdated (the last block referenced inside the index has already been processed and removed), and the new block file has been created but is empty. The consumer state contains a sequence id that may not be the latest, but is newer than the one in the stream index.
The consumer has an ack sequence number, but in this case the only sequence available for the stream is the one saved in the index file, and that index is often outdated. There is no way to know the stream's message sequence number, because the new block is empty. It is supposed to contain a tombstone, but tombstones are not synced.
At a sync point you may have correct data for both stream and consumer; then: 1) the remaining message is processed, 2) the message block is removed, 3) a new one is created with a tombstone, but without a sync. It seems that in this case the remove filesystem action is persisted but not the tombstone write: at restart I only have an empty message block. Are inode filesystem operations synced more quickly?
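A hedged sketch of that failure window (illustrative code, not the server's): the old block is removed and a tombstone is written to the new block, but nothing is fsync'd, so after a power cut the directory change can survive while the tombstone bytes do not:

package main

import "os"

// rotateBlock removes the fully processed block and starts a new one that
// begins with a tombstone record carrying the stream's last sequence.
func rotateBlock(oldBlk, newBlk string, tombstone []byte) error {
	if err := os.Remove(oldBlk); err != nil && !os.IsNotExist(err) {
		return err
	}
	f, err := os.OpenFile(newBlk, os.O_CREATE|os.O_WRONLY, 0600)
	if err != nil {
		return err
	}
	defer f.Close()

	if _, err := f.Write(tombstone); err != nil {
		return err
	}
	// Without this Sync (and, strictly, a sync of the parent directory),
	// a power cut can leave newBlk present but empty, which matches the
	// "ack floor ... ahead of stream's last sequence 0" state on restart.
	return f.Sync()
}

func main() {
	_ = rotateBlock("1.blk", "2.blk", []byte("tombstone last_seq=16196"))
}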
I'm currently experimenting with modifications:
https://github.com/diorcety/nats-server/commit/62dadc5b7c29a639c6448082c7515dda565a681b Avoid overwriting a file: always create a new one and rename it. This avoids incorrect content, notably when compression is used, where truncation is not just a simple file truncation.
https://github.com/diorcety/nats-server/commit/4b665151c794f60641a8f7487fe546a18ba55205 Always sync the tombstone. When the last msgblk is removed, a new one is directly created with a tombstone containing the stream information. If this information is lost, due to a power outage for example, we reach the state described in this issue.
https://github.com/diorcety/nats-server/commit/305cd4e8b04e41e4bbbbf04364d3a42986fce7b2 Sync the consumer state at the same time as the other syncs. Before this change, the consumer state was never explicitly synced unless SyncAlways was set.
I easily reproduced this issue using my platform watchdog (almost every time); with these modifications, no more issues after 10 tries. NATS, streams and consumers are correctly restored.
trace.txt: Before 12:48:19.945151, my consumer puller is not started; we can see that the files are only synced during the periodic sync. Then my consumer continuously pulls messages, and you can see that after the first write into a new message block file (the tombstone), the file is synced.
Observed behavior
When the server is unexpectedly powered off and then restarted, all consumers are unable to self-recover and fail to receive messages. The log is as follows:
After the restart, the situation described in this issue reproduced almost 100% of the time: the consumer's ack floor is greater than the stream's last sequence, leaving all consumers unable to consume.
Note that:
This appears to be the last blk index recorded in index.db, which cannot actually be found on disk. The inconsistency may be caused by non-atomic writing of index.db and the blk files to disk. When func (fs *fileStore) recoverFullState() (rerr error) in server/filestore.go cannot find the corresponding [index].blk, it defaults the stream's LastSeq to 0. The consumer's meta is persisted independently, and func (o *consumer) checkStateForInterestStream() error in server/consumer.go then reads it, finds that the recorded AckFloor is higher than the stream's actual LastSeq, and throws an error.
By the way, I've noticed that the persistence and recovery mechanisms of the filestore vary slightly across versions.
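An illustrative sketch of that chain of checks (not the actual server source): recovery falls back to LastSeq 0 when the block referenced by the index is missing, and the independently persisted consumer state then violates the AckFloor <= LastSeq invariant:

package main

import (
	"errors"
	"fmt"
	"os"
)

// recoverLastSeq falls back to 0 when the block file referenced by the
// index cannot be found on disk (paths and names are illustrative).
func recoverLastSeq(indexLastSeq uint64, lastBlkPath string) uint64 {
	if _, err := os.Stat(lastBlkPath); errors.Is(err, os.ErrNotExist) {
		return 0
	}
	return indexLastSeq
}

// checkConsumer reproduces the invariant behind the warning in the logs.
func checkConsumer(ackFloor, streamLastSeq uint64) error {
	if ackFloor > streamLastSeq {
		return fmt.Errorf("consumer ack floor %d is ahead of stream's last sequence %d",
			ackFloor, streamLastSeq)
	}
	return nil
}

func main() {
	lastSeq := recoverLastSeq(16196, "/nonexistent/123.blk")
	if err := checkConsumer(16196, lastSeq); err != nil {
		fmt.Println("WRN:", err)
	}
}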
I tested various sync time periods, including 'sync: always', but it doesn't seem to have much effect:
The related issue is https://github.com/nats-io/nats-server/issues/4566. I believe this is a server-side problem and should not be worked around by client-side operations like deleting and re-adding consumers.
This issue is beyond my ability to fix quickly. I earnestly seek assistance. @derekcollison @bruth
If you need more information or logs, please feel free to contact me, and I will provide them as soon as possible.
Expected behavior
After the server is unexpectedly powered off and then restarted, all consumers should be able to self-recover and messages should be consumed normally. Some degree of data loss is acceptable.
Server and client version
I conducted power-off tests on the following server versions, and all exhibited the issue:
All client versions have been cross-tested:
Host environment
I have used multiple servers for testing, with various file systems (xfs, ext4) and disk types (NAND flash, SSD, mechanical hard disk), to rule out the possibility of randomness.
eg:
Steps to reproduce
After normal production and consumption for a period of time, cut off the server's power supply directly and then restart it; you can reproduce the issue almost 100% of the time.
Using reboot or kill -9 does not seem to simulate this process.