nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Consumer not receiving messages when power off and restart, consumer's ack floor is ahead of stream's last sequence [v2.10.14] #5412

Open stitchcula opened 5 months ago

stitchcula commented 5 months ago

Observed behavior

When the server is unexpectedly powered off and then restarted, all consumers are unable to self-recover and fail to receive messages. The log is as follows:

[1022] 2024/05/11 16:31:07.334599 [INF] Starting nats-server
[1022] 2024/05/11 16:31:07.382891 [INF]   Version:  2.10.14
[1022] 2024/05/11 16:31:07.383039 [INF]   Git:      [31af767]
[1022] 2024/05/11 16:31:07.383055 [DBG]   Go build: go1.21.9
[1022] 2024/05/11 16:31:07.383112 [INF]   Name:     NCFGS7EGBHTKZ46NGOLKN365O3RGONUATABFDU3MDRJOOWGBOEKH3LQ4
[1022] 2024/05/11 16:31:07.383150 [INF]   Node:     jAsr7q4d
[1022] 2024/05/11 16:31:07.383172 [INF]   ID:       NCFGS7EGBHTKZ46NGOLKN365O3RGONUATABFDU3MDRJOOWGBOEKH3LQ4
[1022] 2024/05/11 16:31:07.383248 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[1022] 2024/05/11 16:31:07.383363 [INF] Using configuration file: /etc/nats/simpleJetStream.conf
[1022] 2024/05/11 16:31:07.383430 [WRN] Maximum payloads over 8.00 MB are generally discouraged and could lead to poor performance
[1022] 2024/05/11 16:31:07.513725 [INF] Starting http monitor on 0.0.0.0:8422
[1022] 2024/05/11 16:31:07.514385 [INF] Starting JetStream
[1022] 2024/05/11 16:31:07.553176 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[1022] 2024/05/11 16:31:07.553260 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[1022] 2024/05/11 16:31:07.553283 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[1022] 2024/05/11 16:31:07.553303 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[1022] 2024/05/11 16:31:07.553323 [INF] 
[1022] 2024/05/11 16:31:07.553342 [INF]          https://docs.nats.io/jetstream
[1022] 2024/05/11 16:31:07.553361 [INF] 
[1022] 2024/05/11 16:31:07.553381 [INF] ---------------- JETSTREAM ----------------
[1022] 2024/05/11 16:31:07.553433 [INF]   Max Memory:      3.73 GB
[1022] 2024/05/11 16:31:07.553464 [INF]   Max Storage:     93.13 GB
[1022] 2024/05/11 16:31:07.553488 [INF]   Store Directory: "/opt/data/nats/jetstream"
[1022] 2024/05/11 16:31:07.553510 [INF] -------------------------------------------
[1022] 2024/05/11 16:31:08.304405 [INF]   Starting restore for stream '$G > XANOLE_OUTPUT_VALUES'
[1022] 2024/05/11 16:31:08.403004 [WRN] Filestore [XANOLE_OUTPUT_VALUES] Stream state detected prior state, could not locate msg block 5054
[1022] 2024/05/11 16:31:08.405046 [INF]   Restored 0 messages for stream '$G > XANOLE_OUTPUT_VALUES' in 101ms
[1022] 2024/05/11 16:31:11.701798 [INF]   Recovering 2 consumers for stream - '$G > XANOLE_OUTPUT_VALUES'
[1022] 2024/05/11 16:31:13.597179 [WRN] Detected consumer '$G > XANOLE_OUTPUT_VALUES > consumer_default' ack floor 14084 is ahead of stream's last sequence 0
[1022] 2024/05/11 16:31:13.597335 [WRN] Detected consumer '$G > XANOLE_OUTPUT_VALUES > consumer_401899c7-a201-42b4-9953-e59d4b427d43' ack floor 14084 is ahead of stream's last sequence 0

After the restart, the situation described in this issue reproduced almost every time: the consumer's ack floor is greater than the stream's last sequence, so none of the consumers can consume.

Note that:

[WRN] Filestore [XANOLE_OUTPUT_VALUES] Stream state detected prior state, could not locate msg block 5054
...
[WRN] Detected consumer '$G > XANOLE_OUTPUT_VALUES > consumer_default' ack floor 14084 is ahead of stream's last sequence 0

The block index recorded in index.db appears to point at a msg block that cannot actually be found on disk. This inconsistency may be caused by the non-atomic syncing of index.db and the blk files to disk. When func (fs *fileStore) recoverFullState() (rerr error) in server/filestore.go cannot find the corresponding [index].blk, it falls back to setting the stream's LastSeq to 0. The consumer's meta is persisted independently, and func (o *consumer) checkStateForInterestStream() error in server/consumer.go then finds that the recorded AckFloor is higher than the stream's actual LastSeq and reports the error.
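For intuition, here is a self-contained sketch of the check described above (illustrative names and bodies only; the real logic lives in checkStateForInterestStream in server/consumer.go and the recovery path in server/filestore.go):

package main

import "fmt"

// Illustrative sketch only: approximates the consistency check described above.
func checkConsumerAgainstStream(ackFloor, streamLastSeq uint64) error {
	// When recoverFullState() cannot find the msg block referenced by
	// index.db, the stream's LastSeq effectively falls back to 0, so any
	// non-zero ack floor trips this condition.
	if ackFloor > streamLastSeq {
		return fmt.Errorf("consumer ack floor %d is ahead of stream's last sequence %d",
			ackFloor, streamLastSeq)
	}
	return nil
}

func main() {
	fmt.Println(checkConsumerAgainstStream(14084, 0)) // mirrors the warning in the log above
}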

By the way, I've noticed that the filestore's persistence and recovery mechanisms vary slightly between server versions.

I tested various sync intervals, including 'sync: always', but it doesn't seem to have much effect:

jetstream {
    store_dir: /userdata/nats
    max_mem: 100M
    max_file: 500M
    sync: always # 1s to 2min
}

The related issue is https://github.com/nats-io/nats-server/issues/4566 . I believe this is a server-side problem and should not be circumvented by client-side operations like deleting and re-adding consumers.

This issue is beyond my ability to fix quickly. I earnestly seek assistance. @derekcollison @bruth

If you need more information or logs, please feel free to contact me, and I will provide them as soon as possible.

Expected behavior

After the server is unexpectedly powered off and then restarted, all consumers should be able to self-recover and messages should be consumed normally. Some degree of data loss is acceptable.

Server and client version

I conducted power-off tests on the following server versions, and all exhibited the issue:

All client versions have been cross-tested:

Host environment

I have used multiple servers for testing, and have tested with various file systems (xfs, ext4) and types of disks (nano flash, SSD, mechanical hard disk) to rule out the possibility of randomness.

e.g.:

Linux XXX 5.10.110 #3 SMP Tue Oct 24 15:37:17 CST 2023 aarch64 GNU/Linux, ext4, nano flash
Linux XXX 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 GNU/Linux, xfs, SSD
Linux XXX 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 GNU/Linux, xfs, HDD

Steps to reproduce

jetstream {
    store_dir: /userdata/nats
    max_mem: 100M
    max_file: 500M
    sync: always # 1s to 2min
}
Information for Consumer XANOLE_OUTPUT_VALUES > consumer_default created 2024-05-11T16:03:32+08:00

Configuration:

                    Name: consumer_default
        Delivery Subject: _INBOX.YOhDaj4BWhBtA4ueOQxq0x
          Filter Subject: xanole.outputs.xxx>
          Deliver Policy: All
     Deliver Queue Group: consumer_default
              Ack Policy: Explicit
                Ack Wait: 30.00s
           Replay Policy: Instant
         Max Ack Pending: 1,000
            Flow Control: false

State:

  Last Delivered Message: Consumer sequence: 1,074,833 Stream sequence: 1,074,913 Last delivery: 423µs ago
    Acknowledgment Floor: Consumer sequence: 1,074,832 Stream sequence: 1,074,912 Last Ack: 2ms ago
        Outstanding Acks: 1 out of maximum 1,000
    Redelivered Messages: 0
    Unprocessed Messages: 0
         Active Interest: Active using Queue Group xhouse_inputs_agents_default
derekcollison commented 5 months ago

After cutting the power, could you snapshot the store directory and share with us privately after reboot but before you restart the nats-server?

Also, what size payloads have you set the system to use? I see it is complaining they are over 8MB which we strongly discourage.

derekcollison commented 5 months ago

Or provide more information on the number of messages inbound into the WQ, how big they are, and how the consumers are operating. Does the stream usually have messages, or do the consumers usually consume messages quickly and keep the number of messages very low or at zero?

bfoxstudio commented 5 months ago

Same problem in our KV tests when the container is rebooted by draining the Docker node or rebooting the VM: https://github.com/nats-io/nats-server/issues/5205

yuzhou-nj commented 5 months ago

I have encountered this problem. The consumer's sequence exceeds the stream's, so I cannot get new messages from the stream until the stream's sequence catches up with the consumer's.

derekcollison commented 5 months ago

We are trying to reproduce, so any and all information on how you are interacting with the stream is helpful.

How fast are messages being sent into the system? How fast are they being consumed?

In general, is the stream near or at empty most of the time?

katrinwab commented 5 months ago

In our case, 1 msg per second was sent; the cluster has 5 nodes.

We drained the nodes holding the meta leader and the stream leader, waited 2 minutes, then activated the nodes again. After that, the consumer didn't see all messages (some messages were still being pulled, but jetstream_consumer_num_pending kept increasing).

katrinwab commented 5 months ago

We also have problems when we drain a non-leader node: afterwards, jetstream_stream_last_seq on the drained node is less than on the non-drained nodes, and that difference grows over time.

wallyqs commented 5 months ago

Thanks everyone for sharing the symptoms of this issue; we're currently looking for a reproducible example that trips this condition. @stitchcula I see in the logs that you have a warning about maximum payloads over 8.00 MB. Do you have an idea of the size of the messages you are sending, or whether persisted messages are close to 8 MB?

bfoxstudio commented 5 months ago

Thanks everyone for sharing the symptoms of this issue; we're currently looking for a reproducible example that trips this condition. @stitchcula I see in the logs that you have a warning about maximum payloads over 8.00 MB. Do you have an idea of the size of the messages you are sending, or whether persisted messages are close to 8 MB?

We have standard 1 MB payloads. We tested versions from master down to 2.10.5, and the consumer and last-sequence problems reproduce every time the Docker server is restarted.

stitchcula commented 5 months ago

In my testing, the payload size is around 50 KB, with a consumption rate of 100 messages per second. The consumers usually consume messages quickly and keep the number of pending messages very low.

To avoid incidents in production, I temporarily forked a branch and made an aggressive fix: if consumer initialization finds the ack floor higher than the stream's lastSeq, it forcibly resets it to lastSeq.

This allows the consumer to consume normally, but data loss occurs. I believe the fundamental issue is that the stream's persistence mechanism can be improved so that a power outage does not lose the cursor.
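Roughly, the idea of that workaround looks like this (a minimal sketch with made-up type and field names, not the actual patch in the fork):

package main

import "fmt"

// consumerState is a stand-in with hypothetical field names; only the cursors
// relevant to this sketch are shown.
type consumerState struct {
	AckFloorStreamSeq  uint64
	DeliveredStreamSeq uint64
}

// clampToStream forcibly resets cursors that are ahead of the stream's
// lastSeq, which lets the consumer resume at the cost of accepting data loss.
func clampToStream(cs *consumerState, streamLastSeq uint64) {
	if cs.AckFloorStreamSeq > streamLastSeq {
		cs.AckFloorStreamSeq = streamLastSeq
	}
	if cs.DeliveredStreamSeq > streamLastSeq {
		cs.DeliveredStreamSeq = streamLastSeq
	}
}

func main() {
	cs := &consumerState{AckFloorStreamSeq: 14084, DeliveredStreamSeq: 14084}
	clampToStream(cs, 0) // after a bad recovery the stream reports lastSeq 0
	fmt.Printf("%+v\n", cs)
}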

stitchcula commented 5 months ago

snapshot the store directory and share with us privately after reboot but before you restart the nats-server

I have already taken a snapshot of the store directory after rebooting but before restarting the nats-server. How can I share it with you?

Thanks for the hard work!

wallyqs commented 5 months ago

@stitchcula you can send to wally@nats.io if possible?

derekcollison commented 4 months ago

We believe this is now fixed, and will be part of 2.10.17 release.

diorcety commented 1 month ago

I hit the exact same issue when I tested power-outage behaviour with 2.10.21.

Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.627588 [INF]   Starting restore for stream '$G > $MQTT_msgs'
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.633985 [INF]   Restored 0 messages for stream '$G > $MQTT_msgs' in 6ms
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.634478 [INF]   Starting restore for stream '$G > $MQTT_out'
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.635200 [INF]   Restored 0 messages for stream '$G > $MQTT_out' in 1ms
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.635734 [INF]   Starting restore for stream '$G > $MQTT_qos2in'
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.636200 [INF]   Restored 0 messages for stream '$G > $MQTT_qos2in' in 0s
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.636755 [INF]   Starting restore for stream '$G > $MQTT_rmsgs'
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.637255 [INF]   Restored 0 messages for stream '$G > $MQTT_rmsgs' in 0s
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.638113 [INF]   Starting restore for stream '$G > $MQTT_sess'
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.638857 [INF]   Restored 72 messages for stream '$G > $MQTT_sess' in 1ms
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.639358 [INF]   Starting restore for stream '$G > metrics_cloud'
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.640232 [INF]   Restored 26 messages for stream '$G > metrics_cloud' in 1ms
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.641628 [INF]   Starting restore for stream '$G > metrics_latest'
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.643215 [INF]   Restored 1 messages for stream '$G > metrics_latest' in 2ms
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.644541 [INF]   Recovering 4 consumers for stream - '$G > $MQTT_rmsgs'
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.644978 [WRN]     Error unmarshalling consumer metafile "/var/lib/nats-server/data/jetstream/$G/streams/$MQTT_rmsgs/obs/$MQTT_rmsgs_NC7S88QeqzzL8xGevQSyfD/meta.inf": unexpected end of JSON input
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.648010 [INF]   Recovering 1 consumers for stream - '$G > metrics_cloud'
Oct  8 10:28:19 me-x86-64-0001c036581f nats-server[8034]: [8034] 2024/10/08 10:28:19.648637 [WRN] Detected consumer '$G > metrics_cloud > X_supervision' ack floor 247 is ahead of stream's last sequence 26
diorcety commented 1 month ago

I will investigate in the coming days, but the issue seems to stem from the stream state writes being interrupted mid-way, leading to a corrupted file/state. Is the new data written over the old one? (That seems to be the case, but I'm not sure.) If so, writing the new meta/state into a new directory and then os.Rename-ing the new one over the old would at least guarantee that the data is not corrupted.
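For reference, this is the generic write-to-temp-then-rename pattern being suggested (a sketch, not nats-server code; note that a fully durable rename would also fsync the parent directory):

package main

import (
	"log"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file in the same directory,
// syncs it, then renames it over the destination. A crash mid-write leaves
// either the old file or the new one on disk, never a truncated mix.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush file contents to disk
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmp.Name(), perm); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path) // atomic on POSIX filesystems
}

func main() {
	if err := writeFileAtomic("meta.inf", []byte(`{"example": true}`), 0600); err != nil {
		log.Fatal(err)
	}
}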

derekcollison commented 1 month ago

Do you do lots of stream updates?

What does that file (meta.inf) look like, could you share with us?

diorcety commented 1 month ago

Not really. I have one stream in memory receiving 1 msg per second, and a copy in another stream on disk receiving 1 message every 10 seconds, with less than 1 KB per message. The server is at the "edge", with no power backup.

I will try to investigate and create an issue/PR according to my findings, and try to share the whole JetStream store with you when the issue arises.

diorcety commented 3 weeks ago

I managed to reproduce it on my computer: download the Bitnami VirtualBox appliance (which makes it easy to simulate a power outage): https://bitnami.com/stack/nats/virtual-machine, open the port, and enable JetStream in /opt/bitnami/nats:

Create a stream: ./nats -s nats://192.168.YY.XX stream create --config 0001_latest.json

{
  "config": {
    "name": "metrics_latest",
    "subjects": [
      "metrics.>"
    ],
    "retention": "limits",
    "max_consumers": -1,
    "max_msgs_per_subject": 1,
    "max_msgs": -1,
    "max_bytes": -1,
    "max_age": 120000000000,
    "max_msg_size": -1,
    "storage": "file",
    "discard": "old",
    "num_replicas": 1,
    "duplicate_window": 120000000000,
    "sealed": false,
    "deny_delete": false,
    "deny_purge": false,
    "allow_rollup_hdrs": false,
    "allow_direct": true,
    "mirror_direct": false,
    "consumer_limits": {}
  }
}

Add some data: watch -n 0.1 ./nats -s nats://192.168.YY.XX pub metrics.TEST "AAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQAAAQQQQQQQQQQQQQQQQQ"
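The same publish load can also be generated from Go; a sketch using the nats.go client (it assumes the default client port 4222 on the address above, and the payload size only roughly matches the CLI example):

package main

import (
	"log"
	"strings"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Same server as in the steps above; 4222 is the default client port.
	nc, err := nats.Connect("nats://192.168.YY.XX:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Payload size only roughly matches the CLI example (~460 bytes).
	payload := []byte(strings.Repeat("AAAQQQQQQQQQQQQQQQQQ", 23))

	// Publish ~10 msgs/s into the metrics.> subjects until the process is killed.
	for {
		if _, err := js.Publish("metrics.TEST", payload); err != nil {
			log.Println("publish error:", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}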

Then try restarting the VM multiple times.

After a few tries:


[352] 2024/10/13 10:21:06.338321 [INF] Starting nats-server
[352] 2024/10/13 10:21:06.342972 [INF]   Version:  2.10.21
[352] 2024/10/13 10:21:06.342977 [INF]   Git:      [d3a8868]
[352] 2024/10/13 10:21:06.342980 [INF]   Name:     NDTMWG6VXQF66HSNFB5BO6HCP2VX5WOJ6ISSPITYXNFVFYT6NCHCIJYI
[352] 2024/10/13 10:21:06.342984 [INF]   Node:     cO2FmMr9
[352] 2024/10/13 10:21:06.342986 [INF]   ID:       NDTMWG6VXQF66HSNFB5BO6HCP2VX5WOJ6ISSPITYXNFVFYT6NCHCIJYI
[352] 2024/10/13 10:21:06.342992 [INF] Using configuration file: /opt/bitnami/nats/conf/nats-server.conf
[352] 2024/10/13 10:21:06.344746 [INF] Starting http monitor on 0.0.0.0:8222
[352] 2024/10/13 10:21:06.345571 [INF] Starting JetStream
[352] 2024/10/13 10:21:06.345638 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[352] 2024/10/13 10:21:06.345641 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[352] 2024/10/13 10:21:06.345643 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[352] 2024/10/13 10:21:06.345645 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[352] 2024/10/13 10:21:06.345646 [INF]
[352] 2024/10/13 10:21:06.345648 [INF]          https://docs.nats.io/jetstream
[352] 2024/10/13 10:21:06.345649 [INF]
[352] 2024/10/13 10:21:06.345651 [INF] ---------------- JETSTREAM ----------------
[352] 2024/10/13 10:21:06.345655 [INF]   Max Memory:      953.67 MB
[352] 2024/10/13 10:21:06.345657 [INF]   Max Storage:     14.90 GB
[352] 2024/10/13 10:21:06.345659 [INF]   Store Directory: "/opt/bitnami/nats/jetstream/jetstream"
[352] 2024/10/13 10:21:06.345660 [INF] -------------------------------------------
[352] 2024/10/13 10:21:06.348814 [INF]   Starting restore for stream '$G > metrics_latest'
[352] 2024/10/13 10:21:06.350048 [WRN] Filestore [metrics_latest] Stream state outdated, last block has additional entries, will rebuild
[352] 2024/10/13 10:21:06.350054 [WRN] Filestore [metrics_latest] Recovering stream state from index errored: prior state file
[352] 2024/10/13 10:21:06.356364 [INF]   Restored 0 messages for stream '$G > metrics_latest' in 8ms
[352] 2024/10/13 10:21:06.356426 [INF] Listening for client connections on 0.0.0.0:4222
[352] 2024/10/13 10:21:06.356498 [ERR] Address "0.0.0.0" can not be resolved properly
[352] 2024/10/13 10:21:06.356504 [INF] Server is ready

A backup of jetstream just before restarting it: https://file.io/fSzkRNDrDXqM

With multiple streams and consumers there are other consequences:

[4602] 2024/10/13 12:00:11.931818 [INF] Starting JetStream
[4602] 2024/10/13 12:00:11.931949 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[4602] 2024/10/13 12:00:11.931971 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[4602] 2024/10/13 12:00:11.931974 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[4602] 2024/10/13 12:00:11.931975 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[4602] 2024/10/13 12:00:11.931977 [INF]
[4602] 2024/10/13 12:00:11.931978 [INF]          https://docs.nats.io/jetstream
[4602] 2024/10/13 12:00:11.931980 [INF]
[4602] 2024/10/13 12:00:11.932000 [INF] ---------------- JETSTREAM ----------------
[4602] 2024/10/13 12:00:11.932023 [INF]   Max Memory:      953.67 MB
[4602] 2024/10/13 12:00:11.932028 [INF]   Max Storage:     14.90 GB
[4602] 2024/10/13 12:00:11.932030 [INF]   Store Directory: "/tmp/nats-test/jetstream"
[4602] 2024/10/13 12:00:11.932032 [INF]   Domain:          XX-system
[4602] 2024/10/13 12:00:11.932034 [INF]   API Level:       1
[4602] 2024/10/13 12:00:11.932056 [INF] -------------------------------------------
[4602] 2024/10/13 12:00:11.932413 [INF]   Starting restore for stream '$G > $MQTT_msgs'
[4602] 2024/10/13 12:00:11.932668 [INF]   Restored 0 messages for stream '$G > $MQTT_msgs' in 0s
[4602] 2024/10/13 12:00:11.932752 [INF]   Starting restore for stream '$G > $MQTT_out'
[4602] 2024/10/13 12:00:11.932948 [INF]   Restored 0 messages for stream '$G > $MQTT_out' in 0s
[4602] 2024/10/13 12:00:11.933021 [INF]   Starting restore for stream '$G > $MQTT_qos2in'
[4602] 2024/10/13 12:00:11.934186 [INF]   Restored 0 messages for stream '$G > $MQTT_qos2in' in 1ms
[4602] 2024/10/13 12:00:11.934259 [INF]   Starting restore for stream '$G > $MQTT_rmsgs'
[4602] 2024/10/13 12:00:11.934429 [INF]   Restored 0 messages for stream '$G > $MQTT_rmsgs' in 0s
[4602] 2024/10/13 12:00:11.934498 [INF]   Starting restore for stream '$G > $MQTT_sess'
[4602] 2024/10/13 12:00:11.934698 [INF]   Restored 74 messages for stream '$G > $MQTT_sess' in 0s
[4602] 2024/10/13 12:00:11.934771 [INF]   Starting restore for stream '$G > metrics_cloud'
[4602] 2024/10/13 12:00:11.934898 [WRN] Filestore [metrics_cloud] Stream state outdated, last block has additional entries, will rebuild
[4602] 2024/10/13 12:00:11.934920 [WRN] Filestore [metrics_cloud] Recovering stream state from index errored: prior state file
[4602] 2024/10/13 12:00:11.938204 [INF]   Restored 0 messages for stream '$G > metrics_cloud' in 3ms
[4602] 2024/10/13 12:00:11.938283 [INF]   Starting restore for stream '$G > metrics_latest'
[4602] 2024/10/13 12:00:11.938416 [WRN] Filestore [metrics_latest] Stream state outdated, last block has additional entries, will rebuild
[4602] 2024/10/13 12:00:11.938438 [WRN] Filestore [metrics_latest] Recovering stream state from index errored: prior state file
[4602] 2024/10/13 12:00:11.948774 [INF]   Restored 0 messages for stream '$G > metrics_latest' in 10ms
[4602] 2024/10/13 12:00:11.948856 [INF]   Recovering 2 consumers for stream - '$G > $MQTT_rmsgs'
[4602] 2024/10/13 12:00:11.948906 [WRN]     Error unmarshalling consumer metafile "/tmp/nats-test/jetstream/$G/streams/$MQTT_rmsgs/obs/$MQTT_rmsgs_NC7S88QeqzzL8xGevQSyfD/meta.inf": unexpected end of JSON input
[4602] 2024/10/13 12:00:11.949450 [INF]   Recovering 1 consumers for stream - '$G > metrics_cloud'
[4602] 2024/10/13 12:00:11.949608 [WRN] Detected consumer '$G > metrics_cloud > XX_supervision' ack floor 16196 is ahead of stream's last sequence 0
[4602] 2024/10/13 12:00:11.949832 [INF] Listening for websocket clients on ws://0.0.0.0:1443
diorcety commented 3 weeks ago

Regarding the last log, the issue seems to come from the fact that metrics_cloud is s2-compressed. I replaced the rebuildState call with the following:

-       ld, tombs, _ := mb.rebuildState()
+       ld, tombs, err := mb.rebuildState()
+       if err != nil {
+               return mb, err
+       }

Then the error is this one: Error recreating stream "metrics_cloud": error reading compression reader: s2: unsupported input (10077)

Not sure how tolerant of partial-write failures the "uncompressed" format is, but that does not seem to be the case for s2.
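To see why a half-written s2 block fails hard while raw data merely comes back short, here is a small sketch using github.com/klauspost/compress/s2 (the s2 implementation the error message points at; the truncation point is arbitrary):

package main

import (
	"fmt"

	"github.com/klauspost/compress/s2"
)

func main() {
	// A block of repetitive data, standing in for a message block's contents.
	raw := make([]byte, 64*1024)
	for i := range raw {
		raw[i] = byte(i % 251)
	}
	compressed := s2.Encode(nil, raw)

	// Simulate a power cut mid-write: only part of the block reached the disk.
	truncated := compressed[:len(compressed)/2]

	if _, err := s2.Decode(nil, truncated); err != nil {
		// Fails outright, comparable to "error reading compression reader: s2: ..."
		fmt.Println("s2 decode failed:", err)
	}

	// A truncated uncompressed block would instead just contain fewer records,
	// which the rebuild logic has a chance to tolerate.
	fmt.Println("truncated raw bytes still usable:", len(raw)/2)
}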

diorcety commented 3 weeks ago

I continued the investigation. The stream has one consumer, which is pulled continuously. What happens:

The problem with "ZZZZ is ahead of stream's last sequence 0" is due to that behaviour. The consumer state "o.dat" is flushed very quickly (open/write/close), which is not the case for the message block, where some parts are written but not flushed: the close happens only when the file is about to be deleted. If the power outage happens between two messages, the stream's data may be lost, but the consumer's state is much less likely to be.
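A contrived Go illustration of that difference: the consumer state is written and closed in one short sequence, while the message block stays open with buffered, unsynced writes (generic code, not the server's filestore):

package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	// Consumer state ("o.dat" style): opened, written, and closed in one short
	// sequence, matching the observation above that it usually survives the cut.
	if err := os.WriteFile("o.dat", []byte("ack floor 14084"), 0600); err != nil {
		log.Fatal(err)
	}

	// Message block: kept open with buffered writes and no Flush/Sync; in the
	// filestore the close effectively happens only when the block is deleted.
	blk, err := os.Create("1.blk")
	if err != nil {
		log.Fatal(err)
	}
	w := bufio.NewWriter(blk)
	if _, err := w.WriteString("tombstone / message records..."); err != nil {
		log.Fatal(err)
	}
	// No w.Flush(), no blk.Sync(): if power is lost here, "o.dat" has the new
	// ack floor while "1.blk" is still empty on disk, so the consumer ends up
	// ahead of the stream.
}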

diorcety commented 3 weeks ago

strace.txt

neilalexander commented 3 weeks ago

@diorcety Can you please confirm whether you are also testing this with sync: always as in the original post? If not, can you please set it and see if you can reproduce?

jetstream {
    sync: always
}

I have a feeling I know what's going on but this should confirm it either way.

diorcety commented 3 weeks ago

@neilalexander Indeed, that "fixes" my issue. Nevertheless, the behaviour without "sync: always" seems "incorrect". Shouldn't the sync timer flush/sync all the states (streams and consumers) simultaneously, with no other syncs outside that timer?

Edit: losing a few messages may be an issue, and in that case sync: always seems to be the way to minimize it. But if it is not, recovering to a good state minus a few messages should be the behaviour.
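A sketch of the suggested behaviour: a single ticker that syncs message blocks, the stream index, and consumer state together (hypothetical structure, not the server's actual sync loop; the interval stands in for the configurable sync setting shown in the original post):

package main

import (
	"log"
	"os"
	"time"
)

// syncAll flushes every open state file in one pass, so message blocks, the
// stream index, and consumer state reach the disk at the same point in time.
func syncAll(files []*os.File) {
	for _, f := range files {
		if err := f.Sync(); err != nil {
			log.Println("sync:", err)
		}
	}
}

func main() {
	// Stand-ins for the real handles (msg blocks, index.db, consumer o.dat).
	blk, _ := os.Create("1.blk")
	idx, _ := os.Create("index.db")
	odat, _ := os.Create("o.dat")
	state := []*os.File{blk, idx, odat}

	// One periodic timer, and no fsync outside of it.
	ticker := time.NewTicker(2 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		syncAll(state)
	}
}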

diorcety commented 3 weeks ago

I made further investigations. The issue mainly appears when the stream is empty.

Status: the stream index is outdated (the last block referenced inside the index has already been processed and removed), and the new block file has been created but is empty. The consumer state contains a sequence ID which may not be the latest, but is newer than the one in the stream index.

The consumer has an ack sequence number, but in this case the only sequence available for the stream is the one saved in the index file, which will often be outdated. There is no way to know the stream's message sequence number, because the new block is empty. It is supposed to contain a tombstone, but tombstones are not synced.

At a sync point you may have correct data for both the stream and the consumer; then: 1) the remaining message is processed, 2) the message block is removed, 3) a new one is created with a tombstone but without a sync. It seems that in this case the filesystem remove is persisted but not the tombstone write: at restart I have only an empty message block. Are inode filesystem operations synced more quickly?

I'm currently experimenting with these modifications:

https://github.com/diorcety/nats-server/commit/62dadc5b7c29a639c6448082c7515dda565a681b Avoid overwriting a file; always create a new one and rename it. This avoids incorrect content, notably when compression is used, where truncation is not just a simple file truncation.

https://github.com/diorcety/nats-server/commit/4b665151c794f60641a8f7487fe546a18ba55205 Always sync tombstones. When the last msg block is removed, a new one is immediately created with a tombstone containing stream information. If that information is lost, due to a power outage for example, we reach the state described in this issue (see the sketch below).

https://github.com/diorcety/nats-server/commit/305cd4e8b04e41e4bbbbf04364d3a42986fce7b2 Sync consumer state at the same time as the other syncs. Before this change, the consumer state was never explicitly synced unless SyncAlways was set.

I could easily reproduce this issue using my platform watchdog (almost every time); with these modifications, there are no more issues after 10 tries. NATS, streams, and consumers are correctly restored.
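As a rough illustration of the "always sync tombstones" change above (simplified; the real block format and code paths differ):

package main

import (
	"encoding/binary"
	"log"
	"os"
)

// writeTombstone creates the fresh message block, records the last stream
// sequence as a (simplified) tombstone record, and syncs it before the caller
// goes on to remove the old block. If power is lost after this returns, the
// stream's last sequence survives the restart.
func writeTombstone(path string, lastSeq uint64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	var rec [8]byte
	binary.LittleEndian.PutUint64(rec[:], lastSeq) // not the real block layout
	if _, err := f.Write(rec[:]); err != nil {
		return err
	}
	return f.Sync() // the key step: do not leave this to the periodic sync
}

func main() {
	if err := writeTombstone("2.blk", 1074913); err != nil {
		log.Fatal(err)
	}
}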

diorcety commented 3 weeks ago

trace.txt: Before 12:48:19.945151, my consumer puller is not started. We can see that the files are synced only during the periodic sync. Then my consumer continuously pulls messages; you can see that after the first write into a new message block file (the tombstone), the file is synced.