ewirch opened this issue 1 year ago (status: Open)
We are aware of this issue and working on completing a fix that will roll out for 2.9.16.
I also encountered the same situation: the issue reliably occurs after a power-off and restart. After restarting, all consumers are unable to recover on their own and fail to receive messages. The related issue is https://github.com/nats-io/nats-server/issues/4566 . I believe this is a server-side problem and should not have to be worked around on the client side by deleting and re-adding consumers.
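For anyone trying to confirm the same broken state from the client side, the simplest check is to compare the stream's last sequence with the consumer's ack floor. Below is a minimal sketch using the Go client (nats.go) — the client choice is an assumption, and the stream/consumer names are the ones from the logs further down:

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Stream state after the restart.
	si, err := js.StreamInfo("TEST_STREAM")
	if err != nil {
		log.Fatal(err)
	}

	// Durable consumer state.
	ci, err := js.ConsumerInfo("TEST_STREAM", "test_consumer")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("stream  first_seq=%d last_seq=%d msgs=%d\n",
		si.State.FirstSeq, si.State.LastSeq, si.State.Msgs)
	fmt.Printf("consumer ack_floor stream_seq=%d\n", ci.AckFloor.Stream)

	// The state described in this issue: the consumer's ack floor points
	// past everything the stream now claims to contain.
	if ci.AckFloor.Stream > si.State.LastSeq {
		fmt.Println("consumer ack floor is ahead of the stream's last sequence")
	}
}
```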
I conducted power-off tests on the following versions, and all exhibited the issue:
I tested various sync time periods, including 'sync: always', but it doesn't seem to have much effect.
```
jetstream {
    store_dir: /userdata/nats
    max_mem: 100M
    max_file: 500M
    sync: always   # 1s to 2min
}
```
The following information will appear in the logs:
```
[INF] Starting restore for stream '$G > TEST_STREAM'
[1471] 2024/05/11 16:11:09.915294 [WRN] Filestore [TEST_STREAM] Stream state detected prior state, could not locate msg block 1244
[1471] 2024/05/11 16:11:09.936155 [INF] Restored 0 messages for stream '$G > TEST_STREAM' in 77ms
```
and then:
```
[WRN] Detected consumer '$G > TEST_STREAM > test_consumer' ack floor 14084 is ahead of stream's last sequence 0
```
It appears that the message block referenced by the index in index.db actually cannot be found on disk.
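To verify that on disk, one can list the stream's message-block files and compare them against the block number from the warning above. The sketch below assumes the default filestore layout under store_dir (jetstream/$G/streams/TEST_STREAM/msgs for the global account); that path is an assumption and may differ per deployment:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Assumed filestore layout for the default ($G) account; adjust the
	// store_dir, account, and stream name for your deployment.
	msgsDir := filepath.Join("/userdata/nats", "jetstream", "$G", "streams", "TEST_STREAM", "msgs")

	entries, err := os.ReadDir(msgsDir)
	if err != nil {
		log.Fatal(err)
	}

	// Print the *.blk files so they can be compared with the block number
	// ("msg block 1244") referenced in the warning.
	for _, e := range entries {
		if filepath.Ext(e.Name()) == ".blk" {
			fmt.Println(e.Name())
		}
	}
}
```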
This issue is beyond my ability to fix quickly. I earnestly seek assistance. @derekcollison @bruth
Can you describe a bit more on the details of your testing?
How do you perform power off? Is it for all servers or just some? How do you restart the servers / cluster?
I tested a single-node NATS JetStream running on a standalone server, and the "restart" was performed by cutting the server's power supply directly. After the restart, the situation described in this issue reproduced almost 100% of the time: the consumer's ack floor is greater than the stream's last sequence, and as a result none of the consumers can consume. I used multiple servers for testing, and tried various file systems (xfs, ext4) and types of disk (nano flash, SSD, mechanical hard disk) to rule out randomness. I can provide more logs and information when I get to work tomorrow.
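In case it helps with reproducing: the workload during these tests was essentially a continuous publish/consume/ack loop against a file-backed stream while the power was cut. The following is an illustrative Go (nats.go) harness sketch, not the exact test code; the stream, consumer, and subject names are assumptions:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// File-backed stream, matching the filestore configuration above.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "TEST_STREAM",
		Subjects: []string{"test.>"},
		Storage:  nats.FileStorage,
	}); err != nil {
		log.Fatal(err)
	}

	// Durable pull consumer that keeps acking; power is cut while both
	// loops below are running.
	sub, err := js.PullSubscribe("test.>", "test_consumer")
	if err != nil {
		log.Fatal(err)
	}

	// Continuous publisher.
	go func() {
		for i := 0; ; i++ {
			if _, err := js.Publish("test.data", []byte(fmt.Sprintf("msg-%d", i))); err != nil {
				log.Println("publish:", err)
			}
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Continuous consumer with acks.
	for {
		msgs, err := sub.Fetch(10, nats.MaxWait(time.Second))
		if err != nil {
			continue // fetch timeouts are expected when idle
		}
		for _, m := range msgs {
			_ = m.Ack()
		}
	}
}
```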
Thanks that is helpful. Will try to recreate and see if we can improve.
Because there is a lot of information, I have reorganized it into a new issue https://github.com/nats-io/nats-server/issues/5412. This situation has occurred frequently in our production environment. I am seeking help, thank you very much.
Are you a Synadia customer?
@stitchcula could you try top of main?
Defect

nats-server -DV output

Sorry, I can't provide an example to reproduce the case. I don't know how this happens. With a little help from you devs I might be able to collect enough information to understand the problem.

We have multiple streams which report a FirstSeq ID that is far behind any existing consumer "Ack Floor". Also, the ID does not exist any more. FirstSeq is 45. Trying to retrieve the message fails:

nats-server -DV output

Versions of nats-server and affected client libraries used:

nats-server: 2.9.15 (it is possible that the situation was created by an older version (2.8), but the update to 2.9.15 didn't solve it either)
OS/Container environment:
Google Kubernetes Engine 1.24.10-gke.1200
Expected result:

FirstSeq is moved to the oldest existing message not yet acknowledged by at least one consumer.

Actual result:

FirstSeq is stale, pointing to a non-existing message.
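For reference, this is roughly how the stale FirstSeq and the failing retrieval can be observed from a client. A sketch using the Go client is shown below; the Go client and the stream name MY_STREAM are placeholders/assumptions, since the report doesn't name either:

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	si, err := js.StreamInfo("MY_STREAM") // placeholder stream name
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("reported FirstSeq:", si.State.FirstSeq)

	// In the state described above, FirstSeq points at a message that no
	// longer exists, so fetching it by sequence fails.
	msg, err := js.GetMsg("MY_STREAM", si.State.FirstSeq)
	if err != nil {
		log.Fatal("get msg: ", err)
	}
	fmt.Printf("seq=%d subject=%s\n", msg.Sequence, msg.Subject)
}
```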