Closed MauriceVanVeen closed 2 years ago
I am trying to write a server test that reproduces the issue. This error means that the snapshot the node receives has a first sequence lower than the first sequence in its store, so the node needs to clear its state. Suppose the leader's snapshot says the first sequence is 100 and the last is 1000, but the follower that receives this snapshot has a first sequence of 150. The follower then assumes it is missing messages 100 to 149, and because the file store is an append-only message log, it needs to wipe its state and resync with the leader. My suspicion is that, because of the message TTL, there is a race where this snapshot is no longer valid: the follower may have already discarded the "old" messages. So far I have not been able to write a "simple" server test that demonstrates the issue. I will update when I am able to do so.
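The decision described above can be sketched as follows. This is an illustrative simplification only, not the actual nats-server code; the function name `needsReset` is hypothetical:

```go
package main

import "fmt"

// needsReset is a simplified sketch of the decision described above: if the
// leader's snapshot starts before the follower's own first sequence, the
// follower cannot tell what happened to the "missing" messages
// (snapFirst..storeFirst-1), so with an append-only file store its only safe
// option is to wipe its state and resync from the leader.
func needsReset(snapFirst, storeFirst uint64) bool {
	return storeFirst > snapFirst
}

func main() {
	// Leader snapshot: first=100. Follower store: first=150 (e.g. because
	// max-age already expired messages 100-149 locally).
	fmt.Println(needsReset(100, 150)) // true: follower resets its state
	fmt.Println(needsReset(100, 100)) // false: sequences line up
}
```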
Sorry, I posted a link to the wrong issue in PR https://github.com/nats-io/nats-server/issues/3519. Reopening this one.
I'm not sure if this is part of this issue, but it is at least related due to the snapshots. When repeating requests to the `/healthz` endpoint, it flip-flops quite often from `status: ok` to `status: unavailable`, with some stream or consumer not being current. Could this be caused by those snapshots as well, which would then also result in this first sequence mismatch upon restarting?
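For reference, the flip-flopping can be observed with a small poller. This is a hedged sketch: it assumes the default monitoring port 8222 on localhost and that `/healthz` returns a JSON body with a `status` field (e.g. `{"status":"ok"}`); adjust the URL for your deployment:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// healthStatus extracts the "status" field from a /healthz JSON body,
// e.g. {"status":"ok"} or {"status":"unavailable"}.
func healthStatus(body []byte) (string, error) {
	var h struct {
		Status string `json:"status"`
	}
	if err := json.Unmarshal(body, &h); err != nil {
		return "", err
	}
	return h.Status, nil
}

func main() {
	// Assumes the default monitoring endpoint; change host/port as needed.
	for i := 0; i < 10; i++ {
		resp, err := http.Get("http://localhost:8222/healthz")
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			status, _ := healthStatus(body)
			fmt.Printf("%s %d %s\n", time.Now().Format(time.RFC3339), resp.StatusCode, status)
		}
		time.Sleep(time.Second)
	}
}
```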
Have you run `nats traffic` as a system user to watch the stability of the system? If so, what did it report? Did you see lots of vote traffic?
@derekcollison Not too sure how to read the tables, but I haven't seen any Proposal, Vote, or Remove Peer numbers. Here is a copy after running `nats traffic` for a while:
**Raft Traffic**

| Proposal | Vote | Append | Remove Peer | Reply | Total Messages |
|---|---|---|---|---|---|
| 0 | 0 | 9,281 | 0 | 18,563 | 27,844 |
| 0 | 0 | 11,362 | 0 | 22,728 | 34,090 |
| 0 | 0 | 8,647 | 0 | 17,294 | 25,941 |
| 0 | 0 | 10,775 | 0 | 21,544 | 32,319 |
| 0 | 0 | 9,853 | 0 | 19,711 | 29,564 |
| 0 | 0 | 9,080 | 0 | 18,161 | 27,241 |
| 0 | 0 | 7,795 | 0 | 15,590 | 23,385 |
| 0 | 0 | 9,153 | 0 | 18,242 | 27,395 |
| 0 | 0 | 11,755 | 0 | 23,574 | 35,329 |
| 0 | 0 | 10,947 | 0 | 21,893 | 32,840 |
| 0 | 0 | 12,798 | 0 | 25,597 | 38,395 |

**Cluster Traffic**

| JSA Update | Stream Info | Consumer Info | Stream Sync | Reply | Total Messages |
|---|---|---|---|---|---|
| 23 | 0 | 0 | 0 | 0 | 23 |
| 21 | 1 | 0 | 0 | 0 | 22 |
| 24 | 0 | 0 | 0 | 0 | 24 |
| 19 | 1 | 0 | 0 | 0 | 20 |
| 24 | 0 | 0 | 0 | 0 | 24 |
| 21 | 0 | 0 | 0 | 0 | 21 |
| 20 | 1 | 0 | 0 | 0 | 21 |
| 22 | 0 | 0 | 0 | 0 | 22 |
| 24 | 1 | 0 | 0 | 0 | 25 |
| 22 | 0 | 0 | 0 | 0 | 22 |
| 22 | 1 | 0 | 0 | 0 | 23 |

**General Traffic**

| Requests | JS API | JS ACK | System | Rest | Total Bytes | Total Messages |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 3.1 MiB | 0 |
| 0 | 1 | 0 | 9 | 1 | 3.5 MiB | 11 |
| 0 | 0 | 0 | 0 | 0 | 2.4 MiB | 0 |
| 0 | 1 | 0 | 11 | 1 | 3.3 MiB | 13 |
| 0 | 0 | 0 | 0 | 0 | 2.8 MiB | 0 |
| 0 | 0 | 0 | 2 | 0 | 2.6 MiB | 2 |
| 0 | 1 | 0 | 9 | 1 | 2.8 MiB | 11 |
| 0 | 0 | 0 | 0 | 0 | 3.0 MiB | 0 |
| 0 | 1 | 0 | 9 | 1 | 3.5 MiB | 11 |
| 0 | 0 | 0 | 0 | 0 | 3.2 MiB | 0 |
| 0 | 1 | 0 | 9 | 1 | 4.1 MiB | 11 |
How many consumers do you have running?
4 consumers currently
A single stream? How many inbound messages to the streams? The append column looks high.
We currently have three streams, but two of them see most of the use. (The third, not mentioned here, only has about 10-50 messages/s, with the same replication, file storage, and TTL settings as the other two.)
A `nats str report` provides the following (I adapted the stream names a bit so as not to expose the full names):
**Stream Report**

| Stream | Storage | Placement | Consumers | Messages | Bytes | Lost | Deleted | Replicas |
|---|---|---|---|---|---|---|---|---|
| diff | File | | 1 | 3,687,112 | 3.2 GiB | 0 | 0 | nats-0, nats-1*, nats-2 |
| history | File | | 1 | 19,720,560 | 6.8 GiB | 0 | 0 | nats-0, nats-1*, nats-2 |
Some consumer reports:
**Consumer report for diff with 1 consumer**

| Consumer | Mode | Ack Policy | Ack Wait | Ack Pending | Redelivered | Unprocessed | Ack Floor | Cluster |
|---|---|---|---|---|---|---|---|---|
| area-monitor_0 | Push | Explicit | 30.00s | 0 | 0 | 0 | 357,861,029 | nats-0, nats-1*, nats-2 |

**Consumer report for history with 1 consumer**

| Consumer | Mode | Ack Policy | Ack Wait | Ack Pending | Redelivered | Unprocessed | Ack Floor | Cluster |
|---|---|---|---|---|---|---|---|---|
| rabbitmq | Push | Explicit | 30.00s | 99 | 0 | 0 | 2,053,581,370 | nats-0, nats-1, nats-2* |
(This consumer reads messages from NATS, converts them to another format, and forwards them to RabbitMQ, which is what we used prior to moving to NATS.)
Ok, that would explain the higher Append numbers.
FYI: I was able to reproduce the "first sequence mismatch" (resulting in resetting the cluster state) using manual tests. With debugging, I think I understand what is going on, and I will later try to write a unit test that reproduces it. I am not sure there is much we can do to prevent it, but we will discuss internally with @derekcollison to see if there is a way to mitigate it.
Ok, I have a server test case that reproduces the issue. I will discuss with @derekcollison next week to see what the next steps are. Thank you for your patience.
Awesome, thank you! :tada:
I can confirm that we are seeing this issue as well. Thanks for reporting it @MauriceVanVeen and for fixing it @kozlovic.
@jocrau I did not fix (just wrote the server test that demonstrated the issue). @derekcollison is the one who fixed it :-)
@derekcollison @kozlovic We started to deploy the new version, 2.9.4, and although the first sequence mismatch doesn't occur anymore, another issue has taken its place: the cluster is not able to restore itself anymore. It continuously logs the messages below, until the server eventually exits and tries again (infinitely):

```
Catchup for stream 'X > Y' errored, will retry: wrong sequence for skipped msg
Catchup for stream 'X > Y' errored, will retry: expected sequence does not match store
```
Attached are the logs of this particular instance when deploying the new version. We will revert to version 2.9.3 for now. nats-2-logs.txt
We do not suggest reverting to an older version.
Have you tried backing up the stream and restoring? Or scaling down to 1 replica and then scaling back up?
Have you considered giving more resources to the nats-server? Does your monitoring show you how much memory the system is using at steady state and on startup?
Could you share a tree output of the storage directories with us privately? Would it be possible to get access to all the storage directories to possibly try to recreate here?
Have shared the directories privately with Derek. Did manage to get those 3 instances to run 2.9.4 now.
Since upgrading to 2.9.5 it has all been working perfectly, thanks again! (We've already upgraded to 2.9.6, and it's still working like a charm.)
Defect
- `nats-server -DV` output: trace logging was not enabled, since reproducing this issue requires constant messaging (both adds and deletes), which spams the log output. Debug logging is included.
- Versions of `nats-server` and affected client libraries used: `nats:2.9.2-alpine` and `natsio/nats-box:0.13.0` (no change made in the Helm chart)
- OS/Container environment: `c6i.xlarge` machines, clustered with 3 nodes (`r5.large`)
```yaml
nats:
  image: nats:2.9.2-alpine
  logging:
    debug: true
  jetstream:
    enabled: true
  resources:
    requests:
      cpu: 3
      memory: 6Gi
    limits:
      cpu: 4
  # Sets the GOMEMLIMIT environment variable, which makes the Go GC aware of the
  # memory limits from the container. Recommended to be set to about 90% of the
  # resource memory limits.
  gomemlimit: 7GiB
```
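As an aside, the GOMEMLIMIT soft memory limit the chart sets via the environment can also be set programmatically in any Go program (Go 1.19+) with `runtime/debug.SetMemoryLimit`. A minimal sketch of the "about 90% of the container limit" rule of thumb quoted above (illustrative only, not part of the NATS server):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// ninetyPercent returns ~90% of a container memory limit in bytes, the rule
// of thumb quoted above for sizing GOMEMLIMIT.
func ninetyPercent(limitBytes int64) int64 {
	return limitBytes * 9 / 10
}

func main() {
	const gib = int64(1) << 30
	limit := ninetyPercent(8 * gib) // e.g. a hypothetical 8GiB container limit
	// Setting GOMEMLIMIT in the environment is equivalent to calling:
	debug.SetMemoryLimit(limit)
	fmt.Printf("soft memory limit: %d bytes (~%.1f GiB)\n", limit, float64(limit)/float64(gib))
}
```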
```
nats str add benchstream --storage=file --replicas=3 --retention=limits --max-age=1m --max-bytes=-1 --max-consumers=-1 --max-msg-size=-1 --max-msgs=-1 --max-msgs-per-subject=-1 --discard=old --dupe-window=1m --subjects=js.bench --allow-rollup --deny-delete --no-deny-purge
nats bench js.bench --pub=1 --msgs=10000000 --pubsleep=0.00001s --size=1KB --pubbatch=1000
```
```
[7] 2022/10/03 13:43:30.411232 [WRN] Error applying entries to '$G > benchstream': first sequence mismatch
[7] 2022/10/03 13:43:30.414851 [WRN] Resetting stream cluster state for '$G > benchstream'
```