nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

JetStream last sequence mismatch error keeps occurring after every restart/deployment #3644

Closed sourabhaggrawal closed 1 year ago

sourabhaggrawal commented 1 year ago

Defect

Make sure that these boxes are checked before submitting your issue -- thank you!

Versions of nats-server and affected client libraries used:

2.9.6

OS/Container environment:

Ubuntu 20.04.4

Steps or code to reproduce the issue:

Our performance test cluster is a 5-node cluster on AWS m5.4xlarge machines, with JetStream max memory configured as 60GB.

We are creating 500 R3 file streams with wildcard subjects and publishing to 320K unique subjects in parallel. We consume messages by calling JetStream's GetMsg. After a while we started getting the error (see attached screenshot).
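For reference, a minimal sketch (Go, nats.go client) of this kind of setup, scaled down. The stream name, subject layout, counts, and server URL are illustrative assumptions, not the exact test code.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222") // assumption: local cluster endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// One R3 file stream with a wildcard subject (the real test creates 500 of these).
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "stream_1", // hypothetical name
		Subjects: []string{"stream_1.>"},
		Storage:  nats.FileStorage,
		Replicas: 3,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Publish to many unique subjects under the wildcard (320K in the real test).
	for i := 0; i < 1000; i++ {
		if _, err := js.Publish(fmt.Sprintf("stream_1.subj.%d", i), []byte("payload")); err != nil {
			log.Fatal(err)
		}
	}

	// Consume by sequence with GetMsg, as described above.
	msg, err := js.GetMsg("stream_1", 1)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("seq 1: %s -> %s\n", msg.Subject, string(msg.Data))
}
```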

Also, after restarting the same node again, quorum for most streams was lost and the cluster leader was repeatedly re-elected.

sourabhaggrawal commented 1 year ago

All nodes started throwing the WARN below for different streams after restarting all nodes one after another.

Wrong index, ae is &{leader:8uhmr4oG term:121 commit:42723 pterm:121 pindex:42888 entries: 1}, index stored was 44229, n.pindex is 42888

derekcollison commented 1 year ago

Does your stream have AllowDirect set to True?

sourabhaggrawal commented 1 year ago

Yes @derekcollison, all streams were created with allow direct set to true.
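Continuing from the earlier sketch (same js handle), a hedged illustration of what "allow direct true" looks like in the Go client: the AllowDirect stream option plus a direct-get read. The stream name is taken from the thread purely as an example.

```go
func createDirectStream(js nats.JetStreamContext) error {
	_, err := js.AddStream(&nats.StreamConfig{
		Name:        "stream_360", // illustrative name from the thread
		Subjects:    []string{"stream_360.>"},
		Storage:     nats.FileStorage,
		Replicas:    3,
		AllowDirect: true, // any replica may answer direct get requests
	})
	if err != nil {
		return err
	}
	// With AllowDirect enabled, GetMsg can be served as a direct get.
	_, err = js.GetMsg("stream_360", 1, nats.DirectGet())
	return err
}
```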

derekcollison commented 1 year ago

For stream_360 I would do a nats stream cluster step-down stream_360 and see if that resolves it.

If it does not, you could scale that stream down to 1 and back up to 3.

nats s update stream_360 --replicas 1 -f

and then..

nats s update stream_360 --replicas 3 -f

If you want to review the operation before executing, remove the -f.
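A hedged Go equivalent of the two CLI commands above, using the same js handle: fetch the current config, scale the stream's replicas down to 1, then back up to 3 with UpdateStream. All other config fields must match the existing stream, which is why the config is fetched first.

```go
func scaleReplicas(js nats.JetStreamContext, stream string) error {
	info, err := js.StreamInfo(stream)
	if err != nil {
		return err
	}
	cfg := info.Config

	// Scale down to a single replica...
	cfg.Replicas = 1
	if _, err := js.UpdateStream(&cfg); err != nil {
		return err
	}

	// ...and back up to three.
	cfg.Replicas = 3
	_, err = js.UpdateStream(&cfg)
	return err
}
```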

sourabhaggrawal commented 1 year ago

Hi @derekcollison
I could try this, but I was getting this error for many streams.

Now I changed the cluster node type from m5.4xlarge to c5.4xlarge and ran the same setup, and I don't see this issue there. I restarted 2-3 times in the middle of the load and the cluster came back healthy; there were some logs of stream quorum being lost, but they recovered soon after. I noticed one issue related to the "nats server report jetstream" command which I have raised here. This works for me now; I will also try your steps on an m5.4xlarge instance again.

derekcollison commented 1 year ago

I wonder if you were saturating the network? Do you monitor the network between nats-servers?

sourabhaggrawal commented 1 year ago

Both m5.4xlarge and c5.4xlarge nodes support up to 10 Gbit/s as per AWS documentation. I am exploring ways to detect network congestion, but from the iftop command on all nodes I could see transmit and receive rates of around 150 Mbps each, roughly 300 Mbps in total. It does not look like network congestion, I guess. But this result is from the c5.4xlarge machines; I can give it another try on the m5.4xlarge machines and share the result.
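Besides iftop, one way to watch inter-server traffic from NATS itself is to poll the server's HTTP monitoring endpoint and read the per-route byte counters from /routez. A minimal sketch below; it assumes monitoring is enabled on the default port 8222 on localhost.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type routez struct {
	Routes []struct {
		RemoteID string `json:"remote_id"`
		InBytes  int64  `json:"in_bytes"`
		OutBytes int64  `json:"out_bytes"`
	} `json:"routes"`
}

func main() {
	// Assumption: monitoring endpoint reachable at this address.
	resp, err := http.Get("http://127.0.0.1:8222/routez")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var rz routez
	if err := json.NewDecoder(resp.Body).Decode(&rz); err != nil {
		log.Fatal(err)
	}
	for _, r := range rz.Routes {
		fmt.Printf("route %s: in=%d bytes out=%d bytes\n", r.RemoteID, r.InBytes, r.OutBytes)
	}
}
```

Sampling this periodically and diffing the counters gives a rough per-route throughput between nats-servers.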

ripienaar commented 1 year ago

"up to" is mostly a lie, see https://twitter.com/dvassallo/status/1120171727399448576?lang=en and https://cloudonaut.io/ec2-network-performance-cheat-sheet/

derekcollison commented 1 year ago

Any updates here? Any data on network throughput and utilization while this was being experienced? Is the issue still happening?

sourabhaggrawal commented 1 year ago

Hi @derekcollison, I did not get a chance to run this again. It does not occur on the c5.4xlarge machines, but I am going to try this again soon on the m5.4xlarge machines on which the issue was noticed.

derekcollison commented 1 year ago

I would recommend m5n.8xlarge.

derekcollison commented 1 year ago

Going to close this but feel free to re-open if needed.