Closed: sourabhaggrawal closed this issue 1 year ago
All nodes started emitting the WARN below for different streams after restarting all nodes one after another:
```
Wrong index, ae is &{leader:8uhmr4oG term:121 commit:42723 pterm:121 pindex:42888 entries: 1}, index stored was 44229, n.pindex is 42888
```
Does your stream have AllowDirect set to True?
Yes @derekcollison, all streams were created with AllowDirect set to true.
For stream_360 I would do a `nats stream cluster step-down stream_360` and see if that resolves it.

If it does not, you could scale that stream down to 1 replica and back up to 3:

```
nats s update stream_360 --replicas 1 -f
```

and then:

```
nats s update stream_360 --replicas 3 -f
```

If you want to review the operation before executing it, remove the `-f`.
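Since the same error showed up for many streams, the per-stream recovery above could be scripted. A minimal sketch that only builds the command lines (the unconditional three-step sequence is an assumption; the advice above is to try step-down first and scale replicas only if that fails):

```python
# Build the nats CLI recovery sequence for one stream, as suggested above:
# leader step-down first, then scale replicas 3 -> 1 -> 3 if needed.
def recovery_commands(stream: str) -> list[list[str]]:
    return [
        ["nats", "stream", "cluster", "step-down", stream],
        ["nats", "s", "update", stream, "--replicas", "1", "-f"],
        ["nats", "s", "update", stream, "--replicas", "3", "-f"],
    ]

# Print the commands instead of executing them; pass each list to
# subprocess.run(...) to actually run them against a live cluster.
for cmd in recovery_commands("stream_360"):
    print(" ".join(cmd))
```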
Hi @derekcollison
I could try this, but I was getting this error for many streams.
Since then I changed the cluster node type from m5.4xlarge to c5.4xlarge and ran the same setup, and I don't see the issue there. I restarted nodes 2-3 times in the middle of the load and the cluster came back to healthy; there were some logs of stream quorum lost, but they recovered soon after. I also noticed one issue related to the `nats server report jetstream` command, which I have raised here. This works for me for now; I will try your steps on an m5.4xlarge instance again as well.
I wonder if you were saturating the network? Do you monitor the network between nats-servers?
Both m5.4xlarge and c5.4xlarge support up to 10 Gbit/s per the AWS documentation. I am exploring ways to detect network congestion, but from `iftop` on all nodes the transfer and receive rates were each around 150 Mbps, about 300 Mbps in total. That does not look like network congestion to me. This result is from the c5.4xlarge machines, though; I can give the m5.4xlarge machines another try and share the result.
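A quick back-of-the-envelope check of these numbers (a sketch; note the 10 Gbit/s figure is the advertised burst rate, not a measured baseline):

```python
# Compare the iftop readings (~150 Mbps in each direction) against the
# instance's advertised 10 Gbit/s NIC.
nic_gbps = 10.0                    # advertised "up to" bandwidth, Gbit/s
observed_mbps = 150.0 + 150.0      # transfer + receive combined, Mbps

utilization = (observed_mbps / 1000.0) / nic_gbps
print(f"utilization: {utilization:.1%}")  # 3.0% of the advertised line rate
```

As the next comment points out, the advertised "up to" figure is a burst limit, and the sustained baseline on many EC2 instance types is a fraction of it, so low utilization against 10 Gbit/s does not by itself rule out throttling.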
"up to" is mostly a lie, see https://twitter.com/dvassallo/status/1120171727399448576?lang=en and https://cloudonaut.io/ec2-network-performance-cheat-sheet/
Any updates here? Any data on network throughput and utilization while this was being experienced? Is the issue still happening?
Hi @derekcollison, I did not get a chance to run this again. It does not occur on the c5.4xlarge machines, but I am soon going to try again on the m5.4xlarge machines where the issue was noticed.
I would recommend m5n.8xlarge..
Going to close this but feel free to re-open if needed.
Defect
Make sure that these boxes are checked before submitting your issue -- thank you!
`nats-server -DV` output

Versions of `nats-server` and affected client libraries used: 2.9.6
OS/Container environment:
Ubuntu 20.04.4
Steps or code to reproduce the issue:
Our performance test cluster is a 5-node cluster on AWS m5.4xlarge machines, with JetStream max memory configured at 60GB.
We are creating 500 R3 file streams with wildcard subjects and publishing to 320K unique subjects in parallel. We consume messages by calling JetStream's GetMsg. After a while we started getting the error above.
Also, after restarting the same node again, quorum for most streams was lost and the cluster leader was repeatedly getting re-elected.
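To make the reproduction setup concrete, here is a hypothetical sketch of how 320K unique subjects could be laid out across 500 wildcard streams. The subject scheme, stream names, and modulo partitioning are all invented for illustration and are not taken from the actual test code:

```python
# Hypothetical layout: 500 wildcard streams, 320K unique subjects,
# partitioned so each subject belongs to exactly one stream.
NUM_STREAMS = 500
NUM_SUBJECTS = 320_000

def stream_for(subject_id: int) -> str:
    """Pick one of the 500 streams by simple modulo partitioning."""
    return f"stream_{subject_id % NUM_STREAMS}"

def subject_for(subject_id: int) -> str:
    """A unique subject that falls under its stream's wildcard filter,
    assuming stream i is created with the filter 'events.<i>.>'."""
    return f"events.{subject_id % NUM_STREAMS}.{subject_id}"

# With this layout each R3 stream covers 320K / 500 = 640 unique subjects.
per_stream = NUM_SUBJECTS // NUM_STREAMS
print(per_stream)  # 640
```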