juan131 opened this issue 1 year ago
We have made many improvements in the upcoming 2.9.16 release in this area. Should be released next week.
In the meantime, you have two general approaches to repairing the raft layer:
1. Identify the affected consumers via /jsz?acc=YOUR_ACCOUNT&consumers=true&raft=true, then issue nats consumer cluster step-down <stream> <consumer>.
2. Scale the stream down to a single replica with nats stream update --replicas=1 <stream>, then scale it back up.
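For example, to inspect the raft state for the first approach via the monitoring endpoint (a sketch assuming the default monitoring port 8222; adjust host/port for your deployment):

curl -s 'http://localhost:8222/jsz?acc=YOUR_ACCOUNT&consumers=true&raft=true'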
Thanks so much for the information @derekcollison !! I hope the new release helps mitigate these issues 🤞🏼
A couple of follow-up questions:
Feel free to test the RC.7 candidate for 2.9.16; it is under Synadia's Docker Hub for nats-server.
In general, for rolling updates we suggest putting a server into lame duck mode; once it has shut down, restart it and wait for /healthz to return ok, then move on to the next server.
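For example (a sketch assuming a pid file and the default monitoring port 8222; the official Helm chart triggers lame duck mode for you via a preStop hook):

nats-server --signal ldm=/var/run/nats-server.pid   # put the server into lame duck mode
# after the server has shut down and been restarted:
curl -s http://localhost:8222/healthz               # repeat until this returns ok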
Thanks @derekcollison ! I'll give the new version a try.
One odd thing that I noticed is that, sometimes, a consumer gets "stuck" when one of the NATS servers is restarted (usually when the restarted server is the leader). I mean, no messages are delivered to its subscribers even though there are millions of messages pending in the stream.
Running nats consumer cluster step-down <stream> <consumer> doesn't help. However, removing the consumer (nats consumer cluster rm <stream> <consumer>), creating it again and recreating the subscriptions does help.
Note: I use explicit-ack push consumers with a queue group of subscribers. The subscriptions are "channel-based", created using the ChanQueueSubscribe function, roughly as sketched below.
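For context, this is roughly how such a subscription is created with nats.go (a minimal sketch; js is an existing nats.JetStreamContext, the nats, log and time packages are imported, and the names mirror the consumer info further down):

msgs := make(chan *nats.Msg, 128)
// Queue subscription bound to the durable push consumer; all members of the
// "FOO-BAR-workers" queue group share the delivered messages.
sub, err := js.ChanQueueSubscribe("FOO.BAR", "FOO-BAR-workers", msgs,
    nats.Durable("FOO-BAR-workers"),
    nats.ManualAck(),             // explicit ack policy
    nats.MaxAckPending(100),      // cap on outstanding unacked messages
    nats.AckWait(10*time.Second),
    nats.MaxDeliver(10),
)
if err != nil {
    log.Fatal(err)
}
defer sub.Unsubscribe()

for msg := range msgs {
    // process msg ...
    msg.Ack()
}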
Do you have any explanation for this behaviour? Is there anything I can do to detect a consumer is "stuck"?
Related to the behaviour I mentioned above: I noticed in the Grafana dashboard I have for monitoring JetStream (based on this dashboard) that, when this happens, the number of messages with ACK pending soars for that consumer.
How can this happen if I'm setting Max Ack Pending: 100 on the consumer?
$ nats consumer info FOO-queue FOO-BAR-workers
Configuration:
Name: FOO-BAR-workers
Delivery Subject: _INBOX.5PwawLU8od7pwjLEwcfTPW
Filter Subject: FOO.BAR
Deliver Policy: All
Deliver Queue Group: FOO-BAR-workers
Ack Policy: Explicit
Ack Wait: 10s
Replay Policy: Instant
Maximum Deliveries: 10
Max Ack Pending: 100
Flow Control: false
Cluster Information:
Name: nats
Leader: nats-0
Replica: nats-1, current, seen 0.00s ago
Replica: nats-2, current, seen 0.00s ago
We would have to triage your specific setup to gain more insight.
You do have the ability to set max ack pending, which we highly recommend doing.
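For example, with the nats CLI the limit can be set when creating the consumer (the value below is illustrative; the CLI prompts for any settings not given as flags):

nats consumer add <stream> <consumer> --max-pending 100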
Hi @derekcollison, we experienced the same issue on our side, but in our case we use 5 replicas for the workqueue stream, and we are on the latest version, 2.9.19.
In addition, we got one message "lost" in the queue during the period when our cluster was unavailable.
So you had log messages about resetting WAL? For which servers? What did stream info report for the stream?
Our WorkQueue stream's leader was moved to a different instance right after that, but messages were lost in the transition.
Meaning messages that were in the work queue were lost on a leader transfer?
How can I make sure of that? I can only tell that the message was successfully sent by our NATS client, and after the leader restart the message was gone from every available replica.
Ok, and the message had not been properly delivered etc., yes?
Could it be that the consumer is running out of its maximum number of deliveries? Since it's a push consumer, there may be no one listening on the delivery subject at the time (when one of the servers is restarted, some client applications get disconnected and need to reconnect to other servers). And the warning messages you are seeing on the server may just be warnings related to the server restarting.
I am saying this because of your consumer info:
Name: FOO-BAR-workers
Delivery Subject: _INBOX.5PwawLU8od7pwjLEwcfTPW
Filter Subject: FOO.BAR
Deliver Policy: All
Deliver Queue Group: FOO-BAR-workers
Is it a push consumer with a queue group and your delivery subject is an _INBOX subject? That would not be the right thing to do.
I would recommend using pull consumers rather than push consumers on those work queue streams.
We are already using pull consumers, since we identified issues with push consumers in the past, as you mentioned:
// Create a pull subscription bound to our durable consumer (s.subCfg.Name).
sub, err = s.dialer.JetStream.PullSubscribe(subject, s.subCfg.Name, s.subCfg.PullSubOpts()...)
if err != nil {
break
}
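A pull subscription like this is then typically drained in batches, for example (a simplified sketch; batch size and wait time are illustrative, and the nats, log and time packages are assumed to be imported):

for {
    msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
    if err == nats.ErrTimeout {
        continue // no messages available right now
    }
    if err != nil {
        log.Printf("fetch error: %v", err)
        continue
    }
    for _, msg := range msgs {
        // process msg ...
        msg.Ack()
    }
}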
We would need help being able to reproduce this, because with an R3 consumer on an R3 stream you should be able to have one server down and still operate. We would need more details about the failure scenario: for example, is it just one of the servers going down and then coming back up later, or more than one server being down? Can it access its store_dir right away? (Each server must have its own independent storage directory and be able to access it.) If there is any chance that more than one pod restarts and their outage times overlap, consider using a cluster of 5+ nodes and R5 for better fault tolerance.
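For reference, the per-server storage location is set in the jetstream block of the server config (a minimal sketch; the path is illustrative and, on k8s, should be backed by a dedicated PVC per pod):

jetstream {
    # Each server must point at its own independent directory.
    store_dir: "/data/jetstream"
}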
The resetting WAL state warning can happen for many reasons when you have servers going up and down and trying to recover their state through RAFT votes, but it should not result in lost messages.
For deployment over k8s we strongly recommend using the Helm chart (https://github.com/nats-io/k8s/releases/tag/nats-1.0.0-rc.1)
We experience the same using the official Helm chart; in k8s we don't see any recent restarts or useful logs either, so I'm not sure how to debug it.
I'm betting, though, that it's a memory/resource limit issue of some sort.
Defect
NATS server logs are filled with warnings such as the one below, and I'm unable to find any documentation about their cause:
Versions of nats-server and affected client libraries used: 2.9.15
OS/Container environment: Linux containers running on a K8s cluster using a 3-replica StatefulSet (pods named nats-0, nats-1 & nats-2). Each replica has its own PVC. NATS servers run in clustered mode with JetStream enabled. The configuration can be found below:
There are 4 different streams configured in the cluster with ~50 subjects on each stream. Streams configuration:
Steps or code to reproduce the issue:
The issue seems to start after one of the NATS servers gets restarted and, once it happens, it doesn't stop (I can see 20K logs like this one in the last 12 hours, for instance).
Expected result:
The system should tolerate the loss of one NATS server according to JetStream documentation given we're using a replication factor of 3.
Actual result:
Some streams are totally unusable when this happens (publishers can't add new messages & subscribers don't receive new messages) while other streams seem to be working as expected.