
Message Delivery - the Leader skips the messages #1198

Open Xswer opened 3 years ago

Xswer commented 3 years ago

We are experiencing strange NATS behaviour:

1. The publisher client publishes a message to NATS without any errors.
2. The consumer client cannot receive the message, as if it was never sent. The consumer uses a durable queue group subscription.
3. Earlier and later messages are published and consumed as usual.
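For illustration, here is a minimal sketch of that setup with the Go stan.go client; the cluster ID, client ID, URL, subject, group, and durable names are placeholders, not values from our deployment:

```go
package main

import (
	"log"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Cluster ID, client ID, URL, subject, and group names are placeholders.
	sc, err := stan.Connect("test-cluster", "example-client",
		stan.NatsURL("nats://localhost:4222"))
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// Durable queue group subscription: members share the load and the
	// durable state survives client restarts.
	_, err = sc.QueueSubscribe("orders", "workers", func(m *stan.Msg) {
		log.Printf("received seq=%d: %s", m.Sequence, m.Data)
	}, stan.DurableName("orders-durable"))
	if err != nil {
		log.Fatal(err)
	}

	// Synchronous publish: returns an error if the server does not
	// acknowledge the message.
	if err := sc.Publish("orders", []byte("hello")); err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscriber running
}
```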

What helps: creating a new queue group, which then receives the messages normally.

The problem consistently appears on at least 2 channels. We assume that the problem could be caused by corrupt Raft logs. Could that be the cause?

Why we came to this assumption: we had a misconfigured redeploy pipeline that allowed blue/green deployment, which means that two instances could reference the same file store for a short period of time.

Could you please tell us whether simultaneous access to the Raft log by two instances could cause Raft log corruption? And could a corrupted Raft log cause the described behaviour, where newly published messages are not delivered to clients?

Thank you in advance!

kozlovic commented 3 years ago

Could you please tell us whether simultaneous access to the Raft log by two instances could cause Raft log corruption?

Definitely.

Could a corrupted Raft log cause the described behaviour?

Not sure, but once you have solved your deploy issue and this situation (Raft log corruption) is guaranteed not to exist (by the way, you probably need to wipe all your state, since we don't know what got corrupted), do you still experience this issue? If not, that is a good indication that yes, the corruption did cause this behavior.

But it could also be something as simple as the queue group being stalled, that is, each member of the group has received its maximum number of inflight messages and is not properly ack'ing them, or the acks are lost for some reason (network issues, etc.).

Creating a new group would then work because, well, it is a different group, so the server would deliver messages to this new group. But if it runs the same application code and, say, there is a problem with not ack'ing messages, then this new group would start to behave like the old one.

It could also be that messages were delivered but did not reach the app (again, this could be the network, or an issue with a client library that does not set pending limits correctly, causing messages to be dropped by NATS because of those limits). In that case the messages should be redelivered, but that is based on the member's AckWait, so check on that. If the AckWait is very high, it will take a long time before a message is redelivered.
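For illustration, here is a minimal stan.go sketch of the subscription options involved in the stall/redelivery scenario above; the subject, group, durable name, and the process helper are placeholders:

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

// process is a stand-in for the application's message handling.
func process(data []byte) error { return nil }

func main() {
	sc, err := stan.Connect("test-cluster", "worker-1")
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	_, err = sc.QueueSubscribe("orders", "workers", func(m *stan.Msg) {
		if process(m.Data) != nil {
			// Not acking: the message is redelivered after AckWait expires.
			return
		}
		// In manual ack mode the ack must be explicit; a member that never
		// acks accumulates MaxInflight unacked messages and then stalls.
		m.Ack()
	},
		stan.DurableName("orders-durable"),
		stan.SetManualAckMode(),      // explicit acks
		stan.MaxInflight(64),         // per-member unacked message ceiling
		stan.AckWait(30*time.Second), // redelivery timer for unacked messages
	)
	if err != nil {
		log.Fatal(err)
	}

	select {}
}
```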

I know that you are using monitoring, so you should check the "stalled" state of those queue members. If they are not stalled and you still experience issues, then that is more of an indication of log corruption.
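For example, assuming the monitoring port is enabled on localhost:8222, the stalled state shows up in the /streaming/channelsz?subs=1 output and can be checked with a small Go program (the struct below only decodes the fields needed here; confirm the field names against your server version):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Minimal view of the /streaming/channelsz?subs=1 response.
type channelsz struct {
	Channels []struct {
		Name          string `json:"name"`
		Subscriptions []struct {
			ClientID     string `json:"client_id"`
			QueueName    string `json:"queue_name"`
			IsStalled    bool   `json:"is_stalled"`
			PendingCount int    `json:"pending_count"`
		} `json:"subscriptions"`
	} `json:"channels"`
}

func main() {
	// Assumes the monitoring endpoint is reachable on localhost:8222.
	resp, err := http.Get("http://localhost:8222/streaming/channelsz?subs=1")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var cz channelsz
	if err := json.NewDecoder(resp.Body).Decode(&cz); err != nil {
		log.Fatal(err)
	}
	for _, ch := range cz.Channels {
		for _, sub := range ch.Subscriptions {
			fmt.Printf("channel=%s client=%s queue=%s stalled=%v pending=%d\n",
				ch.Name, sub.ClientID, sub.QueueName, sub.IsStalled, sub.PendingCount)
		}
	}
}
```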

Xswer commented 3 years ago

@kozlovic thank you for your response and recommendations. The subscriptions were not stalled, so messages should have been processed as usual, based on what I have seen in monitoring. AckWait is configured to 30s, so that should not cause such behaviour either.

At this point we would like to try to wipe the existing state. Could you please recommend an approach that would allow wiping the state without downtime for the cluster and dependent systems? I assume this could be achieved by redeploying the peers with a lower store_limits.max_age parameter (currently 336h, temporarily lowered to e.g. 3h), but again, that could lead to differences in the Raft logs between the peers.
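For reference, in the standalone nats-streaming-server configuration file such a change would look roughly like this (the value is only an example):

```
store_limits: {
    # Messages older than this are removed; lowering it from 336h to 3h
    # would age out most of the existing channel data.
    max_age: "3h"
}
```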

kozlovic commented 3 years ago

At this point we would like to try to wipe the existing state. Could you please recommend an approach that would allow wiping the state without downtime for the cluster and dependent systems?

That may not be possible. If there is a suspicion that the RAFT log is corrupted (due to two nodes writing to the same file), any attempt to fix this may result in replicating the bad state to new nodes.

That being said, and since you are familiar with adding/removing nodes, here is what I would try, assuming that you have a cluster of size 3, with 2 nodes that have been using the same RAFT log you want to wipe out.

Note that when shutting down nodes, or even adding some, a leadership election can occur. As a matter of fact, a leader election can happen at any time, say due to a network slowdown or an increase in traffic that delays RAFT heartbeats, etc. When an election occurs, in-flight publishes may fail, so even the above steps do not guarantee the "without downtime" expectation.
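For illustration, a publisher can guard against that with a simple retry around the synchronous publish. This sketch assumes the Go stan.go client; the retry count and backoff are arbitrary, not a recommendation:

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

// publishWithRetry retries a synchronous publish a few times, since an
// in-flight publish can fail (for example, time out waiting for the ack)
// while a leader election is in progress.
func publishWithRetry(sc stan.Conn, subject string, data []byte) error {
	var err error
	for attempt := 1; attempt <= 5; attempt++ {
		if err = sc.Publish(subject, data); err == nil {
			return nil
		}
		// Back off a little longer after each failed attempt.
		time.Sleep(time.Duration(attempt) * 500 * time.Millisecond)
	}
	return err
}

func main() {
	// Cluster ID, client ID, and subject are placeholders.
	sc, err := stan.Connect("test-cluster", "publisher-1")
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	if err := publishWithRetry(sc, "orders", []byte("hello")); err != nil {
		log.Printf("publish failed after retries: %v", err)
	}
}
```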