
Message Delivery - the Leader skips the messages #1198

Open Xswer opened 3 years ago

Xswer commented 3 years ago

We are experiencing strange NATS behaviour:

1. The publisher client publishes a message to NATS without any errors.
2. The consumer client cannot receive the message, as if it was never sent. The consumer uses a durable queue group subscription.
3. Earlier and later messages are published and consumed as usual.
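For illustration, here is a minimal sketch of that setup with the Go stan.go client; the cluster ID, client ID, URL, subject, group, and durable names are placeholders, not values from our deployment:

```go
package main

import (
	"log"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Cluster ID, client ID, URL, subject, and group names are placeholders.
	sc, err := stan.Connect("test-cluster", "example-client",
		stan.NatsURL("nats://localhost:4222"))
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// Durable queue group subscription: members share the load and the
	// durable state survives client restarts.
	_, err = sc.QueueSubscribe("orders", "workers", func(m *stan.Msg) {
		log.Printf("received seq=%d: %s", m.Sequence, m.Data)
	}, stan.DurableName("orders-durable"))
	if err != nil {
		log.Fatal(err)
	}

	// Synchronous publish: returns an error if the server does not
	// acknowledge the message.
	if err := sc.Publish("orders", []byte("hello")); err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscriber running
}
```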

What helps: creating a new queue group, which then receives the messages normally.

The problem consistently appears on at least 2 channels. We assume that the problem could be caused by corrupt Raft logs. Could that be the cause?

Why we came to this assumption: we had a misconfigured redeploy pipeline that allowed blue/green deployment, which means that two instances could reference the same file store for a short period of time.

Could you please tell us whether simultaneous access to the Raft log by two instances could cause Raft log corruption? And could a corrupted Raft log cause the described behaviour, where newly published messages are not delivered to clients?

Thank you in advance!

kozlovic commented 3 years ago

Could you please tell us whether simultaneous access to the Raft log by two instances could cause Raft log corruption?

Definitely.

Could a corrupted Raft log cause the described behaviour?

Not sure, but once you have solved your deploy issue and this situation (Raft log corruption) is guaranteed not to exist (by the way, you probably need to wipe all your state, since we don't know what got corrupted), do you still experience this issue? If not, that is a good indication that yes, the corruption did cause this behavior.

But it could also be something as simple as the queue group being stalled, that is, each member of the group has received its maximum number of inflight messages and is not properly ack'ing them, or the acks are lost for some reason (network issues, etc.).

Creating a new group would then work because, well, it is a different group, so the server would deliver messages to this new group. But if it runs the same application code and, say, there is a problem with not ack'ing messages, then this new group would start to behave like the old one.

It could also be that messages were delivered but did not reach the app (again, this could be the network, or an issue with a client library that does not set pending limits correctly, causing messages to be dropped by NATS because of those limits). In that case the messages should be redelivered, but that is based on the member's AckWait, so check on that. If the AckWait is very high, it will take a long time before a message is redelivered.
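For illustration, here is a minimal stan.go sketch of the subscription options involved in the stall/redelivery scenario above; the subject, group, durable name, and the process helper are placeholders:

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

// process is a stand-in for the application's message handling.
func process(data []byte) error { return nil }

func main() {
	sc, err := stan.Connect("test-cluster", "worker-1")
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	_, err = sc.QueueSubscribe("orders", "workers", func(m *stan.Msg) {
		if process(m.Data) != nil {
			// Not acking: the message is redelivered after AckWait expires.
			return
		}
		// In manual ack mode the ack must be explicit; a member that never
		// acks accumulates MaxInflight unacked messages and then stalls.
		m.Ack()
	},
		stan.DurableName("orders-durable"),
		stan.SetManualAckMode(),      // explicit acks
		stan.MaxInflight(64),         // per-member unacked message ceiling
		stan.AckWait(30*time.Second), // redelivery timer for unacked messages
	)
	if err != nil {
		log.Fatal(err)
	}

	select {}
}
```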

I know that you are using monitoring, so you should check the "stalled" state of those queue members. If they are not stalled and you still experience issues, then that is more of an indication of log corruption.
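For example, assuming the monitoring port is enabled on localhost:8222, the stalled state shows up in the /streaming/channelsz?subs=1 output and can be checked with a small Go program (the struct below only decodes the fields needed here; confirm the field names against your server version):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Minimal view of the /streaming/channelsz?subs=1 response.
type channelsz struct {
	Channels []struct {
		Name          string `json:"name"`
		Subscriptions []struct {
			ClientID     string `json:"client_id"`
			QueueName    string `json:"queue_name"`
			IsStalled    bool   `json:"is_stalled"`
			PendingCount int    `json:"pending_count"`
		} `json:"subscriptions"`
	} `json:"channels"`
}

func main() {
	// Assumes the monitoring endpoint is reachable on localhost:8222.
	resp, err := http.Get("http://localhost:8222/streaming/channelsz?subs=1")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var cz channelsz
	if err := json.NewDecoder(resp.Body).Decode(&cz); err != nil {
		log.Fatal(err)
	}
	for _, ch := range cz.Channels {
		for _, sub := range ch.Subscriptions {
			fmt.Printf("channel=%s client=%s queue=%s stalled=%v pending=%d\n",
				ch.Name, sub.ClientID, sub.QueueName, sub.IsStalled, sub.PendingCount)
		}
	}
}
```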

Xswer commented 3 years ago

@kozlovic thank you for your response and recommendations. The subscriptions were not stalled, so messages should have been processed as usual, based on what I have seen in monitoring. AckWait is configured to 30s, so that should not cause such behaviour either.

At this point we would like to try to wipe the existing state. Could you please recommend an approach that would allow wiping the state without downtime for the cluster and dependent systems? I assume this could be achieved by redeploying the peers with a lower store_limits.max_age parameter (currently 336h, temporarily lowered to e.g. 3h), but again, that could lead to differences in the Raft logs between the peers.
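For reference, in the standalone nats-streaming-server configuration file such a change would look roughly like this (the value is only an example):

```
store_limits: {
    # Messages older than this are removed; lowering it from 336h to 3h
    # would age out most of the existing channel data.
    max_age: "3h"
}
```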

kozlovic commented 3 years ago

At this point we would like to try to wipe the existing state. Could you please recommend an approach that would allow wiping the state without downtime for the cluster and dependent systems?

That may not be possible. If there is a suspicion that the RAFT log is corrupted (due to two nodes writing to the same file), any attempt to fix this may result in replicating the bad state to new nodes.

That being said, and since you are familiar with adding/removing nodes, here is what I would try, assuming that you have a cluster of size 3, with 2 nodes that have been using the same RAFT log you want to wipe out.

Note that when shutting down nodes, or even adding some, a leadership election can occur. As a matter of fact, a leader election can happen at any time, say due to a network slowdown or an increase in traffic that delays RAFT heartbeats, etc. When an election occurs, in-flight publishes may fail, so even the above steps do not guarantee the "without downtime" expectation.
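For illustration, a publisher can guard against that with a simple retry around the synchronous publish. This sketch assumes the Go stan.go client; the retry count and backoff are arbitrary, not a recommendation:

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

// publishWithRetry retries a synchronous publish a few times, since an
// in-flight publish can fail (for example, time out waiting for the ack)
// while a leader election is in progress.
func publishWithRetry(sc stan.Conn, subject string, data []byte) error {
	var err error
	for attempt := 1; attempt <= 5; attempt++ {
		if err = sc.Publish(subject, data); err == nil {
			return nil
		}
		// Back off a little longer after each failed attempt.
		time.Sleep(time.Duration(attempt) * 500 * time.Millisecond)
	}
	return err
}

func main() {
	// Cluster ID, client ID, and subject are placeholders.
	sc, err := stan.Connect("test-cluster", "publisher-1")
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	if err := publishWithRetry(sc, "orders", []byte("hello")); err != nil {
		log.Printf("publish failed after retries: %v", err)
	}
}
```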