nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

Limit number of redelivery attempts #728

Closed · robertmircea closed this issue 4 years ago

robertmircea commented 5 years ago

What would be the best strategy to limit the number of redeliveries of the same message to subscribers that are failing to ack?

For example: if a message is not acked after several delivery attempts, I want it to be redelivered up to x times and then dropped.

Derived question: Is it possible to keep a counter on server of how many times a message has been redelivered to client?

kozlovic commented 5 years ago

What would be the best strategy to limit the number of redeliveries of the same message to subscribers that are failing to ack?

ACK it... and I'm not being sarcastic ;-). Really, if your application keeps getting a redelivered message and is unable to acknowledge it, say because processing of it keeps failing, and the application really decides that it does not want to deal with it, simply ACK it. The server will then stop redelivering it.

Messages are not removed from the message log when consumers ack them. Messages always remain in the log, since they may have to be delivered to new consumers starting at any sequence in the log: https://github.com/nats-io/nats-streaming-server#message-log

The only advantage of having such a configuration in the server would be to limit the redelivery of a message to an application that has exited without calling Close() (so the server does not know it is gone). But with pings from server to client, the server will ultimately detect that the client is not responding and will suspend redelivery. The default heartbeat interval is 30 seconds; if that's too high, it can be lowered through configuration.
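
For reference, a minimal stan.go sketch of this "just ack it" strategy, using manual ack mode so the subscriber controls redelivery. The cluster/client IDs, channel name, and process function are placeholders:

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func process(data []byte) error { return nil } // stand-in for real handling

func main() {
	sc, err := stan.Connect("test-cluster", "my-client") // placeholder IDs
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// With manual ack mode, the server redelivers any message not acked
	// within AckWait. Acking a message, even one we failed to process,
	// stops redelivery for this subscription.
	_, err = sc.Subscribe("orders", func(m *stan.Msg) {
		if err := process(m.Data); err != nil {
			log.Printf("giving up on seq %d: %v", m.Sequence, err)
		}
		m.Ack() // ack either way: the server stops redelivering
	}, stan.SetManualAckMode(), stan.AckWait(30*time.Second))
	if err != nil {
		log.Fatal(err)
	}
	select {} // keep the subscriber running
}
```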

robertmircea commented 5 years ago

I understand what you are saying, but this means the app needs to keep track of messages for a period of time in order to detect repeated delivery attempts (even failed ones) up to a maximum. My first thought was to keep the message id in Redis for a while for idempotency checks. This is not very elegant, and it only scales up to a point. A better mechanism, in my opinion, would be for the server to keep track of the number of redeliveries, and for either the server or the client to have a way to decide whether more redeliveries are allowed. This would imply having an explicit NACK command, or allowing the client to instruct the server to deliver based on a specific policy. Probably what I need here are the properties of a queuing system in the NATS server.

dansouza commented 5 years ago

@robertmircea if you're getting the same message redelivered over and over, chances are, a single subscriber will see it multiple times - so you don't need to create a counter on Redis for every single message you receive, only the ones that you're seeing twice (meaning they're getting redelivered in a loop).

Keep an in-memory record of the messages you've seen using a cuckoo filter, and if you see a message twice, start counting its redeliveries in Redis. This should scale a lot better since it won't hit Redis every time, only when you suspect that a message is getting redelivered.

Once you hit enough redelivery attempts, you ACK the message (to remove it from the server) and delete it from the cuckoo filter and from Redis.

It won't perfectly track the number of redeliveries if you have many subscribers (since each subscriber will have its own counter that needs to see a message twice), but it should be good enough and scale well.
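
A minimal sketch of this approach, assuming the github.com/seiflotfy/cuckoofilter and github.com/go-redis/redis packages; the channel name, client IDs, maxRedeliveries limit, and process function are placeholders:

```go
package main

import (
	"context"
	"log"
	"strconv"

	"github.com/go-redis/redis/v8"
	stan "github.com/nats-io/stan.go"
	cuckoo "github.com/seiflotfy/cuckoofilter"
)

const maxRedeliveries = 5 // hypothetical limit

func process(data []byte) error { return nil } // stand-in for real handling

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	seen := cuckoo.NewFilter(1_000_000) // in-memory filter of sequences seen once

	sc, err := stan.Connect("test-cluster", "dedup-client") // placeholder IDs
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	_, err = sc.Subscribe("orders", func(m *stan.Msg) {
		id := strconv.FormatUint(m.Sequence, 10)
		key := "redeliveries:" + id

		if seen.Lookup([]byte(id)) {
			// Second sighting: the message is looping, count it in Redis.
			n, rerr := rdb.Incr(ctx, key).Result()
			if rerr == nil && n >= maxRedeliveries {
				m.Ack() // give up so the server stops redelivering
				seen.Delete([]byte(id))
				rdb.Del(ctx, key)
				return
			}
		} else {
			seen.Insert([]byte(id))
		}

		if err := process(m.Data); err == nil {
			m.Ack()
			seen.Delete([]byte(id))
			rdb.Del(ctx, key)
		}
	}, stan.SetManualAckMode())
	if err != nil {
		log.Fatal(err)
	}
	select {} // keep the subscriber running
}
```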

vtolstov commented 5 years ago

If I have an invalid message, for example one whose payload was created by a buggy service, consumers can never successfully process and ack this message. What is the preferred way to handle it?

kozlovic commented 5 years ago

Again, from a consumer's perspective, if you want to stop the server from redelivering this message, simply ack it. When the message callback is invoked and, say, your application cannot process the message, you can still call msg.Ack() on it. What you can't do is remove a "random" message from the message log.

vtolstov commented 5 years ago

My question is more from a devops perspective. For example, I see in Prometheus that some service has a large error counter in its subscribe handler. What can I do? I can't rewrite the service right now, so what is a workaround to temporarily fix the issue while the service is rewritten and deployed?

robertmircea commented 5 years ago

Again, from a consumer's perspective, if you want to stop the server from redelivering this message, simply ack it. When the message callback is invoked and, say, your application cannot process the message, you can still call msg.Ack() on it. What you can't do is remove a "random" message from the message log.

I understand this, but this strategy means that on the first failed message processing I should remove (ack) the message, no matter whether the error is transient or permanent. Usually, a transient error means I should retry processing a number of times before giving up. The NATS server could help in this situation by keeping the number of retries instead of a boolean flag like Redelivered. It is very easy to deduce whether a delivery is a redelivery by inspecting the retry count: if it is greater than zero, it was a redelivery. In my case, if I want to retry up to n times, it might be more efficient to have an explicit NACK protocol command instead of waiting for the ack timeout; for example, if the handler throws an exception, it would be more effective to fail fast. Once the redelivery counter reached n, I would explicitly ack the message from the client so the server stops redelivering it.

vtolstov commented 5 years ago

Yes. Is it possible to count delivery attempts per durable group?

vtolstov commented 5 years ago

I could check the attempt count and take some client-side action, for example serialize the event to a store and ack the message.

sujeshthekkepatt commented 5 years ago

Same need. Any updates on this?

kozlovic commented 5 years ago

@sujeshthekkepatt I commented on the other issue related to this: https://github.com/nats-io/nats-streaming-server/issues/789

bmcustodio commented 4 years ago

The NATS server could help in this situation by keeping the number of retries instead of a boolean flag like Redelivered.

@kozlovic while I understand that limiting the number of redelivery attempts or implementing any kind of dead-letter queue is out of scope, would it be feasible for NATS Streaming to provide a counter of redeliveries as suggested here (possibly as a new field rather than replacing the existing Redelivered one, for compatibility with existing consumers)?

sujeshthekkepatt commented 4 years ago

@bmcstdio that would be great. I have done a workaround using the same concept, where we manually add some metadata, like the number of retries, and use those fields for pushing to multi-stage poison queues and later to a dead-letter queue. But as you said, having this built in would be perfect. I think they are working on a new revised version of Stan.
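
One way to read this workaround as code: a hypothetical JSON envelope carries the retry count with the payload, the original message is always acked, and failures are republished to hypothetical orders.retry / orders.dlq channels:

```go
package main

import (
	"encoding/json"
	"log"

	stan "github.com/nats-io/stan.go"
)

// envelope is a hypothetical wrapper carrying a retry count with the payload.
type envelope struct {
	Retries int    `json:"retries"`
	Payload []byte `json:"payload"`
}

const maxRetries = 5 // hypothetical limit

func process(data []byte) error { return nil } // stand-in for real handling

// handle always acks the incoming message and republishes failures to
// staged channels, carrying the retry count in the payload itself.
func handle(sc stan.Conn, m *stan.Msg) {
	var env envelope
	if err := json.Unmarshal(m.Data, &env); err != nil {
		m.Ack() // malformed: drop it
		return
	}
	if err := process(env.Payload); err != nil {
		env.Retries++
		data, _ := json.Marshal(env)
		if env.Retries >= maxRetries {
			sc.Publish("orders.dlq", data) // hypothetical dead-letter channel
		} else {
			sc.Publish("orders.retry", data) // hypothetical retry channel
		}
	}
	m.Ack() // ack the original either way so it is not redelivered
}

func main() {
	sc, err := stan.Connect("test-cluster", "poison-client") // placeholder IDs
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	_, err = sc.Subscribe("orders", func(m *stan.Msg) { handle(sc, m) },
		stan.SetManualAckMode())
	if err != nil {
		log.Fatal(err)
	}
	select {} // keep the subscriber running
}
```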

kozlovic commented 4 years ago

@bmcstdio Possibly. The caveat is that this number would likely be "valid" only during the runtime of a server. Meaning that I don't think that it would be feasible to persist the delivery count (but maybe?). Also, in clustering mode, when leader changes, it may not have the redelivery count (again, if that info is not stored/replicated). In this worst case scenario (valid only at runtime/leader election), would that still be valuable?

bmcustodio commented 4 years ago

@kozlovic I think it would still be valuable, yes, but of course the absolute best would be to persist that information. I am not familiar at all with the NATS/NATS Streaming code base, but I'd be willing to contribute to this somehow (either implementing, reviewing or testing).

kell18 commented 4 years ago

Redeliveries counter would be nice to have for us as well! Even if only during the runtime (not persistent). Thanks!

kozlovic commented 4 years ago

@bmcstdio @sujeshthekkepatt @kell18 Ok, I have decided to add RedeliveryCount (runtime-only at the moment, and it won't survive a leader election if the new leader is a different node). Starting with the client's repo that holds the PubMsg protobuf: https://github.com/nats-io/stan.go/pull/295.
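
Once that lands, a subscriber could cap redeliveries in the handler itself. A sketch assuming the field is exposed as RedeliveryCount on the received message, per the linked PR, with placeholder names and limit:

```go
package main

import (
	"log"

	stan "github.com/nats-io/stan.go"
)

const maxRedeliveries = 5 // hypothetical limit

func process(data []byte) error { return nil } // stand-in for real handling

func main() {
	sc, err := stan.Connect("test-cluster", "capped-client") // placeholder IDs
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	_, err = sc.Subscribe("orders", func(m *stan.Msg) {
		// RedeliveryCount is runtime-only: it resets on server restart and
		// may reset on leader election, so treat the cap as best-effort.
		if m.RedeliveryCount >= maxRedeliveries {
			m.Ack() // give up: stop further redeliveries
			return
		}
		if err := process(m.Data); err == nil {
			m.Ack()
		}
	}, stan.SetManualAckMode())
	if err != nil {
		log.Fatal(err)
	}
	select {} // keep the subscriber running
}
```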

kozlovic commented 4 years ago

Server PR will be submitted shortly after we agree on the field name/type in the client repo. Thanks!

kozlovic commented 4 years ago

@bmcstdio @sujeshthekkepatt @kell18 A question for you guys: the redelivery count is per subscription, but in the case of a queue group, would you expect the redelivery count to apply to the whole group, or to an individual member?

bmcustodio commented 4 years ago

@kozlovic I'd expect it to be for the group.

vtolstov commented 4 years ago

For group.

robertmircea commented 4 years ago

For group for sure!

kozlovic commented 4 years ago

Ok, I hear you, but in case you did not notice: with clustering, as of now a message is already redelivered to the same member (subject to change in the future), so this will not make much of a difference :-)

kozlovic commented 4 years ago

Guys... closing this one in favor of #996