owncloud / ocis

:atom_symbol: ownCloud Infinite Scale Stack
https://doc.owncloud.com/ocis/next/
Apache License 2.0

Nats: Make MaxAckPending and MaxAckWaitTime configurable #10393

Open kobergj opened 1 month ago

kobergj commented 1 month ago

When putting heavy load on nats receivers, the nats service will possibly redeliver events and therefore duplicate them. This is because we do not configure it correctly.

When pulling events from the queue the expected behaviour is the following: the event is pulled from the queue and delivered into a channel. When it is consumed from this channel, the event is acknowledged on the nats server. On high load this can cause problems. Since MaxAckWaitTime is 30s by default, the nats server will redeliver the event after 30s, leading to two identical events waiting to be picked up from the channel. Since MaxAckPending is set to 1000 by default, this can lead to up to 1000 identical events waiting to be processed.

We already introduced worker groups to be able to handle multiple events at the same time. We now need to make MaxAckWaitTime and MaxAckPending configurable, so they can be configured individually.
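A minimal sketch of what exposing both knobs could look like as nats.go JetStream subscription options; the subject, durable name and concrete values are illustrative, not the actual ocis wiring:

```go
// Minimal sketch only: passing MaxAckPending and AckWait as JetStream
// subscription options via nats.go. Subject, durable name and the values
// are illustrative, not the actual ocis configuration surface.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical values read from the service configuration.
	maxAckPending := 3          // delivered-but-unacked messages the server may keep in flight
	ackWait := 30 * time.Second // how long the server waits for an ack before redelivering

	_, err = js.Subscribe("events.>", func(m *nats.Msg) {
		// ... hand the event over to the worker group ...
		m.Ack()
	},
		nats.Durable("ocis-consumer"),
		nats.ManualAck(),
		nats.MaxAckPending(maxAckPending),
		nats.AckWait(ackWait),
	)
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscriber running
}
```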

Acceptance Criteria:

wkloucek commented 3 weeks ago

should we already add those settings to the NATS deployment example in our oCIS chart? https://github.com/owncloud/ocis-charts/blob/main/deployments/ocis-nats/streams/ocis.yaml exposes all the stream settings already.

Our Kubernetes deployments already could benefit from this a lot!?

kobergj commented 2 weeks ago

Our Kubernetes deployments already could benefit from this a lot!?

Yes it should. If possible please set them.

wkloucek commented 2 weeks ago

@kobergj do you have proposals for MaxAckWaitTime and MaxAckPending that we could set on all / a subset of streams?

Just to interweave our ticket clusters a little bit:

https://github.com/owncloud/ocis/issues/8949 would maybe take some pressure out of high message count situations!? It would also allow more fine-granular redelivery settings. E.g. the postprocessing messages could need a very high MaxAckWaitTime if we had a long-running postprocessing step, while SSE-related events need to be dropped anyway if they are not delivered within < 10 seconds or so!?

micbar commented 2 weeks ago

E.g. the postprocessing messages could need a very high MaxAckWaitTime if we had a long-running postprocessing step, while SSE-related events need to be dropped anyway if they are not delivered within < 10 seconds or so!?

That sounds reasonable. The requirements for the different use cases could be fulfilled better in separate streams.

kobergj commented 2 weeks ago

https://github.com/owncloud/ocis/issues/8949 would maybe take some pressure out of high message count situations!? It would also allow more fine-granular redelivery settings

Not sure about this. If the postprocessing service is busy it doesn't matter how many queues it talks to. Maybe we should use regexes to subscribe only to specific events?

E.g. the postprocessing messages could need a very high MaxAckWaitTime if we had a long-running postprocessing step, while SSE-related events need to be dropped anyway if they are not delivered within < 10 seconds or so!?

No. This is unrelated to the time a step needs to complete. As soon as the message is pulled from the channel, it will be acked on the nats server. The problem is that pending messages get redelivered, so multiple (equal) events end up waiting on the same channel to be picked up (and acked).

The recommendation is to keep MaxAckWaitTime at a high value (30s+) and decrease MaxAckPending to a very low value (3 or lower).

jvillafanez commented 1 week ago

For the short term, I think we can make those options configurable to have better control over the problem, which could also help to debug it by reducing the wait time so the issue happens more often, but I don't think it will fix the issue. I mean, 30 secs should be enough for the postprocessing service (or any other consumer) to get the event and send the ACK, so if that's a problem we could consider that 5 minutes might not be enough either and we'd need a greater value.

The idea I have is to include an ever-increasing counter for each event type. Basically, saying that "this is the event number 123 of the share_created event" for example. This counter must be part of the event information, either as an event property or as part of the event metadata.

Assuming each event type is generated in only one place, having a counter there to know how many events have been sent should be easy. This is basically the only responsibility the event publishers will have (in addition to any other responsibilities that they already have).

From the consumer's side, we need to keep track of the last event of that type we've processed. For example, if we've processed the "share_created" event number 123, we should expect the "share_created" event 124, so any "share_created" event with a counter lower than 124 (including 123) should be considered as processed and can be ignored. This is how we could deal with duplicated events.
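A minimal sketch of that consumer-side bookkeeping; the Event shape and field names are hypothetical, not the ocis event envelope:

```go
// Minimal sketch of the dedup check described above, assuming every event
// carries a per-type, ever-increasing counter (field names are hypothetical).
package dedup

// Event is a simplified stand-in for an event carrying a per-type counter.
type Event struct {
	Type    string
	Counter uint64
}

// Deduper remembers the highest counter seen per event type.
type Deduper struct {
	last map[string]uint64
}

func NewDeduper() *Deduper {
	return &Deduper{last: make(map[string]uint64)}
}

// Seen reports whether the event was already processed (its counter is not
// higher than the last one seen for its type) and records it otherwise.
func (d *Deduper) Seen(e Event) bool {
	if last, ok := d.last[e.Type]; ok && e.Counter <= last {
		return true // duplicate or older event, skip it
	}
	d.last[e.Type] = e.Counter
	return false
}
```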

There are some edge cases to consider for this solution though:

Basically, we'd need to include the following information in the event:


For nats, most of the stuff above is very likely taken care of by nats in some way or another. The question is whether we're fetching the events one by one, so the only event being re-delivered is the last one (waiting in our queue to be ACK'ed). Basically, in case of delays, the expected sequence of events should be 1,2,3,3,3,3,3,4,5,6.... and not something like 1,2,3,3,4,5,3,3,3,6... or 1,2,3,4,5,3,4,5,6..... If it's the first case, keeping the last event in memory to compare to the one we're receiving should be enough to detect the duplication and skip it.

wkloucek commented 1 week ago

Another solution could be switching to the pull client instead of using the push client?
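A minimal sketch of what a pull consumer could look like with nats.go; the subject and durable name are placeholders. With Fetch the client never holds more unacked messages than the batch it explicitly asked for:

```go
// Minimal sketch of a pull consumer; subject and durable name are placeholders.
package events

import (
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func consumePull(js nats.JetStreamContext) {
	sub, err := js.PullSubscribe("events.>", "ocis-pull-consumer",
		nats.AckWait(30*time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Fetch at most 3 messages; MaxWait bounds how long we block when the
		// queue is empty.
		msgs, err := sub.Fetch(3, nats.MaxWait(5*time.Second))
		if err != nil && !errors.Is(err, nats.ErrTimeout) {
			log.Fatal(err)
		}
		for _, m := range msgs {
			// ... process the event ...
			m.Ack()
		}
	}
}
```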

kobergj commented 1 week ago

Let me explain the problem again, I think there is still some misunderstanding.

The default value for MaxAckPending is 1000. That means 1000 events are waiting at a channel to be picked up. However, there are more events already waiting in the queue. That means as soon as one event is picked from the channel, a new one is added. When the consumer now takes too long to get through the events, nats will redeliver pending events after MaxAckWaitTime. This leads to even more events waiting at the channel to be picked up. Ultimately this leads to the consumer being unable to work through all events because they are redelivered too fast.

This is a simple configuration problem. With a low MaxAckPending value nats will not need to redeliver events because the consumer is fast enough to pull all events. MaxAckWaitTime is not really needed as the default is already sane, but we can still offer it.

Putting more work on the consumer will only make the problem worse. If we want to improve performance of the event consumers we should use regex filtering on the event names. Afaik that is done on the nats side and therefore doesn't put pressure on our consumers.
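For reference, a minimal sketch of such server-side filtering: NATS filters by subject wildcards rather than full regexes, so this assumes events were published on per-type subjects such as "events.share.created" (hypothetical naming):

```go
// Minimal sketch only: subject-based filtering happens on the NATS server,
// not in the consumer process. The subject layout is a hypothetical example.
package events

import (
	"log"

	"github.com/nats-io/nats.go"
)

func subscribeShareEvents(js nats.JetStreamContext) {
	// Only share-related events are delivered to this consumer.
	_, err := js.Subscribe("events.share.>", func(m *nats.Msg) {
		// ... process the event ...
		m.Ack()
	},
		nats.Durable("ocis-share-consumer"),
		nats.ManualAck(),
	)
	if err != nil {
		log.Fatal(err)
	}
}
```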

jvillafanez commented 1 week ago

This is a simple configuration problem. With a low MaxAckPending value nats will not need to redeliver events because the consumer is fast enough to pull all events. MaxAckWaitTime is not really needed as the default is already sane, but we can still offer it.

I'm not sure... If we're processing events one by one, is there any benefit to setting the MaxAckPending value higher than 1? It seems the risk of redelivery is higher because there will be more events that have been sent but not yet ack'ed. Even if there is a networking benefit to sending a lot of events together at the same time, it doesn't seem like a good deal if there will be problems with the redelivery of the messages.

I'm worried this will end up being a test in production in order to figure out the right values for the environment, and those values might not be good enough if the system is overloaded. I'm ok exposing that configuration for performance tuning, but I don't think wrong values should break the system.