openfaas / nats-queue-worker

Queue-worker for OpenFaaS with NATS Streaming
MIT License

[Research] Retries for certain HTTP codes #100

Open alexellis opened 4 years ago

alexellis commented 4 years ago

This issue is to gather research and opinions on how to tackle retries for certain HTTP codes.

Expected Behaviour

If a function returns certain errors, such as 429 (Too Many Requests, which the function's watchdog can return once max_inflight is reached), then the queue-worker could retry the request a number of times.

Current Behaviour

The failure will be logged, but not retried.

It does seem like retries will be made implicitly if the function invocation takes longer than the "ack window".

So if a function takes 2m to finish and the ack window is 30s, the message will be redelivered and the call retried, possibly indefinitely in our current implementation.

Possible Solution

I'd like to gather some use-cases and requests from users on how they expect this to work.


@matthiashanel also has some suggestions on how the new NATS JetStream project could help with this use-case.

@andeplane recently told me about a custom fork / patch that retries up to 5 times whenever a 429 error is received with an exponential back-off. My concern with an exponential backoff with the current implementation is that it effectively shortens the ack window and could cause undefined results.
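To make the ack window concern concrete, here is a minimal sketch (not the code from the fork mentioned above) that builds an exponential backoff schedule and checks whether the worst case still fits inside the ack window. The function names, the 1s base delay, and the 2s per-attempt timeout are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the wait before each retry: base, 2*base, 4*base, ...
func backoffSchedule(base time.Duration, maxRetries int) []time.Duration {
	delays := make([]time.Duration, 0, maxRetries)
	d := base
	for i := 0; i < maxRetries; i++ {
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// fitsAckWindow reports whether the first attempt, every backoff wait, and
// every retry attempt (each bounded by perAttempt) fit inside ackWait.
func fitsAckWindow(delays []time.Duration, perAttempt, ackWait time.Duration) bool {
	total := perAttempt // the initial attempt
	for _, d := range delays {
		total += d + perAttempt
	}
	return total <= ackWait
}

func main() {
	// Assumed values: 5 retries starting at 1s, 2s per attempt, 30s ack window.
	delays := backoffSchedule(time.Second, 5)
	fmt.Println("schedule:", delays)
	fmt.Println("fits in 30s ack window:", fitsAckWindow(delays, 2*time.Second, 30*time.Second))
}
```

With these assumed numbers the waits alone add up to 1 + 2 + 4 + 8 + 16 = 31s, which already exceeds a 30s ack window before any invocation time is counted, so the message could be redelivered while the original worker is still retrying.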

The Linkerd team also cautions about automatic retries in their documentation, citing the risk of cascading failure: "How Retries Can Go Wrong", "Choosing a maximum number of retry attempts is a guessing game", "Systems configured this way are vulnerable to retry storms".

The team discusses a "retry budget". Should we look into this?
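As a rough illustration of the idea, the sketch below caps retries at a fixed percentage of the original requests seen so far, rather than at a fixed count per request. It is a deliberately simplified model and not Linkerd's implementation: there is no time window or expiry, and the type and method names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// RetryBudget is a hypothetical, simplified budget: the number of retries may
// never exceed percent% of the original requests recorded so far.
type RetryBudget struct {
	mu       sync.Mutex
	percent  int
	requests int
	retries  int
}

func NewRetryBudget(percent int) *RetryBudget {
	return &RetryBudget{percent: percent}
}

// RecordRequest counts one original (non-retry) request.
func (b *RetryBudget) RecordRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.requests++
}

// TryRetry consumes one unit of budget and reports whether a retry is allowed.
func (b *RetryBudget) TryRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if (b.retries+1)*100 > b.percent*b.requests {
		return false
	}
	b.retries++
	return true
}

func main() {
	budget := NewRetryBudget(20) // assumed: retries may add at most 20% extra load
	for i := 0; i < 10; i++ {
		budget.RecordRequest()
	}
	for i := 0; i < 3; i++ {
		fmt.Println("retry allowed:", budget.TryRetry())
	}
}
```

The point of the design is that retry load scales with real traffic, so a burst of 429s cannot multiply the load on an already busy function by the per-request retry count.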

Should individual functions be able to express retry data via an annotation? For example, a backoff of 2, 4, 8 seconds may be valid for processing an image, but retrying a Tweet because Twitter's API has rate-limited us for 4 hours will clearly not work.
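One way per-function retry data could look, sketched under assumptions: the annotation keys below are invented for this example and are not existing OpenFaaS annotations, and `RetryPolicy` and `policyFromAnnotations` are hypothetical names. The queue-worker would fall back to a default policy when a function sets nothing.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// RetryPolicy is a hypothetical per-function retry configuration.
type RetryPolicy struct {
	MaxAttempts int
	MinWait     time.Duration
	MaxWait     time.Duration
}

// Annotation keys invented for this sketch.
const (
	keyAttempts = "example.com/retry-attempts"
	keyMinWait  = "example.com/retry-min-wait"
	keyMaxWait  = "example.com/retry-max-wait"
)

// policyFromAnnotations overlays any retry annotations on top of def.
// On a malformed value it returns def together with an error.
func policyFromAnnotations(annotations map[string]string, def RetryPolicy) (RetryPolicy, error) {
	p := def
	if v, ok := annotations[keyAttempts]; ok {
		n, err := strconv.Atoi(v)
		if err != nil || n < 0 {
			return def, fmt.Errorf("invalid %s: %q", keyAttempts, v)
		}
		p.MaxAttempts = n
	}
	if v, ok := annotations[keyMinWait]; ok {
		d, err := time.ParseDuration(v)
		if err != nil {
			return def, fmt.Errorf("invalid %s: %w", keyMinWait, err)
		}
		p.MinWait = d
	}
	if v, ok := annotations[keyMaxWait]; ok {
		d, err := time.ParseDuration(v)
		if err != nil {
			return def, fmt.Errorf("invalid %s: %w", keyMaxWait, err)
		}
		p.MaxWait = d
	}
	if p.MinWait > p.MaxWait {
		return def, fmt.Errorf("min wait %s exceeds max wait %s", p.MinWait, p.MaxWait)
	}
	return p, nil
}

func main() {
	// An image-processing function that tolerates a 2s, 4s, 8s backoff.
	imageResize := map[string]string{keyAttempts: "3", keyMinWait: "2s", keyMaxWait: "8s"}
	p, err := policyFromAnnotations(imageResize, RetryPolicy{})
	fmt.Println(p, err)
}
```

A function like the Twitter example could set the attempts annotation to "0" to opt out of retries entirely.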

What happens if we cannot retry a function call, as in the Twitter example above? Where does the message go, and how is it persisted? See also the call for a dead-letter queue in #81.

Finally, if we do start retrying, that metadata seems key to operational tuning of the system and to auto-scaling. Should it be exposed via Prometheus metrics and an HTTP /metrics endpoint?

andeplane commented 4 years ago

Interesting! I understand your concerns about the ack window; in our case, we can increase it by the maximum retry time. I'll report back here on our experiences under high load.

alexellis commented 4 years ago

I'd encourage you to read up on the Linkerd caution over arbitrary retries too.

andeplane commented 4 years ago

Will do for sure :) Our use case is that we want to ensure that function pods don't run out of memory, so we limit the number of concurrent function calls with max_inflight. To allow auto-scaling to kick in, we trigger a scale-up on 429s (thanks to @alexellis, who designed this) and allow the queue-workers to retry a few times, so that calls end up being executed either on new pods or on the busy ones once they have finished one of their calls.

matthiashanel commented 4 years ago

When OpenFaaS switches to JetStream, dynamically extending ack_wait becomes an option. That is, while the request to the function is still outstanding, we request an extension every ack_wait/2. This way ack_wait wouldn't have to be set to the "correct value", and a reasonable default could suffice.
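The extension loop described here could be sketched roughly as follows. The `extend` callback is a stand-in for whatever call resets the redelivery timer (with the nats.go JetStream client that would presumably be the message's `InProgress` method, but wiring that in is an assumption), and `invokeWithExtension` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"
)

// invokeWithExtension runs invoke and, while it is still outstanding, calls
// extend every ackWait/2. ackWait must be at least 2ns so the ticker interval
// is positive.
func invokeWithExtension(ackWait time.Duration, extend func() error, invoke func() error) error {
	done := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		ticker := time.NewTicker(ackWait / 2)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				if err := extend(); err != nil {
					log.Printf("extending ack deadline failed: %v", err)
				}
			}
		}
	}()

	err := invoke()
	close(done) // stop extending as soon as the function call returns
	wg.Wait()
	return err
}

func main() {
	// Assumed values: a 200ms ack window and a function that takes 250ms.
	extend := func() error { fmt.Println("extending ack deadline"); return nil }
	invoke := func() error { time.Sleep(250 * time.Millisecond); return nil }
	fmt.Println("invoke error:", invokeWithExtension(200*time.Millisecond, extend, invoke))
}
```

Because the extension stops when the invocation returns, a separate overall timeout on `invoke` would still be needed so that a hung function cannot hold a message forever.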

When looking into this, I noticed there's a choice to make when the function returns an error code or the invocation fails outright with an error: either send an ack, or withhold it and wait for redelivery by JetStream. In JetStream this is possible because flow control is decoupled from acking.

If I have a list of status codes after which a retry is desired, I can add this together with JetStream support. The same applies to the callback: what if the function returns OK, but the callback invocation does not?

In JetStream, the number of redelivery attempts can also be limited, which would go hand in hand with this.
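A minimal sketch of how a status-code list and a delivery limit might combine into one decision per message. The names are hypothetical, only 429 comes from this discussion (503 is added as an assumed example), and the delivery count is passed in as a plain integer rather than read from any client library.

```go
package main

import (
	"fmt"
	"net/http"
)

// Decision is what the worker does with a message after invoking the function.
type Decision string

const (
	Ack       Decision = "ack"       // done, successful or not retryable
	Redeliver Decision = "redeliver" // withhold the ack and let the broker redeliver
	GiveUp    Decision = "give-up"   // retryable, but the delivery limit is reached
)

// retryable lists the status codes after which a retry is desired.
var retryable = map[int]bool{
	http.StatusTooManyRequests:    true, // 429, from the discussion above
	http.StatusServiceUnavailable: true, // 503, an assumed example
}

// decide maps a function's HTTP status and the message's delivery count
// (1 for the first delivery) to a Decision.
func decide(status, delivered, maxDeliver int) Decision {
	if !retryable[status] {
		return Ack
	}
	if delivered >= maxDeliver {
		return GiveUp
	}
	return Redeliver
}

func main() {
	fmt.Println(decide(http.StatusOK, 1, 5))
	fmt.Println(decide(http.StatusTooManyRequests, 1, 5))
	fmt.Println(decide(http.StatusTooManyRequests, 5, 5))
}
```

The give-up branch is where a dead-letter queue, as requested in #81, could take over.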