[Feature Request] Asynchronous Concurrency Limiting

kevin-lindsay-1 commented 3 years ago

My actions before raising this issue

[x] Followed the troubleshooting guide
[x] Read/searched the docs
[x] Searched past issues

Expected Behaviour

If you set max_inflight on an async function's of-watchdog, you might naturally assume that a queue worker will handle the case when it tries to invoke a function on a pod that has reached its max_inflight, and would therefore send the request to a different pod, if possible.

Current Behaviour

If you use a sync function and set of-watchdog.max_inflight on an of-watchdog pod, and invoke that function above that amount (assuming 1 pod), you will as expected get 429s letting you know that you’ve sent more requests than you’ve configured to be allowed to be worked on per pod.

However, the same behavior exists when using async functions; if you set queue.max_inflight=5 and of-watchdog.max_inflight=1 , if the queue worker attempts to send that request to a pod that already has 1 invocation in progress, it receives the same 429 and forwards it off to the gateway as your official response. I propose that a queue worker would instead retry on 429s, or a similar retry mechanism (as a 429 is a completely valid response for functions at the moment).

Possible Solution

Having a queue worker retry to a different pod (if possible), potentially with incremental backoff.

I’m also open to a completely different mechanism, and willing to implement.

Context

Having a queue worker retry if it attempts to invoke a pod that’s already hit its maximum concurrent invocations would probably be the last technical hurdle for me to get async autoscaling working well for “long-running” functions, and therefore I'd say this issue is related to openfaas/faas#657.

Your Environment

FaaS-CLI version ( Full output from: faas-cli version ): 0.12.21
Docker version docker version (e.g. Docker 17.0.05 ): 20.10.2
Are you using Kubernetes or faasd? kubernetes
Operating System and version (e.g. Linux, Windows, MacOS): MacOS/Linux
Code example or link to GitHub repo or gist to reproduce problem:
Other diagnostic information / logs from troubleshooting guide

alexellis commented 3 years ago

/message: enduser

derek[bot] commented 3 years ago

Thank you for your interest in OpenFaaS. This project is maintained and made available for hobbyists and commercial users alike, so we need to balance our time across everyone's needs. Whilst we are excited in your interest in using OpenFaaS, we would also ask you to take a look at our contribution guide on Setting expectations, support and SLAs.

Commercial users can purchase support in order to get dedicated help from OpenFaaS Ltd, or they can book ad-hoc consulting hours to get an engineer to dedicate time to helping them.

If that is not a good fit for you at this time, please check out the OpenFaaS GitHub Sponsors options which are priced for practitioners like yourself. Organisations can also sponsor through their GitHub billing relationship.

When you become a sponsor as an indvidual, it will show this on your issues and PRs, so that the community can see that you are supporting our work, and can prioritise your needs.

Thank you for supporting OpenFaaS.

alexellis commented 3 years ago

Thanks for the suggestion @kevin-lindsay-1 - we will prioritise your request on the roadmap and have a look into it, in due course.

derekyyyan commented 3 years ago

Hi, just wanted to chime in as I'm also running into this exact scenario.

One thing I noticed is that whatever is load balancing the requests to the function pods does not always send the request to an idle pod. For example, if I have 2 pods of a function with max_inflight=1, and then I invoke the function asynchronously twice, there is a chance that both requests go to the same pod, resulting in a HTTP 429 response for one of the requests.

It maybe worthwhile to implement some kind of monitor for when a function pod is "at capacity", i.e. handling max_inflight number of requests concurrently, and the queue worker can simply use the monitor to wait for a pod to become available, instead of blindly retrying with backoff. (I don't know if this is realistically possible.)

alexellis commented 3 years ago

One thing I noticed is that whatever is load balancing the requests to the function pods does not always send the request to an idle pod

Sounds like you may not have direct_functions set to false, which causes the gateway to use a Kubernetes service instead of the endpoint load balancing logic in faas-netes. See the guide on production use in the docs and upgrade to the latest version of the chart. @derekyyyan

derekyyyan commented 3 years ago

I can set direct_functions explicitly to false, although the docs for the faas-netes helm chart suggests that it already defaults to false.

I'm also not using Linkerd or Istio.

I'll try replicating the problem when I get the chance, but here's my helm chart settings in case you can spot something I'm doing wrong. I'm on version 7.1.0 so should be mostly the same as the latest (7.2.0).

values.yaml

```yaml functionNamespace: openfaas-fn httpProbe: false psp: true ingress: enabled: true annotations: kubernetes.io/ingress.class: nginx-private cert-manager.io/cluster-issuer: letsencrypt-production tls: - hosts: - secretName: openfaas-tls-cert hosts: - host: serviceName: gateway servicePort: 8080 path: / faasnetes: httpProbe: false imagePullPolicy: Always setNonRootUser: true gateway: replicas: 2 queueWorker: replicas: 10 ackWait: 1m faasIdler: create: false ```

alexellis commented 3 years ago

@derekyyyan this is a different topic to the one that Kevin opened, so we are going off piste here and you should raise your own ticket.

However.

Show us how to reproduce your hypothesis that endpoint load-balancing isn't working for you?

One thing I noticed is that whatever is load balancing the requests to the function pods does not always send the request to an idle pod

What makes you think this is the case?

We've done quite extensive testing and see load-balancing taking place aggressively. I'm including output to show it happening as expected. You are welcome to try the same commands and monitor the results.

faas-cli store deploy nodeinfo --label com.openfaas.scale.min=5 -g 192.168.0.26:31112

alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-fjpf2
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-fjpf2
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-8nvll
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-5wt4c
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-47ghl
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-dlzt6
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-8nvll
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-dlzt6
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-47ghl
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-47ghl
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-dlzt6
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-5wt4c
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-5wt4c
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/function/nodeinfo|head -n1
Hostname: nodeinfo-855949ff9f-dlzt6
alex@alex-nuc8:~$

Async:

alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/async-function/nodeinfo -d ""
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/async-function/nodeinfo -d ""
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/async-function/nodeinfo -d ""
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/async-function/nodeinfo -d ""
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/async-function/nodeinfo -d ""
alex@alex-nuc8:~$ curl -s http://192.168.0.26:31112/async-function/nodeinfo -d ""

alex@alex-nuc8:~$ kail -n openfaas-fn
openfaas-fn/nodeinfo-6dbd4cb849-pbsjb[nodeinfo]: 2021/02/24 18:04:53 POST / - 200 OK - ContentLength: 106
openfaas-fn/nodeinfo-6dbd4cb849-z46dn[nodeinfo]: 2021/02/24 18:04:53 POST / - 200 OK - ContentLength: 106
openfaas-fn/nodeinfo-6dbd4cb849-cl2kg[nodeinfo]: 2021/02/24 18:04:53 POST / - 200 OK - ContentLength: 106
openfaas-fn/nodeinfo-6dbd4cb849-pbsjb[nodeinfo]: 2021/02/24 18:04:53 POST / - 200 OK - ContentLength: 106
openfaas-fn/nodeinfo-6dbd4cb849-dfds6[nodeinfo]: 2021/02/24 18:04:53 POST / - 200 OK - ContentLength: 106

alexellis commented 3 years ago

@kevin-lindsay-1 to focus back on your original suggestion.

However, the same behavior exists when using async functions; if you set queue.max_inflight=5 and of-watchdog.max_inflight=1 , if the queue worker attempts to send that request to a pod that already has 1 invocation in progress, it receives the same 429 and forwards it off to the gateway as your official response. I propose that a queue worker would instead retry on 429s, or a similar retry mechanism (as a 429 is a completely valid response for functions at the moment).

A retry has been implemented by at least one user that I know of, and makes for a more natural fit for async functions than synchronous because:

1) The body is small 2) The request is already recorded, so doesn't need to be buffered

My concerns are:

You may never be able to wait long enough to retry, what if the job takes 10 hours?
You will be blocking other executions from running
The retry and "busy wait" sounds like it would be stateful, and that state has to be managed somewhere
Would this open the gate to requests for retryign on other HTTP codes?
What if the 429 is not from the watchdog, but from a Twitter API or another API that will eventually ban you,if you simply do an exponential backoff, and don't use its custom HTTP headers for retrying?

kevin-lindsay-1 commented 3 years ago

@alexellis The way I see it, the fact that you get a 429 at all when using async with concurrency limits in an exposure of implementation detail (which is kind of expected at the moment, because the existing concurrency limiter was clearly designed for sync); when using sync it's part of the contract of sync functions, but when using async the contract is "handle this eventually, and let me know the results".

Getting a 429 on an async function doesn't make sense (unless it's an application error, such as a 429 on an HTTP request inside the function), because it's basically a statement of "i'm busy", to which the entity doing the async invocation would always respond, "yeah, that's why I said eventually.".

Re-queueing an async invocation after a 429 due to concurrency limits would require:

A bunch of boilerplate on every client that invokes a function async
Always setting a callback url
Listening on said callback url for a 429 from OF
Keeping track of requests by ID and their original payloads
Re-invoke the function every time, until it either fails or succeeds (non-429).

That's a lot of boilerplate and workflow accommodation that in my opinion would be better served being part of the OF framework itself, adding another feature to make sure that things "just work".

derekyyyan commented 3 years ago

@alexellis I didn't mean to derail the original request - I'll create a separate issue if I get the chance to reproduce the behavior.

Although if we get the original feature implemented, I won't have to worry about queue workers sending requests to a "busy" pod as they can simply retry later.

alexellis commented 3 years ago

Thanks for sharing your expectations Kevin. When you have some time, I'd appreciate a reply to the questions that I asked and the concerns on retrying.

kevin-lindsay-1 commented 3 years ago

@alexellis Ah, sorry, I wasn't sure if those concerns were around the OF implementation or the user implementation.

You may never be able to wait long enough to retry, what if the job takes 10 hours?

Does ack_wait not already account for this?

You will be blocking other executions from running

I use separate queues for my async functions already because the timing of specific jobs can get fairly difficult to configure with one global queue. A little too much mental overhead for something that's pretty easy to fix with multiple queues, considering how lightweight queue workers are.

The retry and "busy wait" sounds like it would be stateful, and that state has to be managed somewhere

Agreed, however if I recall correctly, the ack_wait implementation handles retry nicely as it is (afaik that uses a memory-based state system), so you could theoretically avoid such additional state in storage or memory by not considering a function "in progress" if you get a 429 from watchdog (or similar contract), and therefore not release the request in memory until you get a 302 or the like from a pod, potentially avoiding creating a bunch of additional state code. In other words, if you put the operation of retrying for concurrency before you actually consider it "queued", you may be able to avoid more expensive state.

If this wasn't implemented in OF I would have to do this myself, and it would likely require the same if not more expensive state, because I don't have the kind of contextual information that something internal to OF would have, so I would have to more-or-less keep track of "everything" so that I can invoke the same function with the exact same headers and payload, as well as cleaning out such a stateful system.

Would this open the gate to requests for retryign on other HTTP codes?

I would say that this is a contract between watchdog and the queue worker, and should therefore be only their collective concern, that is the scope of HTTP codes shouldn't increase unless the flow changes; application-level retry would maybe be a different feature request, but if that one day gets requested and implemented, a distinction between the watchdog and the function/handler itself may need to be made explicit in the HTTP request between the watchdog and queue-worker, if it's not already.

What if the 429 is not from the watchdog, but from a Twitter API or another API that will eventually ban you,if you simply do an exponential backoff, and don't use its custom HTTP headers for retrying?

As far as I'm aware with async, watchdog should return a 202 and if it errors its runtime pod should respond by sending an error response in the context, meaning that said 429 should be treated as the function response, rather than the queue response.

alexellis commented 3 years ago

Feel free to try out this prototype in your cluster - it's valid until 7th March to play with.

# Deploy demo
kind create cluster
arkade install openfaas --max-inflight=5 --set queueWorker.image=alexellis2/queue-worker:retry2

# Deploy test function
faas-cli deploy --name sleep --image ghcr.io/openfaas/sleep:latest \
   --env sleep_duration=5s \
  --env write_timeout=10s \
  --env max_inflight=1 \
  --env combine_output=false \
  --env read_timeout=10s \
  --label com.openfaas.scale.min=1

# Monitor (Terminal 1)
kubectl logs -n openfaas deploy/queue-worker -f|grep -v Received

# Now test with (Terminal 2)
time curl -i http://127.0.0.1:8080/async-function/sleep -d "" && sleep 4 && time curl -i http://127.0.0.1:8080/async-function/sleep -d ""

# View the logs in Terminal 1

You'll see the first request goes in-flight, the second gets retried a few times before being processed.

# Now try again and cause congestion with hey, 10 requests, only 1 can run at once.
hey -c 1 -n 10 -m POST -d "" http://127.0.0.1:8080/async-function/sleep

# To tune the minimum amount of replicas ready, add the the faas-cli deploy:
--label com.openfaas.scale.min=5

# To tune the concurrent connections to each function, update faas-cli deploy:
--env max_inflight=2

kevin-lindsay-1 commented 3 years ago

Will test

kevin-lindsay-1 commented 3 years ago

@alexellis I have tested this, playing with the minimum number of pods, and it seems to work quite well.

The only thing that I'm wondering about is that when the queue's max_inflight and number of pods are the same, I still see logs stating that functions were being retried. Shouldn't that not happen, assuming that the load balancing is balancing to a non-busy pod, and the queue worker has a pod available at all times for each invocation?

Specifically, I set the flag --label com.openfaas.scale.min=5 and used hey -n 100.

alexellis commented 3 years ago

Thanks for trying it with a basic scenario.

OpenFaaS doesn't know anything about individual pods, endpoint IPs, or whether a pod is receiving more or less traffic than has been configured for its watchdog. Endpoints are picked at random. You'll need a service mesh if you want "least connections" as the semantics. Even then, you will see some retries.

Perhaps you can try it with a staging workload now to see how well it processes your specific needs? The max retry limit is 10 attempts with 100 or 500ms as the delay for the backoff.

alexellis commented 3 years ago

I ran a quick test, this is what I got:

Max retries = 10
Backoff delay = 100ms, 200ms, 400ms, 800ms, 1600ms, 3200ms, 6400ms (etc)
Total async requests made = 20

Replicas of function = 5 (min)
Max concurrent in function = 1
Time for workload in function = 5s

Queue-worker processes = 5

Results:
Completed = 20
429s = 82

First message = 08:21:37
Last message = 08:22:36

Expected sync time - (20x5s = 100s)
Actual async time - 59s

derekyyyan commented 3 years ago

@alexellis, thank you for making a prototype so quickly.

I was able to replicate what @kevin-lindsay-1 reported: I set --label com.openfaas.scale.min=5 and queueWorker.maxInflight=5, then made 5 requests, and the queue worker logs showed that it had to retry one of the request 3 times, even though there are enough pods to accommodate all 5 requests.

This behavior is what I was originally referring to, sorry if I failed to explain clearly.

Endpoints are picked at random. You'll need a service mesh if you want "least connections" as the semantics.

I see. I think having a basic retry will inevitably lead to inefficient use of pods, but for my use case this is acceptable, and I can try using service mesh to improve efficiency.

Couple more questions/requests:

Ability to configure the max number of retries, potentially allowing unlimited retries.
Would it make sense to put a cap on the backoff delay to the value of ackWait? Since a function should never run longer than ackWait, it makes sense to me that the queue work shouldn't wait any longer than that for retrying.
I can imagine a situation where one request just keeps hitting a busy pod and increasing the backoff delay, while other requests are being handled, resulting in that one request stuck forever. While I don't need the requests to be handled strictly in order, it would be great if queue workers can prioritize requests that have been pending for the longest amount of time.

alexellis commented 3 years ago

The request was for asynchronous concurrency limiting, that is what the preview / demo does.

The ack time has to be set to:

The (max wait time for the max retries) + the function timeout

So if you have a maximum backoff of 32 seconds, and a maximum function timeout of 2 minutes, then you need to set your ACK time to 2m32s.

Each message is either executed to completion, or finds itself facing a retry, then waits once, before being re-queued.

This means that the ACK time is predictable and bounded.

Retries are to be expected, that was the feature requested here. OpenFaaS does not track connections to pods, that's the job of a service mesh. Even with a service-mesh, you cannot guarantee that a retried message will hit a pod with open capacity, it can however help that request hit a pod with fewer connections.

There are two mitigations for any concern about "a queue worker being clogged up" as such:

The max_inflight for a queue-worker can be set to a larger number like 100 (previously, this involved adding more Pods) - thanks to @matthiashanel for that change
You can have a dedicated queue for slow or concurrency-limited functions

Remember that we are using a very contrived example here, of one function which can only process one request at once.

The 429 messages could be used to trigger an auto-scaling alert, I'm aware of one or two users who do this.

My assumption is that, if the work is completed and the currency is handled in a sane, bounded way, then it is irrelevant whether there is capacity in one of five pods, and a retry still happens. Show us with real data why that matters? Show us what happens when you add a service mesh.

Whatever the workload, some tuning and experimentation will be required. This is a specialist case and I would like to see some real usage from the demo before considering further changes or features.

derekyyyan commented 3 years ago

While I go do some load testing, a couple more clarifying questions:

Each message is either executed to completion, or finds itself facing a retry, then waits once, before being re-queued.

What happens if a message has been retried for the max # of times? Are you saying that it gets put back into the queue? Or does the queue worker give up and sends a 429 as the official response to the callback URL?

The max_inflight for a queue-worker can be set to a larger number like 100

If max_inflight is much larger than the number of available function pods, wouldn't that cause lots of retries? Say, if I have max_inflight=100 and 10 available function pods, wouldn't that mean at any given time, there are 90 messages being retried over and over again? (Assuming the queue has plenty of messages). Maybe that's fine? I'll load test that scenario too.

This is a specialist case and I would like to see some real usage from the demo before considering further changes or features.

I agree that my use case is niche (and perhaps Kevin's too), but I definitely will be using this demo, as the original behavior where the queue worker discards messages when function pods are busy simply doesn't work for me. The extra requests from my previous comment are nice-to-have's, but I could live without those.

kevin-lindsay-1 commented 3 years ago

@derekyyyan

What happens if a message has been retried for the max # of times? Are you saying that it gets put back into the queue? Or does the queue worker give up and sends a 429 as the official response to the callback URL?

I imagine the latter, but it would be nice to know this explicitly as part of what I assume will be in a follow-up PR to the documentation.

If max_inflight is much larger than the number of available function pods, wouldn't that cause lots of retries? Say, if I have max_inflight=100 and 10 available function pods, wouldn't that mean at any given time, there are 90 messages being retried over and over again? (Assuming the queue has plenty of messages). Maybe that's fine? I'll load test that scenario too.

I use a different queue for each async job because of such fine tuning like this (mainly to remove ambiguity), but just looking at your description of the setup I'd say unless you have a pool of other async jobs that are going to be queued at the same time, you probably shouldn't specifically tell the system to effectively overload the workload of the pods. You can likely remove this issue by autoscaling and adding additional pods to handle the workload.

kevin-lindsay-1 commented 3 years ago

@alexellis

The 429 messages could be used to trigger an auto-scaling alert, I'm aware of one or two users who do this.

That actually sounds like a really elegant way to autoscale; I'll need to test that out.

This is a specialist case and I would like to see some real usage from the demo before considering further changes or features.

From the testing I've done this is a major step and I'm happy enough with the current behavior; I'll need to eval it over time and see if I've got any pain points, but as it stands I don't currently have any tangible items that would make me think that this solution isn't currently satisfactory. Can't think of any additional things that this feature needs right now.

As long as this feature exposes tuning parameters, I'm sure I can take it a long way before encountering any new needs.

kevin-lindsay-1 commented 3 years ago

@alexellis

That actually sounds like a really elegant way to autoscale; I'll need to test that out.

One note on that: if 429s from the function itself are indistinguishable (in prometheus) from 429s from watchdog, it could theoretically create application-level bugs, because normally you'd backoff if your functions start getting 429s, rather than increasing the number of pods.

Example:

function/a: watchdog 429s, autoscale up, which handles additional requests as intended
function/b: application 429s, autoscale up, which hammers an external API even though we weren't necessarily at capacity, and causes an increased amount of invocations in our queue to callback with 429.

This can likely be mostly, if not entirely, mitigated by handling 429s in your application, but it might be something that users need to watch out for.

derekyyyan commented 3 years ago

I've done some load testing, all with 100 requests and a sleep duration of 10s, and using Request Bin to record callbacks:

queue.max_inflight	# pods	time elapsed	expected time (# requests / # pods * sleep duration)	% increase from expected time	# Callbacks received
10	1	1014s	1000s	1.4%	76
10	5	343s	200s	71.5%	100
10	10	168s	100s	68%	100
5	1	1326s	1000s	32.6%	89
5	5	276s	200s	38%	100
1	1	1022s	1000s	2.2%	100

When a queue worker reaches the max # of retries, it gives up without sending a 429 to the callback URL, as shown by the examples with # Callbacks received less than 100. I checked the queue worker logs and the number of times it says giving up matches the number of missing callbacks. I think queue workers should return a 429 to the callback, so that users can at least detect when messages are being given up.
It seems that the higher the queue.max_inflight, the less efficient the system becomes. as shown by comparing % increase from expected time against _queue.maxinflight. The first row is an exception, but I suspect that only happened because the queue workers are giving up on messages. With max_inflight=10, it takes ~70% more time to handle all the requests, which translates to ~40% loss of efficiency. While I can live with this, it certainly isn't negligible.

derekyyyan commented 3 years ago

you probably shouldn't specifically tell the system to effectively overload the workload of the pods. You can likely remove this issue by autoscaling and adding additional pods to handle the workload.

I think the ideal solution would be to dynamically scale either queue.max_inflight or the number of queue worker pods along with the function pods. That way the concurrency capacity of both the queue workers and the function would be in sync. (Unless I'm mistaken, this feature doesn't exist yet. Though this is probably worth a separate issue.)

alexellis commented 3 years ago

@derekyyyan you are missing function.max_inflight from your list, and I think you need to test with different durations. Not everyone is going to have a function that always takes exactly 10s to complete, or that only has a fixed number of replicas.

Do you have a use-case for this request yet @derekyyyan, or is this mostly academic? I'm asking because you will need to do some testing, and tuning depending on your specific use-cases, probably at a queue-worker level. Where you quote a "36% decrease in efficiency" - this is a specific example of where tuning would be beneficial.

One replica processing 100 requests with a concurrency level of 5 doesn't make a lot of sense.

The demo that I published doesn't have configuration for the backoff delay, or the maximum number of retries. It also doesn't support dead-letter behaviour, like you mentioned.

alexellis commented 3 years ago

For scaling up, try something like:

        - alert: APIHighRetryRate
          expr: sum(rate(gateway_function_invocation_total{code="429"}[10s])) BY (function_name) > 1
          for: 5s
          labels:
            service: gateway
            severity: major
          annotations:
            description: High retry rate on "{{$labels.function_name}}"
            summary: High retry rate on "{{$labels.function_name}}"

Run kubectl edit -n openfaas configmap/prometheus-config to add it to your config.

This is what it looks like, but beware that the number of 429s will change over time, starting quickly and then backing off as the exponential retry time goes up.

kylos101 commented 3 years ago

Hi,

My exposure to 429 is on the function side - not the watchdog side. When developing a function that uses Twilio, I had a bug that caused the Twilio API to respond with a 429 due to my misuse. Enter my first exposure to 429, and the concept of exponential backoff.

At the time, my function logged the 429 and returned a 500, instead of passing through the 429, which was intentional. I then fixed my bug and never thought about this again - until this discussion.

Now that we're here (I see this mentioned above) the use case I am thinking of is:

Consumer calls function via async
Function encounters 429 from an external source
How do we handle that?

For function retries, It seems like my options are:

Log the 429 and return a 500, no retry will be done, like I am now. This seems crude.
Implement retry logic in my function, for example, Twilio returns a Retry-After header! I don't want to do this for every function, I'm lazy.
Pass the 429 back as the async function's response, and let a service mesh do retries somehow. I wonder if I'll be able to share the Retry-After with LinkerD? We'll see!
Try the code in this branch, to see how OpenFaaS handles functions returning 429 responses.

I am currently leaning towards option 3 to do function retries, using LinkerD to do the retry for me. For example, if my function passes the 429 through using out-of-the-box OpenFaaS (so not the code in this issue), my expectation is that LinkerD will try it again. I can setup this policy to only apply to the async path to the function's service.. More updates to follow.

Also, I'll try to test cold starts in a scale from zero setting. I'm not sure what to expect there...if the pod isn't alive, the watchdog won't be able to respond, right? So in that case, the queue worker I am guessing would return 429...perhaps LinkerD can help there too? We'll see.

kylos101 commented 3 years ago

Just went through the demo more, and better understand how these settings do tuning. I didn't get a chance to do cold start testing, or with LinkerD but will do so later today.

with 1 replica, and max in-flight 1, can handle 10 incoming requests in roughly 1m54s, in 69 attempts
with 5 replicas, and max in-flight 1, can handle 10 incoming requests in roughly 20 seconds, in 40 attempts
with 5 replicas, and max in-flight 2, can handle 10 incoming requests in roughly 8 seconds, in 10 attempts
with 5 replicas, and max in-flight 2, can handle 10 incoming requests in roughly 6 seconds, in 10 attempts
with 5 replicas, and max in-flight 2, can handle 10 incoming requests in roughly 5 seconds, in 10 attempts
with 5 replicas, and max in-flight 2, can handle 10 incoming requests in roughly 5 seconds, in 23 attempts
with 5 replicas, and max in-flight 2, can handle 100 incoming requests in roughly 2m14s-2m15s, in 349-353 attempts

I'm guessing the variance is to due to picking pods at random, instead of free ones? I'll see if the LinkerD tcp connection balancing stuff makes that go away - like you mentioned Alex.

alexellis commented 3 years ago

Thank you for sharing those results. There are quite a few variables here including the average duration of a function execution.

For Linkerd, just make sure that you have disabled endpoint load-balancing so that it can do its thing (direct_functions=true)

kylos101 commented 3 years ago

Yes, totally. Now that I've dug in only a little, I see more knobs to turn and levers to crank...there's a lot!

Appreciate the reminder regarding direct functions.

I'm logging and visualizing using loki stack. I took some screen shots of the exponential back off tapering...it was nice to visualize when there was more load.

On Sat, Mar 6, 2021, 11:37 AM Alex Ellis notifications@github.com wrote:

Thank you for sharing those results. There are quite a few variables here including the average duration of a function execution.

For Linkerd, just make sure that you have disabled endpoint load-balancing so that it can do its thing (direct_functions=true)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/openfaas/nats-queue-worker/issues/105#issuecomment-791985987, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJWSDHGGQJZALXPUPVZRO3TCJK53ANCNFSM4YIAHTDQ .

kylos101 commented 3 years ago

Wow, it's very balanced with --set gateway.directFunctions=true and when using LinkerD.

OpenFaaS is chewing through the queue and the demand is spread out evenly amongst the sleep function pods. Granted, I am using the sleep function, so there's no variance with each request. But still....cool.

Thanks for sharing a demo!

alexellis commented 3 years ago

That's sounds even better than expected. Do you have a similar number to share like your previous test to compare with?

What about some of those screenshots?

kevin-lindsay-1 commented 3 years ago

@alexellis as it pertains to the prometheus rule for 429s, if this behavior is consistent across async and sync functions, is it possible that this could be the default autoscaling rule?

Here are some arguments I might expect against such a change:

Concurrency limits are not required.

To me, it seems that concurrency limits are effectively enforced even if not explicitly provided, by pod CPU/mem limits. I would argue that anyone creating a function should be aware of their function's limits, and since limits have reasonable defaults for CPU and mem, I would argue that concurrency should probably also have a reasonable default, because eventually, a pod is likely to crash if it takes on too many requests, which is kind of just another form of concurrency limiting. Like a lot in OF, it's up to the person writing the function to know how CPU and memory intensive the function is, and as such should probably be expected to know how many invocations it can handle at once before hitting said limits.

I don't want 429s to be returned by watchdog under any circumstances; I don't want to use the concurrency limiting feature.

A potential issue with this approach, to me, is that if you don't limit concurrency, you could crash the pod, which I would argue is worse than having to retry, especially if your function is stateful or non-idempotent. Even though those issues would be caused by lack of following best practices, I would argue that if OF can protect users from such a case, it probably should, because if a developer isn't willing to retry a function, who's to say they're willing to write stateless or idempotent functions?

Additionally, there are likely tricks that could be used to lower the likelihood of receiving a 429 in sync invocations, but I'm not sure that's necessary, as you can auto-retry with most HTTP tools with minimal effort.

As a side effect of returning 429s by default, it would improve visibility, because identifying a 429 is easy, if not automatic, but identifying that an error corresponds to a pod crashing is much more difficult to identify and troubleshoot.

Another likely beneficial side effect might be that having the user be explicit about their concurrency limits will probably also cause them to think about tuning their pod's limits, which might lead users to go for many small pods rather than few large pods when they might not otherwise think about it.

talhof8 commented 3 years ago

Any updates/plans?

I encountered the same scenario. Until then, I’ll fork this repo & re-enqueue 429s.

kevin-lindsay-1 commented 3 years ago

@talhof8 as far as I am aware, resolving this issue would probably require a slightly different set of features, and is only waiting being prioritized (enough demand) & incubated.

kevin-lindsay-1 commented 2 years ago

For scaling up, try something like:

        - alert: APIHighRetryRate
          expr: sum(rate(gateway_function_invocation_total{code="429"}[10s])) BY (function_name) > 1
          for: 5s
          labels:
            service: gateway
            severity: major
          annotations:
            description: High retry rate on "{{$labels.function_name}}"
            summary: High retry rate on "{{$labels.function_name}}"

@alexellis I notice that this is in the helm chart, but it doesn't appear to be configurable. Do you think it would be reasonable to create an issue for this?

Thinking of making a PR for this, but I'd like to at least offer a discussion if you think it would be a good idea to make other things configurable in here.

Perhaps as a separate issue, it might also be nice to be able to choose the alert used for autoscaling, so that you can have 1 function use a certain rule, and other use a different rule, which would have a number of advantages, such as safely testing scaling behavior.

alexellis commented 2 years ago

I would be happy to see an issue. Thanks for commenting.

This thread can probably be closed due to the release of the pro queue-worker https://github.com/openfaas/faas-netes/pull/851

kevin-lindsay-1 commented 2 years ago

This thread can probably be closed due to the release of the pro queue-worker openfaas/faas-netes#851

Agreed

kevin-lindsay-1 commented 2 years ago

@alexellis https://github.com/openfaas/faas-netes/issues/852

alexellis commented 2 years ago

@derekyyyan @talhof8 you may want to take a look at what we did here: https://docs.openfaas.com/openfaas-pro/retries/

openfaas / nats-queue-worker