openfaas / faas-netes

Serverless Functions For Kubernetes
https://www.openfaas.com

Queue-worker fail to retry function invocations #1021

Closed FTWH closed 2 years ago

FTWH commented 2 years ago

According to the official OpenFaaS docs, when a function's concurrent request limit is exceeded (the function's max_inflight env variable), the function returns a 429 status code, and the queue-worker, rather than dropping the message, simply submits it back to the queue. But in my experiments, the failed requests are never retried.

By the way, I have checked https://www.openfaas.com/blog/limits-and-backpressure/ and https://docs.openfaas.com/reference/async/.
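The behaviour described above can be sketched as a retry loop (purely illustrative; the function names are hypothetical and this is not the actual queue-worker code):

```python
import time

def invoke_with_retries(invoke, payload, max_attempts=10,
                        initial_wait=10.0, max_wait=120.0,
                        retry_codes=(408, 429, 500, 502, 503, 504)):
    """Sketch of the documented requeue-on-429 behaviour: retryable
    status codes cause the message to be re-submitted after a delay."""
    wait = initial_wait
    for attempt in range(1, max_attempts + 1):
        status = invoke(payload)
        if status not in retry_codes:
            return status          # success, or a non-retryable failure
        if attempt < max_attempts:
            time.sleep(wait)       # back off before re-submitting
            wait = min(wait * 2, max_wait)
    return status                  # retries exhausted
```

The parameter defaults mirror the queue-worker env values shown below (max_retry_attempts, initial_retry_wait, max_retry_wait, retry_http_codes).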

Expected Behaviour

I edited the deployment for the queue-worker and set up its environment:

spec:
      containers:
      - env:
        - name: faas_nats_address
          value: nats.openfaas.svc.cluster.local
        - name: faas_nats_channel
          value: faas-request
        - name: faas_nats_queue_group
          value: faas
        - name: faas_gateway_address
          value: gateway.openfaas.svc.cluster.local
        - name: faas_function_suffix
          value: .openfaas-fn.svc.cluster.local
        - name: ack_wait
          value: 60s
        - name: max_inflight
          value: "100"
        - name: max_retry_attempts
          value: "10"
        - name: max_retry_wait
          value: 120s
        - name: initial_retry_wait
          value: 10s
        - name: retry_http_codes
          value: 408,429,500,502,503,504
        - name: print_request_body
          value: "false"
        - name: print_response_body
          value: "false"
        - name: secret_mount_path
          value: /var/secrets/gateway
        - name: basic_auth
          value: "true"
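Assuming the retries use exponential (doubling) backoff, as described in the limits-and-backpressure blog post linked above, the env values here imply the following delay schedule. This helper is my own sketch, not OpenFaaS code, and the doubling policy is an assumption:

```python
def backoff_schedule(initial_wait, max_wait, attempts):
    """Delays between retry attempts, assuming exponential (doubling)
    backoff capped at max_wait -- an assumption, not the exact algorithm."""
    delays, wait = [], initial_wait
    for _ in range(attempts - 1):   # n attempts -> n-1 waits between them
        delays.append(wait)
        wait = min(wait * 2, max_wait)
    return delays

# With the values above (10s initial, 120s cap, 10 attempts):
# backoff_schedule(10, 120, 10)
# -> [10, 20, 40, 80, 120, 120, 120, 120, 120]
```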

I set the concurrent request limit for a function in its YAML like this:

version: 1.0
provider:
  name: openfaas
  gateway: http://192.168.122.11:31112
functions:
  test-intra-parallelism:
    lang: python3-flask
    handler: ./test-intra-parallelism
    image: 192.168.122.11:5000/test-intra-parallelism:latest
    environment:
      max_inflight: 10

When bursty async invocations arrive, the function should handle requests at a concurrency no higher than max_inflight, and all the requests held by the queue-worker should be processed later on.

Current Behaviour

I deployed a simple function that sleeps for 2 seconds and then writes a timestamp to a Redis DB. I generated load with hey:

hey -c 50 -m POST \
 -z 1s -q 1 \
 -H "X-Callback-Url: http://192.168.122.1:8000" \
 $OPENFAAS_URL/async-function/test-intra-parallelism 

The result was:

Summary:
  Total:        1.0243 secs
  Slowest:      0.0153 secs
  Fastest:      0.0108 secs
  Average:      0.0132 secs
  Requests/sec: 48.8128

Response time histogram:
  0.011 [1]     |■■
  0.011 [1]     |■■
  0.012 [2]     |■■■■■
  0.012 [6]     |■■■■■■■■■■■■■■
  0.013 [3]     |■■■■■■■
  0.013 [6]     |■■■■■■■■■■■■■■
  0.014 [5]     |■■■■■■■■■■■■
  0.014 [17]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.014 [7]     |■■■■■■■■■■■■■■■■
  0.015 [1]     |■■
  0.015 [1]     |■■

Latency distribution:
  10% in 0.0118 secs
  25% in 0.0127 secs
  50% in 0.0136 secs
  75% in 0.0139 secs
  90% in 0.0141 secs
  95% in 0.0146 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0044 secs, 0.0108 secs, 0.0153 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:    0.0001 secs, 0.0000 secs, 0.0003 secs
  resp wait:    0.0085 secs, 0.0053 secs, 0.0123 secs
  resp read:    0.0000 secs, 0.0000 secs, 0.0002 secs

Status code distribution:
  [202] 50 responses

But only 10 timestamps were recorded, which means OpenFaaS handled only 10 of the 50 requests.

[Screenshot: Redis showing only 10 recorded timestamps]

OpenFaaS needs to ensure that all submitted asynchronous calls are processed, not silently drop some of them.

Are you a GitHub Sponsor (Yes/No?)

Check at: https://github.com/sponsors/openfaas

Steps to Reproduce (for bugs)

  1. Prepare an OpenFaaS deployment in a Kubernetes cluster.
  2. Deploy a function that simply writes a timestamp into Redis. Use the python3-flask template with of-watchdog.
  3. Add the function env max_inflight=10.
  4. Edit the queue-worker deployment and set env max_inflight=100 (a large number, to avoid the queue-worker itself becoming a bottleneck).
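For step 2, the handler might look like the sketch below. The store parameter is a hypothetical stand-in for a Redis client call (e.g. redis.Redis().rpush), added so the sketch runs without a Redis server:

```python
import time

def handle(req, store=None):
    """Sleep ~2s, then record a timestamp.
    `store` is a hypothetical stand-in for a Redis write such as
    r.rpush("timestamps", ts); injected here for testability."""
    time.sleep(2)                     # simulate 2 seconds of work
    ts = time.time()
    if store is not None:
        store("timestamps", ts)       # real handler: r.rpush("timestamps", ts)
    return str(ts)
```

Counting the entries under "timestamps" after the hey run shows how many invocations actually completed.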

Context

Before deciding to try the paid OpenFaaS Pro service, we need to confirm the stability of the service. The max_inflight limit is critical, because too much intra-function parallelism can crash the container.

I don't think this is a design flaw; it may be that the documentation wasn't clear enough and my configuration is wrong as a result.

Your Environment

alexellis commented 2 years ago

Hi @FTWH, thanks for your interest in OpenFaaS.

As explained in the blog post and the documentation, retries are part of OpenFaaS Pro. There is no bug or issue here, and everything is working as described.

https://docs.openfaas.com/openfaas-pro/retries/

Here's the docs page that you linked to; it's also shown there quite clearly:

[Screenshot: docs page listing retries as an OpenFaaS Pro feature]

> OpenFaaS needs to ensure that all submitted asynchronous calls are processed, not silently drop some of them.

No requests are ignored; check the Prometheus metrics and you'll see the 429 responses recorded there. It's just that the Pro solution retries them; the Community Edition does not, and is not intended for commercial use. You can read a comparison here.

If you'd like to talk to us about OpenFaaS Pro, you can do so here: https://openfaas.com/support/

Alex