openfaas / faas

OpenFaaS - Serverless Functions Made Simple
https://www.openfaas.com
MIT License
24.75k stars 1.92k forks source link

Differentiate status codes from functions vs OpenFaaS components #1792

Open alexellis opened 1 year ago

alexellis commented 1 year ago

Description

Differentiate status codes from functions vs OpenFaaS components

Why do you need this?

When invoking functions or attempting to retry them, it's currently hard to differentiate between an error caused by Kubernetes (i.e. node eviction during an invocation) or by a function (i.e. a 429 because the Twitter API that was being used is overloaded)

Who is this for?

@kevin-lindsay-1 for Surge has requested this - but Waylay also wanted this for their integration cc @OcamsRazor

Expected Behaviour

A way to determine whether a 500 error was from the gateway / watchdog or from the function

Current Behaviour

There are some hints depending on the message body and headers, however no consistency right now.

List All Possible Solutions and Workarounds

  1. Add a header when handling an error condition in any of the OpenFaaS components
  2. Add a header when the watchdog receives a HTTP response from a function

Which Solution Do You Recommend?

I recommend 1 - because 2 depends on the use of the watchdog, which is not used by all users.

For 1 - the gateway needs a change since it can invoke functions directly. The provider also needs a change for when direct_functions is set to false and invocations flow through it instead. The watchdog should also have a change so that if it handles an error that can be passed back up the stream.

Headers do support multiple values for a key, i.e. X-OpenFaaS-Source: [watchdog, gateway]

For 1 - the watchdog does not need to be updated, and even when it's not in use, this header would still propagate and flow.

Then there's a wider conversation about how the queue-worker should retry errors when the "X-OpenFaaS-Source" header is present - assuming that these need to be retried due to an error during scaling - or node eviction.

kevin-lindsay-1 commented 1 year ago

I think that options 1 and 2 should both be included, and that option 2 should be opt-in for the sake of backwards compatibility.

Here's a truth table for describing the cases using only option 1:

source x-openfaas-component
infrastructure/function
gateway gateway
queue worker queue-worker
watchdog watchdog
multiple queue-worker; gateway

As you can see, in this approach you don't know if the response is coming from infrastructure (non-openfaas, non-function. k8s, istio, etc) or from the function.

Here's another table using only option 2:

source x-openfaas-component
infrastructure/gateway/queue worker/watchdog
function function

This option only lets you know if the response came from a function. All cases other than the function are effectively unknown as to what the source was.

Here's a final table using both:

source x-openfaas-component
infrastructure
gateway gateway
queue worker queue-worker
watchdog watchdog
function function
multiple queue-worker; gateway

This approach allows you know the most; it allows you to know if the response came from inside openfaas, the user's function, or some other unknown/infrastructural source.


As for whether or not option 1 should be "only when an OF component has an error", I would like to clarify that I don't think that's specifically necessary to call out, as the source of a response should effectively be invisible to the user unless there's an error. In other words, there's no specific reason why the x-openfaas-component cannot be set when the response is not an error, but in normal operating procedure, the other components are not responding, they are proxying, and therefore the response itself isn't coming from them.

If it matters, you could have a header stating every OF component that "touched" this request along the way, but that's more of a debug feature in my opinion.

alexellis commented 1 year ago

Summarised from Kevin on weekly call:

Responses coming from OpenFaaS components should be retried according to a default retry policy - i.e. defined in Helm chart.

Responses not coming from OpenFaaS components would be retried according to the configured policy on the component i.e. queue-worker / connector.

Responses coming from function would use the retry configuration set through annotations on the function. If no annotations are set the function would not be retried. This functionality should be opt-in and can be turned on via a parameter in the chart. This would need the watchdog update.