In a previous conversation @alexellis and I discussed some items related to the queue worker, one of which was to verify whether the queue worker applies ack_wait to multiple functions using one "global" setting, or on a per-function basis.
Expected Behaviour
When discussing a single queue worker listening to multiple functions at the same time, we outlined the behaviour we would prefer: a single queue worker that autoscales to meet demand, rather than a static replica count with different wait times per queue.
Given the following:
1 queue worker with an ack_wait of 3m15s and max_inflight of 2
1 function sleep1 with duration of 1m and write_timeout of 1m5s
1 function sleep2 with duration of 3m and write_timeout of 3m5s
We assume a Kubernetes environment, or an environment with a similar orchestration layer and pattern to Kubernetes, and we assume the event triggering the pod's termination is a graceful shutdown command, such as a Node draining for maintenance and its workloads being rescheduled onto a different Node.
Expected events with rough timing; the values in the format [duration] are approximate offsets from the start of this example timeline.
queue worker is subscribed to channel(s) [0s]
sleep1 is invoked via gateway and sent to nats [0s]
sleep2 is invoked via gateway and sent to nats [0s]
queue worker receives a message from nats for sleep1 [0s]
queue worker receives a message from nats for sleep2 [0s]
queue worker begins function invocation for sleep1 call [0s]
queue worker begins function invocation for sleep2 call [0s]
queue worker receives SIGTERM (via drain), a new queue worker is scheduled to replace it [5s]
we assume that graceful shutdown does not occur here, either because it's not currently implemented, or because something unexpected happens. Why it doesn't ack is out of scope for this issue, but we can assume it is sent a SIGKILL for this example.
new queue worker comes online, subscribes [7s]
sleep1 invocation completes, is not acknowledged [1m]
new queue worker receives a message from nats for sleep1 [1m5s]
new queue worker invokes sleep1 [1m5s]
sleep1 invocation completes, is handled by queue worker [2m5s]
sleep2 invocation completes, is not acknowledged [3m]
new queue worker receives a message from nats for sleep2 [3m5s]
new queue worker invokes sleep2 [3m5s]
sleep2 invocation completes, is handled by queue worker [6m5s]
Current Behaviour
An example of this timing with the same settings and format as above; functional (non-timing) differences are noted inline:
queue worker is subscribed to channel (note: only 1 channel) [0s]
sleep1 is invoked via gateway and sent to nats [0s]
sleep2 is invoked via gateway and sent to nats [0s]
queue worker receives a message from nats for sleep1 [0s]
queue worker receives a message from nats for sleep2 [0s]
queue worker begins function invocation for sleep1 call [0s]
queue worker begins function invocation for sleep2 call [0s]
queue worker receives SIGTERM (via drain), a new queue worker is scheduled to replace it [5s]
we assume that graceful shutdown does not occur here, either because it's not currently implemented, or because something unexpected happens. Why it doesn't ack is out of scope for this issue, but we can assume it is sent a SIGKILL for this example.
new queue worker comes online, subscribes [7s]
sleep1 invocation completes, is not acknowledged [1m]
sleep2 invocation completes, is not acknowledged [3m]
note that in the previous example the second invocation of sleep1 had already completed and been handled by this point
new queue worker receives a message from nats for sleep1 [3m15s]
new queue worker receives a message from nats for sleep2 [3m15s]
new queue worker invokes sleep1 [3m15s]
new queue worker invokes sleep2 [3m15s]
sleep1 invocation completes, is handled by queue worker [4m15s]
sleep2 invocation completes, is handled by queue worker [6m15s]
The major differences from the above:
sleep1's result takes 4m15s to finally come through, vs 2m5s in the previous example
sleep2's result takes 6m15s to finally come through, vs 6m5s in the previous example
in the previous example, ack_wait for the queue worker itself becomes functionally irrelevant (graceful shutdown is a different issue)
Possible Solution
As we had discussed previously, it would likely be advantageous to have a different ack_wait per function, and instead have a single queue worker with no ack_wait of its own that only knows about a graceful shutdown duration, which the user would configure in advance based on the ack_wait of their environment's longest-running function.
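For illustration, a minimal sketch in Go of a queue worker that only knows a graceful shutdown duration, assuming the duration comes from configuration and that each subscription wraps invocations in a wait group; the names here are illustrative, not the current queue-worker code:

```go
// Minimal sketch: the worker has no ack_wait of its own; it only waits up to
// a configured grace period for in-flight invocations on SIGTERM.
package main

import (
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// The grace period would come from configuration and should exceed the
	// ack_wait of the longest-running function (3m15s in the example above).
	grace := 3*time.Minute + 15*time.Second

	var inflight sync.WaitGroup
	// ... subscriptions would call inflight.Add(1) / inflight.Done() around
	// each function invocation ...

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	<-sigs

	done := make(chan struct{})
	go func() {
		inflight.Wait() // wait for in-flight invocations to complete and ack
		close(done)
	}()

	select {
	case <-done:
		log.Println("all in-flight invocations acked, exiting")
	case <-time.After(grace):
		log.Println("grace period elapsed, exiting anyway")
	}
}
```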
The difference in this implementation would likely be to have multiple subscriptions with different AckWait periods in the queue worker, which may require more channels, rather than the current implementation, which only listens to 1 channel (based on what I see in the environment variables, specifically the variable faas_nats_channel).
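A rough sketch of what per-function AckWait subscriptions could look like with the NATS Streaming Go client (stan.go); the channel names, durations, queue group and the invoke helper are assumptions for illustration, not the actual queue-worker implementation:

```go
// Sketch: one queue worker holding one subscription per function, each with
// its own AckWait derived from that function's write_timeout.
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

// invoke is a placeholder for the HTTP call to the function via the gateway.
func invoke(body []byte) { _ = body }

func main() {
	sc, err := stan.Connect("faas-cluster", "faas-worker-1",
		stan.NatsURL("nats://nats:4222"))
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// Per-function ack_wait, e.g. write_timeout plus a small buffer.
	ackWaits := map[string]time.Duration{
		"faas-request.sleep1": 1*time.Minute + 5*time.Second,
		"faas-request.sleep2": 3*time.Minute + 5*time.Second,
	}

	for channel, ackWait := range ackWaits {
		_, err := sc.QueueSubscribe(channel, "faas", func(msg *stan.Msg) {
			invoke(msg.Data)
			msg.Ack() // only ack once the invocation has completed
		},
			stan.AckWait(ackWait),
			stan.MaxInflight(2),
			stan.DurableName("faas"),
			stan.SetManualAckMode(),
		)
		if err != nil {
			log.Fatal(err)
		}
	}

	select {} // block forever; graceful shutdown handling omitted here
}
```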
The issue
The big wrench in this discussion is the max_inflight for a particular function's queue. For example, let's say I have a function with a concurrency limit of 100 (watchdog.max_inflight) and a maximum pod count of 10 (com.openfaas.scale.max). From those values, you can presume that that function's queue should not have more than 1000 invocations (queueWorker.max_inflight) in flight at once, because otherwise you'd be sending invocations to a function that could not handle them because all of its pods are busy.
The questions that occur to me, which effectively prevent this solution from working as expected, are:
how does a queue worker know how many maximum in-flight invocations it should be able to send to a function?
I would say that this could be calculated by watchdog.max_inflight * com.openfaas.scale.max. The queue worker would then potentially not need its own max_inflight, and could instead be autoscaled based on CPU/memory or a custom metric.
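A trivial sketch of that calculation; the variable names simply mirror the annotation/label names above and are not an existing API:

```go
// Sketch: derive a per-function in-flight budget from the function's own
// limits instead of a queue-worker-wide max_inflight.
package main

import "fmt"

func main() {
	watchdogMaxInflight := 100 // watchdog.max_inflight on the function
	scaleMax := 10             // com.openfaas.scale.max label on the function

	// Upper bound on concurrent invocations the queue should attempt for this
	// function across all queue-worker replicas.
	fmt.Println(watchdogMaxInflight * scaleMax) // 1000
}
```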
how does a queue worker know how many are already in-flight by other replicas?
I would say that this doesn't have to be perfectly immediate between pods, and you could potentially accomplish this with an external lookup (metrics or some such); as long as it's able to prevent an endless flood of 429s, it should be fine with not being immediate (at least for the pro queue worker, which can retry on 429s from pods).
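Purely as an illustration of the kind of external lookup I mean: a replica could ask Prometheus for a hypothetical in-flight gauge summed across all replicas before dispatching. The metric name queue_worker_inflight and the Prometheus address below are assumptions, not existing metrics:

```go
// Illustrative sketch: query Prometheus for the number of invocations of a
// function currently in flight across all queue-worker replicas.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

func inflightFor(function string) (int, error) {
	// queue_worker_inflight is a hypothetical gauge each replica would export.
	q := fmt.Sprintf(`sum(queue_worker_inflight{function_name=%q})`, function)
	resp, err := http.Get("http://prometheus:9090/api/v1/query?query=" + url.QueryEscape(q))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var body struct {
		Data struct {
			Result []struct {
				Value [2]interface{} `json:"value"` // [timestamp, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return 0, err
	}
	if len(body.Data.Result) == 0 {
		return 0, nil // no samples yet: assume nothing in flight
	}
	value, _ := body.Data.Result[0].Value[1].(string)
	return strconv.Atoi(value)
}

func main() {
	n, err := inflightFor("sleep1")
	fmt.Println(n, err)
}
```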
Steps to Reproduce (for bugs)
Context
We would like to be able to have 1 queue worker handle multiple functions with different timings for retries (nats redelivery).
We would like the queue workers to be able to understand the realistic maximum number of invocations for a particular function, so as to not hit busy pods.
Your Environment
FaaS-CLI version (full output from faas-cli version):
0.13.13
Docker version docker version (e.g. Docker 17.0.05):
20.10.8
What version and distribution of Kubernetes are you using? kubectl version
server v1.21.3
client v1.22.2
Operating System and version (e.g. Linux, Windows, MacOS):
MacOS
Link to your project or a code example to reproduce issue:
What network driver are you using and what CIDR? i.e. Weave net / Flannel