openfaas / nats-queue-worker

Queue-worker for OpenFaaS with NATS Streaming
https://docs.openfaas.com/reference/async/
MIT License

Queue Worker does not gracefully shut down #114

Closed. kevin-lindsay-1 closed this 1 year ago

kevin-lindsay-1 commented 3 years ago

In a previous conversation, @alexellis and I discussed some items related to the queue worker. One of them was to verify whether the queue worker shuts down gracefully or just abandons its work.

Expected Behaviour

The behavior we discussed and wanted is that the queue worker attempts to shut down gracefully by:

An example of this timing for a sleep function with the following config:

We assume a Kubernetes environment, or an environment with a similar orchestration layer and pattern, and we assume the event that triggers the pod's termination is a graceful shutdown command, such as a Node being drained for maintenance with its workloads rescheduled onto a different Node.

Expected events, with rough timing. The sections in the format [duration] are approximate timings measured from the start of this example timeline:

Current Behaviour

Currently the queue worker exits immediately. I don't even see a log such as "received SIGTERM" or the like. Once the queue worker comes back online, NATS eventually sends the message again.

An example of this timing with the same settings and format as above, functional (non-timing) differences in bold italics:

The two major differences from the above:

Possible Solution
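One possible shape for this, as a minimal sketch rather than the actual queue-worker code: trap SIGTERM, stop accepting new messages, and wait for in-flight invocations up to a deadline before exiting. The `Worker` type and its `TryStart` and `Shutdown` methods are hypothetical names introduced for illustration, the timings are made up, and the NATS subscription handling is deliberately left out. The demo sends SIGTERM to its own process to stand in for the orchestrator, which assumes a Unix-like system.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// Worker is a hypothetical stand-in for the queue worker. It only tracks
// in-flight invocations; it is not the real nats-queue-worker type.
type Worker struct {
	mu       sync.Mutex
	draining bool
	inflight sync.WaitGroup
}

func NewWorker() *Worker { return &Worker{} }

// TryStart runs job in a goroutine unless the worker is draining.
// It reports whether the job was accepted.
func (w *Worker) TryStart(job func()) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.draining {
		return false
	}
	w.inflight.Add(1)
	go func() {
		defer w.inflight.Done()
		job()
	}()
	return true
}

// Shutdown stops accepting new jobs and waits up to timeoutMillis for
// in-flight jobs. It reports whether everything finished in time.
func (w *Worker) Shutdown(timeoutMillis int) bool {
	w.mu.Lock()
	w.draining = true
	w.mu.Unlock()

	done := make(chan struct{})
	go func() {
		w.inflight.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Duration(timeoutMillis) * time.Millisecond):
		return false
	}
}

func main() {
	// Register for SIGTERM (and Ctrl+C) before any work starts.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	w := NewWorker()
	w.TryStart(func() { time.Sleep(200 * time.Millisecond) }) // simulated invocation

	// Demo only: simulate the orchestrator sending SIGTERM to this process.
	go func() {
		time.Sleep(50 * time.Millisecond)
		if p, err := os.FindProcess(os.Getpid()); err == nil {
			p.Signal(syscall.SIGTERM)
		}
	}()

	<-ctx.Done() // blocks until a shutdown signal arrives
	fmt.Println("received shutdown signal, draining in-flight work")
	if w.Shutdown(5000) {
		fmt.Println("all in-flight invocations finished")
	} else {
		fmt.Println("timed out waiting for in-flight invocations")
	}
}
```

In a real worker, the drain deadline would need to fit inside the pod's termination grace period, and the subscription would have to be closed so that no new messages are delivered while draining.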

Steps to Reproduce (for bugs)

Context

We are interested in the timing of jobs, as well as in not duplicating function invocations. If graceful shutdown were implemented, we could expect certain invocations not to wait for the full ack_wait duration before the function is attempted again.
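To make the timing concern concrete, here is a small back-of-the-envelope model, offered as a sketch under stated assumptions rather than a description of how NATS Streaming or the queue worker actually schedule redelivery. It assumes an unacked message is redelivered once ack_wait expires, that a replacement worker is available by then, and that the retry runs to completion. The function names and all numbers are hypothetical.

```go
package main

import "fmt"

// Outcome summarises one message's fate in this simplified model.
type Outcome struct {
	Invocations   int // how many times the function is invoked
	FinishSeconds int // seconds from first delivery to an acked completion
}

// AbruptExit models a worker killed killAt seconds after delivery, while a
// function taking fnSeconds is still running. Assumption: the message is
// redelivered when ackWait expires and the retry then completes.
func AbruptExit(fnSeconds, killAt, ackWait int) Outcome {
	if killAt >= fnSeconds {
		// The invocation had already finished and been acked.
		return Outcome{Invocations: 1, FinishSeconds: fnSeconds}
	}
	return Outcome{Invocations: 2, FinishSeconds: ackWait + fnSeconds}
}

// GracefulExit models a worker that, on SIGTERM at killAt, waits up to
// grace seconds for the in-flight invocation and acks it. If the grace
// period is too short, the outcome is the same as an abrupt exit.
func GracefulExit(fnSeconds, killAt, ackWait, grace int) Outcome {
	if killAt >= fnSeconds || fnSeconds-killAt <= grace {
		return Outcome{Invocations: 1, FinishSeconds: fnSeconds}
	}
	return AbruptExit(fnSeconds, killAt, ackWait)
}

func main() {
	// Hypothetical values: 60s function, SIGTERM at 10s,
	// ack_wait of 120s, 90s grace period.
	fmt.Printf("abrupt:   %+v\n", AbruptExit(60, 10, 120))
	fmt.Printf("graceful: %+v\n", GracefulExit(60, 10, 120, 90))
}
```

With these made-up numbers, the abrupt case invokes the function twice and finishes at 180 seconds (120 + 60), while the graceful case invokes it once and finishes at 60 seconds.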

Your Environment

alexellis commented 2 years ago

Hi @kevin-lindsay-1 - we reviewed this on the office hours call, do you have steps for a repro please?

Steps to Reproduce (for bugs)

Alex

kevin-lindsay-1 commented 2 years ago

I made this a while ago. What is there to repro? I don't think it gracefully shuts down as described at all, nor is it supposed to right now to my knowledge. So, unless the queue worker shouldn't exit when it's invoking a function, there is nothing to repro.

As far as I know, it doesn't shut down gracefully as described, full stop. I've been watching these queue workers for over a year now, and I don't think they have ever once shown any behavior that pointed towards graceful shutdown being implemented.

alexellis commented 2 years ago

> What is there to repro? I don't think it gracefully shuts down as described at all, nor is it supposed to right now to my knowledge. So, unless the queue worker shouldn't exit when it's invoking a function, there is nothing to repro.

I'd say: something that proves that this behaviour is the case, and what impact a non-graceful shutdown might have?

What would the minimum useful setup be to demo its impact and suggest what benefits a graceful shutdown could bring?