Closed kevin-lindsay-1 closed 1 year ago
Hi @kevin-lindsay-1 - we reviewed this on the office hours call, do you have steps for a repro, please?
Alex, I made this a while ago. What is there to repro? I don't think it gracefully shuts down as described at all, nor is it supposed to right now to my knowledge. So, unless the queue worker shouldn't exit when it's invoking a function, there is nothing to repro.
As far as I know, it doesn't gracefully shut down as described, full stop. I've been watching these queue workers for over a year now and I don't think it's ever once initiated any kind of behavior that pointed towards graceful shutdown being implemented.
> What is there to repro? I don't think it gracefully shuts down as described at all, nor is it supposed to right now to my knowledge. So, unless the queue worker shouldn't exit when it's invoking a function, there is nothing to repro.
I'd say: something that proves that this behaviour is the case, and what impact a non-graceful shutdown might have?
What would the minimum useful setup be to demo its impact and suggest what benefits a graceful shutdown could bring?
In a previous conversation @alexellis and I discussed some items related to the queue worker, one of which was to verify whether the queue worker gracefully shuts down, or whether it just abandons its work.
Expected Behaviour
The behavior we discussed and desired was that, upon receiving a shutdown signal, the queue worker attempts to gracefully shut down by finishing its in-flight invocations and acking them before exiting.
An example of this timing for a `sleep` function with the following config: a sleep of `30s`, `[x]_timeout`s of `1m`, and an `ack_wait` of `1m5s`.
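For concreteness, timings like these are normally supplied as configuration. Below is a hypothetical `stack.yml` fragment matching the example; the env var names (`read_timeout`, `write_timeout`, `exec_timeout`) and the queue-worker `ack_wait` setting follow common OpenFaaS convention, but treat them as assumptions rather than values taken from this report:

```yaml
# Hypothetical function config (assumed OpenFaaS-style env var names).
functions:
  sleep:
    image: example/sleep:latest   # placeholder image
    environment:
      read_timeout: 1m
      write_timeout: 1m
      exec_timeout: 1m
# The queue-worker side would set ack_wait (here, 1m5s) on its own
# deployment, not per function.
```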
We assume a Kubernetes environment, or an environment with a similar orchestration layer and pattern, and we assume the event terminating the pod is a graceful shutdown command, such as a Node draining for maintenance and rescheduling its workloads on a different Node.
Expected events with rough timing; the sections in the format `[duration]` are the general timings from the start of this example timeline:

- `[0s]` …
- `[0s]` …
- `[0s]` queue worker receives `SIGTERM` (via drain), a new queue worker is scheduled to replace it
- `[5s]` …
- `[5s]` …
- `[7s]` …
- `[30s]` …
- `[30s]` …
- `[30s]` …
- `[30s]` …
Current Behaviour
Currently the queue worker exits immediately; I don't even see a log such as "received SIGTERM" or the like. Once the queue worker comes back online, NATS eventually sends the message again.
An example of this timing with the same settings and format as above, functional (non-timing) differences in bold italics:

- `[0s]` …
- `[0s]` …
- `[0s]` queue worker receives `SIGTERM` (via drain), a new queue worker is scheduled to replace it
- `[5s]` …
- `[5s]` …
- `[7s]` …
- `[30s]` …
- `[1m5s]` …
- `[1m5s]` …
- `[1m5s]` …
- `[1m35s]` …
- `[1m35s]` …
The two major differences from the above:

- the function ends up being invoked again, duplicating work
- the retried invocation waits out the `ack_wait` duration, meaning a process that should take `30s` instead takes `1m35s` (function duration + `ack_wait` duration)

Possible Solution
Steps to Reproduce (for bugs)
Context
We are interested in the timing of jobs, as well as in not duplicating function invocations. If graceful shutdown were implemented, we could expect certain invocations not to have to wait for the full `ack_wait` duration before the function is attempted again.

Your Environment
- FaaS-CLI version (full output from `faas-cli version`): 0.13.13
- Docker version (`docker version`, e.g. Docker 17.0.05): 20.10.8
- What version and distribution of Kubernetes are you using? (`kubectl version`): server v1.21.3, client v1.22.2
- Operating System and version (e.g. Linux, Windows, MacOS): MacOS
Link to your project or a code example to reproduce issue:
What network driver are you using and what CIDR? i.e. Weave net / Flannel