kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

How can we find the right value of sleep time for a zero-downtime rolling update? #2106

Closed sechunOH closed 3 years ago

sechunOH commented 3 years ago

Hello, I'm using the ALB controller v2.2.0 and an Ingress with the instance target type.

I want to know how to do a rolling update with zero downtime.

I ran some tests:

1. Without a preStop hook and with the default terminationGracePeriodSeconds (30s)
- some 502 errors during the rolling update

2. preStop hook sleep: 40s, terminationGracePeriodSeconds: 70s
- 502 errors are very rarely found

3. preStop hook sleep: 70s, terminationGracePeriodSeconds: 100s
- 502 errors are not found (yet)
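
For reference, here is roughly how the test-2 settings (preStop sleep 40s, terminationGracePeriodSeconds 70s) can be applied; a minimal sketch where the Deployment name my-app and container index 0 are placeholders:

# Sketch only: apply the test-2 values to an existing Deployment.
# "my-app" and containers/0 are placeholders; the container image must provide "sh".
kubectl patch deployment my-app --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/terminationGracePeriodSeconds", "value": 70},
  {"op": "add", "path": "/spec/template/spec/containers/0/lifecycle",
   "value": {"preStop": {"exec": {"command": ["sh", "-c", "sleep 40"]}}}}
]'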

According to these comments, is 40s not an appropriate setting? https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1719#issuecomment-743437832, https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1719#issuecomment-743423433 (controller process time + ELB API propagation time + HTTP req/resp RTT + kube-proxy's iptables update time)

How can I find the right value for the sleep time? A longer sleep time means more containers running simultaneously during the rollout, I think.

Also, is the sleep time unrelated to the deregistration delay of the target group? (The target group used in the tests above is set to a 300s deregistration delay, but a 70-second sleep is enough to remove the 502 errors.)

Expected outcome

Zero-downtime deployment without 502 errors.

Environment

Additional Context:

Test script (macOS):

#!/bin/bash
# Hit the endpoint in a loop and count non-200 responses during a rollout.
# On macOS, gdate comes from GNU coreutils (brew install coreutils).

test_url="{test_url}"   # replace with the endpoint behind the ALB

count_ok=0
count_not_ok=0

while true; do
  begin_time=$(gdate +%s%3N)   # epoch time in milliseconds
  status_code=$(curl --write-out '%{http_code}' --silent --output /dev/null "$test_url")
  end_time=$(gdate +%s%3N)
  elapsed=$((end_time - begin_time))

  echo "StatusCode: $status_code, elapsed: $elapsed msec"
  if [ "$status_code" -eq 200 ]; then
    count_ok=$((count_ok + 1))
  else
    count_not_ok=$((count_not_ok + 1))
  fi
  echo "200: $count_ok, not 200: $count_not_ok"
done
M00nF1sh commented 3 years ago

@sechunOH

- controller process time: depends on the size/load of your cluster; you should be able to get it from the controller's metrics.
- ELB API propagation time: we checked with the ELB team and they don't have a P99 for this, but it should be less than 60 seconds for ALB.
- HTTP req/resp RTT: depends on your application.
- kube-proxy's iptables update time: ranges between 10-30 seconds, so take it as capped at 30 seconds.

So 40 seconds is not an appropriate setting, and we currently don't offer an optimal value because of all the variables above. You should tune it according to your application and cluster usage.
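
Put together, these components give a rough lower bound for the preStop sleep; a back-of-the-envelope sketch in which every number is a placeholder to be replaced with measurements from your own cluster and application:

# Rough estimate only; all values below are placeholders, not recommendations.
controller_process=10   # controller reconcile latency, taken from its metrics (s)
elb_propagation=60      # worst-case figure quoted above for ALB (s)
http_rtt=1              # longest expected request/response time (s)
iptables_update=30      # kube-proxy iptables sync, capped upper bound (s)

prestop_sleep=$((controller_process + elb_propagation + http_rtt + iptables_update))
echo "preStop sleep >= ${prestop_sleep}s"
echo "terminationGracePeriodSeconds > ${prestop_sleep}s plus the app's own shutdown time"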

sechunOH commented 3 years ago

@M00nF1sh Thank you for the details. Which metric should I monitor for controller process time? Should I just keep watching the ALB controller logs?
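
One way to look at this, assuming a Helm-default install (controller Deployment aws-load-balancer-controller in kube-system, metrics served on port 8080; adjust if your install differs): the controller is built on controller-runtime, so reconcile latency should appear in the standard controller-runtime Prometheus metrics. A sketch:

# Sketch, assuming defaults: adjust namespace, deployment name, and port as needed.
kubectl -n kube-system port-forward deployment/aws-load-balancer-controller 8080:8080 &
pf_pid=$!
sleep 2

# controller-runtime exposes reconcile duration as a Prometheus histogram.
curl -s localhost:8080/metrics | grep controller_runtime_reconcile_time_seconds

kill "$pf_pid"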

< additional questions > I tested some more cases, but I couldn't find anything about this.

During a rolling update, the pod status changes like this:

Running -> Terminating -> Terminated

In more detail, between "Terminating" and "Terminated":

Terminating -> (preStop hook time) -> SIGTERM sent -> (SIGKILL sent if not terminated) -> Terminated

I set the preStop hook sleep to 150s (over 2 minutes), but health check requests kept arriving until the Node.js application was terminated by SIGTERM.

Does the ALB deregister pods in "Terminating" status when the "instance" target type is used? If not, how can I redeploy applications with zero downtime using a preStop hook?

I think there is no way for an application to be aware that it is in the preStop hook, so it cannot reject health check requests from the ALB. In the end, the application is terminated without responding to some requests (if it has no graceful shutdown logic).

Couldn't I redeploy with zero downtime using just k8s and the ALB? (Of course, I can implement SIGTERM handling logic (in my case, SIGINT from PM2), set the keep-alive timeout higher than the ALB's idle connection timeout, and use PM2's kill timeout for graceful shutdown.)

What is the expected behaviour of k8s and the ALB controller during the preStop duration? (I think pod readinessGates only apply to the IP target type, right?)

The most important question is: at what point does the ALB stop health checking a "Terminating" pod?
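
On the readinessGates point above: with the IP target type, the controller can inject pod readiness gates so a rollout only proceeds once new targets are healthy in the target group (this addresses registration of new pods rather than deregistration of terminating ones). Per the controller docs it is enabled with a namespace label; a sketch, with the namespace name as a placeholder:

# Sketch: opt a namespace in to readiness gate injection (IP target type only).
kubectl label namespace my-app elbv2.k8s.aws/pod-readiness-gate-inject=enabled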

sechunOH commented 3 years ago

@M00nF1sh I tested it more and realized it does take some seconds for the ALB to deregister "Terminating" pods.

In my comment above, I was confused about why the ALB kept health checking a terminating pod. After some tests, I realized it was not the ALB's health check but the readiness probe (I had set the same path for the readiness probe and the target group health check).

In our case, a 5-second preStop sleep is enough; after that, the ALB no longer sends traffic to "Terminating" pods. After the preStop hook, I handle the SIGTERM signal for a graceful shutdown of our Node.js applications (draining keep-alive connections, setting the keep-alive timeout higher than the ALB idle connection timeout, etc.).
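
As a reference for the keep-alive point: the application's keep-alive timeout should exceed the ALB's idle timeout. A sketch for reading that idle timeout with the AWS CLI; the "k8s-" name filter is only a heuristic for controller-created load balancers, so substitute the real ARN if you know it:

# Sketch: look up a controller-provisioned ALB and read its idle timeout.
alb_arn=$(aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?starts_with(LoadBalancerName, 'k8s-')] | [0].LoadBalancerArn" \
  --output text)

aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn "$alb_arn" \
  --query "Attributes[?Key=='idle_timeout.timeout_seconds'].Value" \
  --output text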

I'm very thankful for the conversation. Good luck.

MatthiasWinzeler commented 2 years ago

FWIW, if someone faces the same issue and stumbles upon this thread:

We ran into the same issue and contacted AWS support. The statement was that it can indeed happen that, after deregistration, the ALB still sends new requests to the target. This should be compensated for with a preStop sleep; they recommended 60 seconds to be on the safe side.

With 60 seconds, our load tests did not show any 502 errors during rolling upgrades.

We were also told that the issue (the ALB sending requests to draining targets) should be fixed in the future, so we expect to eventually be able to decrease the preStop sleep.

jyotibhanot commented 2 years ago

@MatthiasWinzeler: What should the value of the preStop hook be? Some posts suggest terminationGracePeriodSeconds > preStop sleep > deregistration delay, while others suggest the preStop sleep only needs to cover controller process time + ELB API propagation time + HTTP req/resp RTT. How can we calculate the preStop hook value?

MatthiasWinzeler commented 2 years ago

@jyotibhanot We don't have long-running requests (which I think would require respecting the deregistration delay). So for us, only terminationGracePeriodSeconds > preStop and controller process time + ELB API propagation time + HTTP req/resp RTT appear to matter.

To figure it out for your use case, AWS recommended simply testing with your applications under realistic load.
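
One way to run such a test, sketched with placeholder names: drive traffic with the curl loop from the issue description (saved locally, e.g. as check_status.sh) while triggering a rolling restart, and watch the non-200 counter.

# Sketch: exercise a rolling update while the load loop above is running.
# "check_status.sh" and "my-app" are placeholder names.
./check_status.sh &
load_pid=$!

kubectl rollout restart deployment/my-app
kubectl rollout status deployment/my-app --timeout=10m

kill "$load_pid"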