
teleport-cluster should cycle pods more slowly #13129

Closed programmerq closed 2 years ago

programmerq commented 2 years ago

Current Behavior:

Doing a helm upgrade for a teleport-cluster instance should cause minimal disruption to connected agents. Currently, the pods can be cycled very quickly, which can mean that agents drop to zero active tunnel connections, causing them to go into a "degraded" state themselves. See also: #13122

The helm chart sets no particular deployment strategy for updates. The default deployment strategy in Kubernetes is RollingUpdate with a maxSurge of 25% and a maxUnavailable of 25%. The problem is that if the new pods become healthy very quickly, the old pods can be killed just as quickly, which is more difficult for agents to keep up with.
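For reference, here's roughly what those implicit defaults look like when written out on a Deployment (the metadata, labels, and image below are illustrative, not the chart's rendered output):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: teleport-cluster            # illustrative name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate             # Kubernetes default when no strategy is set
    rollingUpdate:
      maxSurge: 25%                 # default: up to 25% extra pods during a rollout
      maxUnavailable: 25%           # default: up to 25% of pods may be unavailable
  selector:
    matchLabels:
      app: teleport-cluster
  template:
    metadata:
      labels:
        app: teleport-cluster
    spec:
      containers:
        - name: teleport
          image: teleport           # placeholder image reference
```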

Agents have a reconnect backoff that starts at a couple of seconds for the first retry and gets longer on each subsequent attempt.

Expected behavior:

The kubernetes deployment strategy should take into account the pause that many agents may need in order to get connected to the new pod. It doesn't necessarily need to ensure that every agent is able to establish a connection, but there should be ample time for new tunnel connections to be established before proceeding to the next pod in the rolling update.

This could be accomplished by adding the minReadySeconds field to the Deployment/StatefulSet spec in the teleport chart templates. A value of 15 would give agents plenty of time to attempt a connection at least once or twice.
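As a rough sketch (not the chart's actual template), the relevant field sits at the top level of the Deployment spec, next to replicas and strategy:

```yaml
# Deployment spec fragment; values are illustrative
spec:
  replicas: 3
  minReadySeconds: 15        # wait 15s after a pod turns Ready before counting it as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
```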

I tested minReadySeconds: 15 on a deployment with three replicas, and it definitely spaced out the pod cycling and gave the agents a chance to connect to each new pod. I could also see from the corresponding EndpointSlice that each new pod was listed as soon as its checks returned ready, and the rollout waited another 15 seconds before cycling the next pod. In other words, the new pod is reachable through the Kubernetes service during those 15 seconds; the delay does not make agents wait an extra 15 seconds before they can connect.

A cleaner implementation than minReadySeconds would be wonderful, but it appears to be a quick and easy way to help reduce the disruption for intentional deploys.

Additionally, the strategy for deployment and statefulsets as a whole should be configurable by the end user.
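One hypothetical way to expose this through the chart (the key names below are made up for illustration and are not the chart's current interface):

```yaml
# values.yaml (hypothetical keys)
minReadySeconds: 15
deploymentStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 25%
```

The Deployment/StatefulSet templates could then pass these values straight through, for example with `minReadySeconds: {{ .Values.minReadySeconds }}` and a `toYaml` of the strategy block.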

Bug details:

gz#5346

webvictim commented 2 years ago

Sounds like a PR setting minReadySeconds: 15 would be a good idea in the first instance.

hugoShaka commented 2 years ago

I don't think minReadySeconds is meant to be used like this nor that it would solve the root cause: agents are failing to reconnect quickly. Semantically the readiness passing means the pod is ready to receive traffic. Pods should receive traffic as soon as they become ready (and if the readiness reports ready too early, the issue lies with the readiness). In our case we don't want to delay the new pods, we want to avoid removing the old pods too quickly.

Another issue likely happening during the fast rollouts you observed is that whatever is routing traffic (load balancers, kube-proxy, etc.) is not keeping up with the rollout, causing requests to be routed to already-dead pods, or the load-balancing pool to become empty, resulting in a ~20-second downtime. This can be addressed:

For those reasons I would suggest doing the following changes:

programmerq commented 2 years ago

The root cause isn't necessarily that agents fail to reconnect quickly; that's an intentional design of this type of system. Agents back off on retries to avoid exacerbating issues with an endpoint that may already be overwhelmed.

minReadySeconds specifically gives buffer time after a pod becomes ready before the next pod is killed. This is pretty much exactly what the agents want. It gives agents a few tries to connect to the new pod before they lose another connection and reduce their own availability further.

It is indeed also very possible that the LoadBalancers and other components could be exacerbating the issue. The pod readiness gate and cnlb options you mentioned look great! That seems like it would help out with those exacerbating concerns for sure.

The preStop hook would likely help the end-user experience, since their connections would be routed to pods that aren't about to be killed in those last 30 seconds. It wouldn't make agents connect to a new pod any faster once it's up and running, though. It won't help in the case where a pod is newly up and running and may be missing several tunnels: any user traffic that lands on a fresh proxy instance is much more likely to hit an error due to no tunnel found.

The minReadySeconds will still help give reconnecting agents a chance to connect to a new pod before the deployment proceeds. Without that buffer, there will very quickly be several brand new pods all missing several of the agent tunnels. Yes, the user traffic can reach the proxy, but those proxies can't reach the resources the users want to access until those tunnels are in place.

programmerq commented 2 years ago

I'm also doing a bit more brainstorming on this concern. The proxies handle multiple types of traffic, and the readiness of a pod isn't uniform across those traffic types.

Ideally, a user wouldn't get routed to a pod until that pod has at least a large percentage of expected active tunnels.

In order to support that goal, that new pod should be weighted favorably when a tunnel connection request comes in.

As far as I'm aware, native Kubernetes load balancing and networking isn't able to weight traffic to specific pods behind the same service like that. This is especially the case when multiplexing is introduced. An additional component would need to be introduced to do the layer 7 routing algorithm described above.

It'll be interesting to see how proxy peering changes this concern.

For now, templating out a minReadySeconds from the values of the chart with a sane low default would be very much appreciated!

hugoShaka commented 2 years ago

> It won't help in the case where a pod is newly up and running and may be missing several tunnels: any user traffic that lands on a fresh proxy instance is much more likely to hit an error due to no tunnel found.

Ah, I'm starting to understand. I'm not familiar with how agents are opening tunnels with the proxies. In non-proxy-peering mode, afaik the agents have to open a tunnel with each proxy pod individually. How are they discovering the proxies' individual addresses? I did not find any resource in the chart, like a headless service, that could allow agents to connect to multiple proxies. I guess we are not brute-forcing our way through the LB until we have an open connection with every proxy instance.

programmerq commented 2 years ago

> I guess we are not brute-forcing our way through the LB until we have an open connection with every proxy instance.

The agents are indeed brute forcing their way through the load balancer to establish tunnels with each proxy.

An agent will get an updated list of proxies as they are added. As long as there are more proxies than the agent has tunnels, it will make new connections over and over until it has a tunnel to each proxy.

In a case where one proxy pod in a three-proxy cluster is cycled, the agent will be aware of four proxies but will have only two active tunnels. Since only three proxies are actually running, each connection attempt has a 33% chance of reaching the new one. My statistics are a bit rusty, but I believe that with a 1/3 chance of getting the correct one, the average number of connection attempts per agent will be three. Additionally, those agents will keep attempting connections because they see they don't have a connection to the now-dead proxy. Once the dead proxy expires from the backend, the agents will see that they have tunnels to every proxy and stop their retry loops.
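For what it's worth, the "average of three attempts" intuition matches the expectation of a geometric distribution, assuming each attempt independently lands on a uniformly random running proxy:

```math
p = \tfrac{1}{3}, \qquad \mathbb{E}[\text{attempts}] = \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p = \frac{1}{p} = 3
```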

I actually wrote this problem statement on a statistics/math forum when trying to calculate the correct minReadySeconds value for a given number of replicas:

> Hello!
>
> I work with computer systems, and my calculus/statistics is super rusty.
>
> Basically, I have a system that will retry a task that fails. That task will only succeed _s_ percent of the time. Every iteration will pause a random amount of time between _2n_ and _4n_ seconds, where _n_ is the retry count.
>
> I am picturing writing a probability distribution function that describes how many seconds it will take for an arbitrarily large set of attempts.
>
> Since there is always a chance of failure on every _n_th attempt, I believe that the amount of time will reach infinity very slowly.
>
> Since I already know that I won't be able to calculate the amount of time to guarantee success, I'd like to pick another point on the graph, like perhaps 50% or 90% of all attempts having completed.
>
> This feels like a calculus and probability problem that I don't remember how to construct. I ended up on this Wikipedia page, and it looks similar to the math I remember taking that would be relevant: https://en.wikipedia.org/wiki/Probability_density_function
>
> ----
>
> The second part of this problem is the _s_ value.
>
> In my program, it is trying to connect to a specific IP address which drops me onto one of a certain number of systems. Once it successfully connects to one, I want the program to retry and repeat until it has connected to all of the instances behind that one IP address.
>
> If there are three instances, then the first attempt will succeed for sure.
>
> Then it'll have a 2/3 chance to connect to any remaining desired connection. It'll reset its try count for this new round.
>
> When it finally succeeds in connecting to the second of three, it will reset its try count again and have a 1/3 chance to connect to the last node.
>
> This makes it feel like it's really stacking three probability distributions and their associated elapsed time to get the actual programmatic distribution for number of seconds elapsed for a certain threshold to have succeeded.
>
> As the instance count goes up (maybe I have 7 instead of just 3), I'm less confident in how to properly calculate things.
>
> ----
>
> Here's a possible individual sequence of events. I need to connect to four backends: A, B, C, and D.
>
> First attempt, it connects to backend C. This is a success.
>
> While each connection attempt does take some time, it is negligible for this problem.
>
> Now it moves on to try to connect to A, B, or D:
>
> * attempt n=0
>   * connects to backend C. This is a failure.
> * attempt n=1
>   * calculates how long to wait (random value between 2-4 seconds): 3.3 seconds
>   * connects to backend A. This is a success.
>
> Total wait time of 3.3 seconds.
>
> Now it must connect to B or D:
>
> * attempt n=0
>   * connects to backend C. This is a failure.
> * attempt n=1
>   * calculates how long to wait (random value between 2-4 seconds): 2.4 seconds
>   * connects to backend C. This is a failure.
> * attempt n=2
>   * calculates how long to wait (random value between 4-8 seconds): 7.2 seconds
>   * connects to backend B. This is a success.
>
> This took a total of 9.6 seconds.
>
> Now it must retry until it successfully connects to D:
>
> * attempt n=0
>   * connects to backend D. This is a success.
>
> This example sequence didn't have many retries, and only introduced 12.9 seconds of wait time to the whole operation.
>
> ----
>
> I can visualize what a function/graph might look like for some of the steps, but I don't know how to express all of these behaviors in one function.
>
> I'm trying to be smart about how long to pause between steps in an upgrade procedure based on the available information (the exponential backoff algorithm and the number of servers that clients need to connect to). Pausing for the correct amount of time will allow for less disruption to the procedure I'm doing, without making it take an arbitrarily long amount of time.
>
> Let me know if I can clarify anything else, and thanks in advance!

Someone pointed out that this was similar to the coupon collector's problem, but the exponential backoff makes it more complicated.

I was looking into it to see if we could calculate the optimal minReadySeconds on the fly. I figure our values.yaml could just give some high-level advice on how to choose minReadySeconds based on replica count: the higher the value, the more time agents have to re-establish tunnels, but the longer the entire deploy operation will take.
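A sketch of what that advice could look like as a values.yaml comment (hypothetical wording and key, not the chart's current contents):

```yaml
# values.yaml (hypothetical)
# Minimum time (in seconds) a new proxy pod must stay Ready before the rollout
# proceeds to the next pod. Higher values give reverse-tunnel agents more time
# to reconnect to each new pod, but make the overall deploy take longer.
# Rough guidance: ~15s is a sane default for small replica counts; increase it
# as the replica count grows.
minReadySeconds: 15
```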

hugoShaka commented 2 years ago

Wow, that is an interesting approach. I'll definitely take a look and do the math to estimate the adequate wait time.

> The minReadySeconds will still help give reconnecting agents a chance to connect to a new pod before the deployment proceeds.

Based on your explanation about the connection strategy, I don't think this statement is correct. The new pod will be created but will stay unready for minReadySeconds. This means agents will try to connect to it but will never manage, because, as the pod is not ready, it will be kept out of the loadbalancing pool. After minReadySeconds, the pod will become ready and reachable (both by users and agents) but the agents will have tried a couple of times already, consuming the early retries of the exponential backoff. For this strategy to work, we would need to connect agents through a service with publishNotReadyAddresses.
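For reference, a minimal sketch of such a Service (the name, selector, and port are illustrative; publishNotReadyAddresses is the standard Kubernetes Service field):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: teleport-agent-tunnels      # illustrative name
spec:
  publishNotReadyAddresses: true    # include pods in the endpoints even before they are Ready
  selector:
    app: teleport-cluster
  ports:
    - name: reversetunnel
      port: 3024                    # illustrative; the tunnel port when not multiplexed
      targetPort: 3024
```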

I think there are two issues here:

I don't think we can easily address the first one, maybe by adding a second LB that doesn't check readiness (but that would not work with TLS multiplexing). I still think the second issue should be addressed with a preStop hook and a surge update instead of minReadySeconds.
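A rough sketch of the preStop-plus-surge idea (the sleep duration, grace period, and the assumption that the image has a shell/sleep binary are all illustrative, not what the chart currently ships):

```yaml
# Deployment fragment (sketch)
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                   # bring a new pod up before removing an old one
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: teleport
          lifecycle:
            preStop:
              exec:
                # keep the old pod serving while it is drained from the
                # load-balancing pool, so in-flight traffic isn't cut off
                command: ["/bin/sh", "-c", "sleep 30"]
```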

programmerq commented 2 years ago

I actually did testing with minReadySeconds yesterday to confirm the behavior.

I observed that as soon as a new pod passed its readinessProbe, its IP address was immediately listed in the EndpointSlice and it started receiving traffic immediately. The deployment rollout would pause for the number of seconds specified by minReadySeconds before taking the next pod down.

The wording here doesn't explain the distinction between ready and available, which is what prompted me to do the test.

When a pod is Ready, that alone is enough for it to be included in the endpoints of a matching service.

The concept of "availability" refers to the Deployment or StatefulSet update/rollout procedure. Availability is tracked during a rollout so that constraints like maxUnavailable can be satisfied. For a pod to be counted as available, it needs to have been created, scheduled, and started, its initialDelaySeconds must elapse, its liveness and readiness probes must pass, and then minReadySeconds must elapse. The service will have the ready pod added to it at the beginning of the minReadySeconds window, but the rollout will pause until it considers that pod truly available as far as the Deployment or StatefulSet update is concerned.

Or to put it another way: kubectl get deployment and kubectl get statefulset show an AVAILABLE column; the ReplicaSet and Pod objects do not. Services work on pod selectors and have no concept of the status of a corresponding Deployment, ReplicaSet, or StatefulSet.

hugoShaka commented 2 years ago

Makes sense, thanks for taking the time to explain :pray: .

I'll send a PR adding minReadySeconds and setting maxUnavailable to 0.

programmerq commented 2 years ago

I think that maxUnavailable doesn't need to change to zero. Having that set to zero can interfere with some use-cases like having a PVC in extraVolumes/extraVolumeMounts, or custom/standalone chart modes with storage enabled. It can get into a weird state where it won't tear down the old pod yet, but the new one is pending because it can't attach the pv.

hugoShaka commented 2 years ago

I initially checked if chartMode != standalone, but forgot the custom mode. I guess it's safe to set a rolling update strategy if replicas >= 2.
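Something like this in the Deployment template, perhaps (Helm templating sketch; the .Values.highAvailability.replicaCount path and the default are assumptions about the values layout, not the chart's actual template):

```yaml
# templates/deployment.yaml (sketch)
spec:
  {{- if ge (int .Values.highAvailability.replicaCount) 2 }}
  minReadySeconds: {{ .Values.minReadySeconds | default 15 }}
  strategy:
    type: RollingUpdate
  {{- end }}
```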

About your stats challenge: I got nerd-sniped and did the math for a single agent.

Let N be the number of pods, Y the number of retries needed, and P the desired probability (0.9 to be 90% sure the agent has reconnected after Y retries). This gives us:

Y >= ( ln(1-P)/ln( (N-1)/N) ) - 1

We can bound the waiting time from above by the worst case: 4n seconds for each retry, n being the retry count.

Let T be the time in seconds (ln is the natural logarithm):

T = 2 * ( ( ln ( 1-P ) / ln ( (N-1)/N )) ** 2  - ( ln ( 1-P ) / ln ( (N-1)/N )) )

This can be simplified, but not by much. I don't think it makes sense to build this into the chart; it's way too complex. I can compute a couple of values for N and P to get an order of magnitude. ~~The probability for all agents to reconnect follows the same formula, but you have to replace P with exp(ln(P)/a), where a is the number of agents.~~
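For anyone checking the algebra, here is the derivation under the same assumptions (a uniform pick among the N pods on every attempt, and the worst-case wait of 4n seconds on retry n):

```math
\begin{aligned}
&\text{Miss probability per attempt: } \tfrac{N-1}{N}, \qquad
1 - \left(\tfrac{N-1}{N}\right)^{\,Y+1} \ge P
\;\Rightarrow\;
Y \ge \frac{\ln(1-P)}{\ln\!\left(\frac{N-1}{N}\right)} - 1 = L - 1,
\quad L := \frac{\ln(1-P)}{\ln\!\left(\frac{N-1}{N}\right)} \\[4pt]
&T_{\max} = \sum_{n=1}^{Y} 4n = 2\,Y(Y+1) = 2\left(L^{2} - L\right)
\end{aligned}
```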

hugoShaka commented 2 years ago

Example results:

| Connection probability | Pods | Retries | Min wait (s) | Max wait (s) |
|---|---|---|---|---|
| 0.5 | 2 | 0.00 | 0.00 | 0.00 |
| 0.5 | 3 | 0.71 | 1.72 | 3.43 |
| 0.5 | 5 | 2.11 | 10.98 | 21.96 |
| 0.5 | 7 | 3.50 | 27.95 | 55.90 |
| 0.5 | 9 | 4.88 | 52.61 | 105.22 |
| 0.7 | 2 | 0.74 | 1.82 | 3.65 |
| 0.7 | 3 | 1.97 | 9.73 | 19.45 |
| 0.7 | 5 | 4.40 | 43.04 | 86.07 |
| 0.7 | 7 | 6.81 | 99.57 | 199.14 |
| 0.7 | 9 | 9.22 | 179.31 | 358.62 |
| 0.9 | 2 | 2.32 | 13.10 | 26.21 |
| 0.9 | 3 | 4.68 | 48.46 | 96.93 |
| 0.9 | 5 | 9.32 | 183.00 | 366.00 |
| 0.9 | 7 | 13.94 | 402.43 | 804.86 |
| 0.9 | 9 | 18.55 | 706.71 | 1413.42 |
| 0.99 | 2 | 5.64 | 69.35 | 138.70 |
| 0.99 | 3 | 10.36 | 224.92 | 449.85 |
| 0.99 | 5 | 19.64 | 790.92 | 1581.83 |
| 0.99 | 7 | 28.87 | 1696.34 | 3392.69 |
| 0.99 | 9 | 38.10 | 2941.13 | 5882.26 |

So, per the table above, 15 seconds gives roughly a 70% probability of reconnection per agent with 3 pods.