Sounds like a PR setting minReadySeconds: 15 would be a good idea in the first instance.
I don't think minReadySeconds is meant to be used like this, nor would it solve the root cause: agents are failing to reconnect quickly. Semantically, the readiness probe passing means the pod is ready to receive traffic. Pods should receive traffic as soon as they become ready (and if the readiness probe reports ready too early, the issue lies with the readiness probe). In our case we don't want to delay the new pods; we want to avoid removing the old pods too quickly.
Another issue that is likely happening during the fast rollouts you observed is that whatever is sending traffic (load balancers, kube-proxy, ...) is not keeping up with the rollout, causing requests to be routed to already-dead pods, or the load-balancing pool to become empty, causing a 20-second downtime. This can be addressed with pod readiness gates and the relevant load balancer options, plus a preStop hook on the proxy pods: that delays the rollout much like minReadySeconds would, but we will be sure we don't lose in-flight requests.
For those reasons I would suggest doing the following changes: add a preStop hook, and set maxUnavailable to 0 and maxSurge to 1 (configurable if people have very large deployments).
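Roughly something like this on the proxy Deployment (the container name and the sleep duration are only illustrative):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove an old proxy before its replacement is Ready
      maxSurge: 1         # roll replacements in one at a time
  template:
    spec:
      containers:
        - name: teleport            # illustrative container name
          lifecycle:
            preStop:
              exec:
                # keep the terminating pod around long enough to drain
                # in-flight connections after it leaves the load-balancing pool
                command: ["sleep", "15"]
```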
The root cause isn't necessarily that agents fail to reconnect quickly -- that's kind of an intentional design of this type of system. Agents back off on retries to avoid making things worse for an endpoint that may already be overloaded.
minReadySeconds specifically gives buffer time after a pod becomes ready before the next pod is killed. This is pretty much exactly what the agents want: it gives them a few tries to connect to the new pod before they lose another connection and further reduce their own availability.
It is indeed very possible that the load balancers and other components are exacerbating the issue. The pod readiness gate and cnlb options you mentioned look great! Those seem like they would help with that part of the problem for sure.
The preStop hook would likely help the end-user experience, since connections would be routed away from pods that are about to be killed during that last 30 seconds. It wouldn't help the agents connect any faster to a new pod once it's up, though, so it won't help the case where a newly running pod is still missing several tunnels. Any user traffic that lands on a fresh proxy instance is much more likely to hit an error due to no tunnel being found.
The minReadySeconds will still help give reconnecting agents a chance to connect to a new pod before the deployment proceeds. Without that buffer, there will very quickly be several brand-new pods, each missing many of the agent tunnels. Yes, user traffic can reach the proxies, but those proxies can't reach the resources the users want to access until those tunnels are in place.
I'm also doing a bit more brainstorming on this concern. The proxies handle multiple types of traffic and the readiness of those pods isn't uniform.
Ideally, a user wouldn't get routed to a pod until that pod has at least a large percentage of expected active tunnels.
In order to support that goal, that new pod should be weighted favorably when a tunnel connection request comes in.
As far as I'm aware, native Kubernetes load balancing and networking isn't able to deal with traffic on the same service like that. This is especially the case when multiplexing is introduced. An additional component would need to be introduced to do the layer 7 routing algorithm described above.
It'll be interesting to see how proxy peering changes this concern.
For now, templating out a minReadySeconds from the values of the chart with a sane low default would be very much appreciated!
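Something along these lines in the chart would probably be enough (the value name is just a placeholder):

```yaml
# values.yaml (value name is a placeholder)
minReadySeconds: 15

# templates/deployment.yaml (excerpt)
spec:
  minReadySeconds: {{ .Values.minReadySeconds }}
```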
It won't however, help a case when a pod is newly up and running and may have several missing tunnels. Any user traffic that lands on a fresh proxy instance is much more likely to hit an error due to no tunnel found.
Ah, I'm starting to understand. I'm not familiar with how agents are opening tunnels with the proxies. In non-proxy-peering mode, afaik the agents have to open a tunnel with each proxy pod individually. How are they discovering the proxies' individual addresses? I did not find any resource in the chart, like a headless service, that could allow agents to connect to specific proxies. I guess we are not brute-forcing our way through the LB until we have an open connection with every proxy instance.
I guess we are not brute-forcing our way through the LB until we have an open connection with every proxy instance.
The agents are indeed brute forcing their way through the load balancer to establish tunnels with each proxy.
An agent will get an updated list of proxies as they are added. As long as there are more proxies than the agent has tunnels, it will make new connections over and over until it has a tunnel to each proxy.
In a case where a three-proxy cluster has one proxy pod cycled, the agent will be aware of four proxies but will have only two active tunnels. Since only three of those are actually running, the agent has a 33% chance of connecting to the new one on each attempt. My statistics are a bit rusty, but I believe that with a 33% chance to get the correct one, the average connection attempt count for each agent will be three. Additionally, those agents will continue to attempt connections since they see that they don't have a connection to the now-dead proxy. Once the dead proxy expires from the backend, the agents will see that they have tunnels to every proxy and stop their retry loops.
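To spell that expectation out (assuming each attempt through the LB lands on a uniformly random running proxy, i.e. succeeds with probability p = 1/3, a geometric trial):

E[attempts until the new proxy is hit] = 1/p = 3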
I actually wrote this problem statement on a statistics/math forum when trying to calculate the correct minReadySeconds value for a given number of replicas:
Someone pointed out that this was similar to the coupon collector's problem, but the exponential backoff makes it more complicated.
I was looking into it to see if we could calculate the optimal minReadySeconds on-the-fly. I figure our values.yaml could just give some high-level advice on how to choose the minReadySeconds based on replica count. Basically, the higher, the more time agents have to re-establish tunnels, but the longer the entire deploy operation will take.
Wow, that is an interesting approach. I'll definitely take a look and do the math to estimate the adequate wait time.
The minReadySeconds will still help give reconnecting agents a chance to connect to a new pod before the deployment proceeds.
Based on your explanation of the connection strategy, I don't think this statement is correct. The new pod will be created but will stay unready for minReadySeconds. This means agents will try to connect to it but will never manage, because, as the pod is not ready, it will be kept out of the loadbalancing pool. After minReadySeconds, the pod will become ready and reachable (both by users and agents), but the agents will have tried a couple of times already, consuming the early retries of the exponential backoff. For this strategy to work, we would need to connect agents through a service with publishNotReadyAddresses.
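For reference, that would mean pointing the agents at a separate Service along these lines (name, selector, and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: teleport-proxy-agents      # illustrative name
spec:
  publishNotReadyAddresses: true   # expose pod IPs before they pass readiness
  selector:
    app: teleport-cluster          # illustrative selector
  ports:
    - name: reversetunnel
      port: 3024                   # illustrative port
      targetPort: 3024
```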
I think there are two issues here: agents wasting their early retries on a new pod they cannot reach yet, and old pods being removed before agents have re-established their tunnels.
I don't think we can easily address the first one (maybe by adding a second LB that doesn't check readiness, but that would not work with TLS multiplexing). I still think the second issue should be addressed with a preStop hook and a surge update instead of minReadySeconds.
I actually did testing with minReadySeconds yesterday to confirm the behavior.
I observed that as soon as a new pod passed its readinessProbe, its IP address was immediately listed in the EndpointSlice and it started receiving traffic immediately. The deployment rollout would pause for the number of seconds specified by minReadySeconds before taking the next pod down.
The wording here doesn't explain the distinction between ready and available, which is what prompted me to do the test.
When a pod is Ready, that's all it takes for it to be included in the endpoints of a matching service.
The concept of "availability" refers to the deployment or statefulset update/rollout procedure. Availability is tracked during a rollout so that constraints like maxUnavailable can be satisfied. For a pod to be counted as available, it needs to have been created, scheduled, and started; initialDelaySeconds has to elapse; the liveness and readiness probes have to pass; and then minReadySeconds has to elapse. The service will have the ready pod added to it at the beginning of minReadySeconds, but the deployment will pause before it considers that pod truly available as far as the deployment or statefulset update is concerned.
Or to put it another way, kubectl get deployment and kubectl get statefulset show an AVAILABLE column. The replicaset and pod objects do not. Services work on pod selectors and don't have a concept of the status of a corresponding deployment, replicaset, or statefulset.
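To make the timeline concrete, a rough sketch (probe path, port, and timings are arbitrary):

```yaml
spec:
  minReadySeconds: 15        # counts toward "available" for the rollout only
  template:
    spec:
      containers:
        - name: teleport             # illustrative name
          readinessProbe:
            # once this passes, the pod is Ready and lands in the EndpointSlice
            # immediately; the rollout waits another 15s before calling it available
            httpGet:
              path: /readyz
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
```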
Makes sense, thanks for taking the time to explain :pray: .
I'll send a PR adding minReadySeconds and setting maxUnavailable to 0.
I think that maxUnavailable doesn't need to change to zero. Having it set to zero can interfere with some use cases, like having a PVC in extraVolumes/extraVolumeMounts, or the custom/standalone chart modes with storage enabled. It can get into a weird state where it won't tear down the old pod yet, but the new one is pending because it can't attach the PV.
I initially checked if chartMode != standalone, but forgot the custom mode. I guess it's safe to set a rolling update strategy if replicas >= 2.
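Something like this in the template, perhaps (value name is illustrative):

```yaml
{{- /* excerpt from templates/deployment.yaml; value name is illustrative */}}
{{- if ge (int .Values.replicaCount) 2 }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
{{- end }}
```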
About your stats challenge: I got nerd-sniped and did the math for a single agent.
Let N be the number of pods, Y the number of retries needed, and P the desired probability (0.9 to be 90% sure the agent reconnected within Y retries). This gives us:
Y >= ( ln(1-P) / ln( (N-1)/N ) ) - 1
We can bound the waiting time from above by the worst case: the wait is at most 4n seconds before retry n, n being the retry number. Let T be the time in seconds, where ln is the natural logarithm:
T = 2 * ( ( ln(1-P) / ln( (N-1)/N ) ) ** 2 - ( ln(1-P) / ln( (N-1)/N ) ) )
This can be simplified, but not by much. I don't think it makes sense to build this into the chart, it's way too complex, but I can compute a couple of values of N and P to get an order of magnitude. ~The probability for all agents to reconnect follows the same formula, but you have to replace P by exp(ln(P)/a), where a is the number of agents.~
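Plugging the retry formula in for, say, N = 3 and P = 0.9:
Y >= ( ln(0.1) / ln(2/3) ) - 1 ≈ 5.68 - 1 ≈ 4.68 retries
which lines up with the 3-pod / 0.9 row below.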
Example results:
Connection probability 0.5, 2 Pods -> 0.00 retries, between 0.00 and 0.00 seconds
Connection probability 0.5, 3 Pods -> 0.71 retries, between 1.72 and 3.43 seconds
Connection probability 0.5, 5 Pods -> 2.11 retries, between 10.98 and 21.96 seconds
Connection probability 0.5, 7 Pods -> 3.50 retries, between 27.95 and 55.90 seconds
Connection probability 0.5, 9 Pods -> 4.88 retries, between 52.61 and 105.22 seconds
Connection probability 0.7, 2 Pods -> 0.74 retries, between 1.82 and 3.65 seconds
Connection probability 0.7, 3 Pods -> 1.97 retries, between 9.73 and 19.45 seconds
Connection probability 0.7, 5 Pods -> 4.40 retries, between 43.04 and 86.07 seconds
Connection probability 0.7, 7 Pods -> 6.81 retries, between 99.57 and 199.14 seconds
Connection probability 0.7, 9 Pods -> 9.22 retries, between 179.31 and 358.62 seconds
Connection probability 0.9, 2 Pods -> 2.32 retries, between 13.10 and 26.21 seconds
Connection probability 0.9, 3 Pods -> 4.68 retries, between 48.46 and 96.93 seconds
Connection probability 0.9, 5 Pods -> 9.32 retries, between 183.00 and 366.00 seconds
Connection probability 0.9, 7 Pods -> 13.94 retries, between 402.43 and 804.86 seconds
Connection probability 0.9, 9 Pods -> 18.55 retries, between 706.71 and 1413.42 seconds
Connection probability 0.99, 2 Pods -> 5.64 retries, between 69.35 and 138.70 seconds
Connection probability 0.99, 3 Pods -> 10.36 retries, between 224.92 and 449.85 seconds
Connection probability 0.99, 5 Pods -> 19.64 retries, between 790.92 and 1581.83 seconds
Connection probability 0.99, 7 Pods -> 28.87 retries, between 1696.34 and 3392.69 seconds
Connection probability 0.99, 9 Pods -> 38.10 retries, between 2941.13 and 5882.26 seconds
So 15 seconds gives about a 70% probability of reconnection per agent.
Current Behavior:
Doing a helm upgrade for a teleport-cluster instance should reduce the disruption to connected agents. Currently, the pods can be cycled very quickly, which can mean that agents drop down to zero active tunnel connections, causing them to go into a "degraded" state themselves. See also: #13122.
The helm chart sets no particular deployment strategy for updates. The default deployment strategy in Kubernetes is RollingUpdate, set to allow a maxSurge of 25% and a maxUnavailable of 25%. The problem is that if the new pods become healthy very quickly, the old pods can be killed equally quickly, which is more difficult for agents to keep up with.
Agents use a reconnect backoff that starts at a couple of seconds for the first retry and gets longer on each subsequent attempt.
Expected behavior:
The kubernetes deployment strategy should take into account the pause that many agents may need in order to get connected to the new pod. It doesn't necessarily need to ensure that every agent is able to establish a connection, but there should be ample time for new tunnel connections to be established before proceeding to the next pod in the rolling update.
This could be accomplished by adding the minReadySeconds field to the Deployment/StatefulSet spec in the teleport chart templates. A value of 15 would give agents plenty of time to at least try 1-2 times to establish a connection.
I tested minReadySeconds: 15 on a deployment with three replicas, and it definitely spaced out the pods and gave the agents a chance to connect to the new pod. I was also able to see in the corresponding EndpointSlice that the new pod was marked ready as soon as its checks passed, and it was another 15 seconds before the next pod was cycled. In other words, the new pod is guaranteed to be reachable through the Kubernetes service for those 15 seconds, as opposed to agents having to wait an extra 15 seconds before they can connect.
A cleaner implementation than minReadySeconds would be wonderful, but it appears to be a quick and easy way to help reduce the disruption for intentional deploys.
Additionally, the strategy for deployment and statefulsets as a whole should be configurable by the end user.
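For instance, the chart's values.yaml could expose something like this (names and defaults are hypothetical):

```yaml
# values.yaml -- names and defaults are hypothetical
minReadySeconds: 15   # buffer after each new pod becomes ready
updateStrategy: {}    # passed through to the Deployment/StatefulSet update strategy
```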
Bug details:
gz#5346