kazimierzbudzyk opened this issue 3 years ago
Here are my findings after reproducing and observing this scenario:
Clients are connecting directly to the NLB which routes traffic according to the following:
- Selects a target from the target group for the default rule using a flow hash algorithm, which is based on:
  - The protocol
  - The source IP address and source port
  - The destination IP address and destination port
  - The TCP sequence number
- Routes each individual TCP connection to a single target for the life of the connection. The TCP connections from a client have different source ports and sequence numbers, and can be routed to different targets.
Since gRPC reuses TCP connections after they have been established, clients are essentially pinned to a particular Auth server. When an Auth server is restarted in this scenario, the NLB eventually detects that the restarted server is unhealthy and reroutes traffic according to the same rules above. The clients then become pinned to a new Auth server. By the time the restarted Auth server is healthy again, it is unlikely to receive much traffic if all the clients have already established connections with the other servers.
In order to redistribute traffic evenly across all Auth servers, there are a few options:
Utilizing client-side load balancing would be a substantial change and wouldn't fully address the request to automatically spread out load after an Auth server restarts. Adding a heuristic to Auth that sheds or denies connections in an attempt to spread load could also lead to usability issues.
The Auth client should have some way to detect a continuous stream of errors and attempt to reconnect to another Auth server. This would allow clients to proactively load balance themselves without relying on the load balancer to detect an outage. In addition, we could have a random group of Auth clients disconnect after some period of time, so that long-lived connections don't keep load pinned to a subset of servers (see the sketch below).
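To illustrate the idea, here is a minimal Go sketch of a client-side wrapper that redials after a streak of errors or after a randomized connection lifetime. All names, thresholds, and durations are hypothetical and not Teleport's actual Auth client code; TLS credentials would be supplied through the dial options.

```go
// Package authlb is a hypothetical sketch of a client-side reconnect policy.
package authlb

import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc"
)

// maxConsecutiveErrors is a made-up threshold after which the client assumes
// its pinned Auth server is unhealthy and redials through the load balancer.
const maxConsecutiveErrors = 5

// Reconnector wraps a gRPC connection and recycles it either after a
// randomized lifetime or after a continuous stream of RPC errors.
type Reconnector struct {
	addr     string
	dialOpts []grpc.DialOption

	conn     *grpc.ClientConn
	errCount int
	expires  time.Time
}

func NewReconnector(ctx context.Context, addr string, opts ...grpc.DialOption) (*Reconnector, error) {
	r := &Reconnector{addr: addr, dialOpts: opts}
	return r, r.redial(ctx)
}

// redial closes the current connection (if any) and dials again. Going back
// through the NLB lets the flow hash pick a fresh target, which is how load
// can spread back onto a restarted Auth server.
func (r *Reconnector) redial(ctx context.Context) error {
	if r.conn != nil {
		r.conn.Close()
	}
	conn, err := grpc.DialContext(ctx, r.addr, r.dialOpts...)
	if err != nil {
		return err
	}
	r.conn = conn
	r.errCount = 0
	// Randomize the connection lifetime (30-60 minutes here) so clients
	// don't all reconnect at the same moment.
	r.expires = time.Now().Add(30*time.Minute + time.Duration(rand.Int63n(int64(30*time.Minute))))
	return nil
}

// Observe is called after each RPC; it recycles the connection on an error
// streak or once the connection's randomized lifetime has elapsed.
func (r *Reconnector) Observe(ctx context.Context, rpcErr error) error {
	if rpcErr != nil {
		r.errCount++
	} else {
		r.errCount = 0
	}
	if r.errCount >= maxConsecutiveErrors || time.Now().After(r.expires) {
		return r.redial(ctx)
	}
	return nil
}
```

A caller would invoke Observe after each RPC so that error streaks and connection age are tracked in one place; jittering the lifetime keeps the whole fleet of clients from reconnecting at once.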
After doing some testing, it does appear that reconnecting gRPC connections spreads the load out. However, doing so in a manner that doesn't impact user experience may not be achievable. A gRPC server can only forcibly close a client connection if it can somehow get a handle to the underlying TCP connection, or via MaxConnectionAge in its gRPC keepalive ServerParameters (https://pkg.go.dev/google.golang.org/grpc/keepalive?utm_source=godoc#ServerParameters). MaxConnectionAge is a global setting on the gRPC server, though, so it would limit all connections. Clients randomly reconnecting on a timed interval does work, but has issues of its own.
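For reference, here is a minimal sketch of setting MaxConnectionAge via grpc.KeepaliveParams. The listen address and durations are illustrative, not Teleport's configuration; gRPC also adds a small random jitter to MaxConnectionAge to spread out reconnects.

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Illustrative listen address.
	lis, err := net.Listen("tcp", ":3025")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Any connection older than this is gracefully closed (GOAWAY),
		// forcing the client to redial through the load balancer. This is
		// a server-wide setting, so it limits every connection.
		MaxConnectionAge: 30 * time.Minute,
		// How long in-flight RPCs may run to completion after the GOAWAY.
		MaxConnectionAgeGrace: 5 * time.Minute,
	}))

	// Service registration omitted.
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```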
Description
What happened: When restarting an HA Teleport Auth cluster, client nodes get redirected to a single node, resulting in uneven load spread. Only a mass restart of client nodes evens it out.
What you expected to happen: After restarting an HA Teleport Auth cluster, client nodes (eventually) end up evenly balanced between the Auth servers, without requiring the client nodes to be restarted.
Reproduction Steps
As minimally and precisely as possible, describe step-by-step how to reproduce the problem.
1. Restart the Auth servers (e.g. `systemctl restart teleport`).
2. Observe how client connections are distributed across the Auth servers (e.g. via the `process_open_fds` metric, or similar).
Server Details
- Teleport version (`teleport version`): 6.1.5
- Server OS (e.g. from `/etc/os-release`): CentOS7