gravitational/teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Uneven auth load after restarting HA auth cluster #7029


kazimierzbudzyk commented 3 years ago

Description

What happened: When the HA Teleport auth cluster is restarted, client nodes get redirected to one node, resulting in an uneven load spread. Only a mass restart of the client nodes evens it out.

What you expected to happen: When the HA Teleport auth cluster is restarted, client nodes (eventually) end up evenly balanced between the auth servers, without requiring the client nodes to be restarted.

Reproduction Steps


  1. Set up Teleport auth in HA mode, with the auth nodes behind an NLB.
  2. Restart one of the auth nodes (systemctl restart teleport).
  3. Observe client nodes being spread unevenly between the auth nodes (via the process_open_fds metric, or similar).

Server Details

rosstimothy commented 2 years ago

Here are my findings after reproducing and observing this scenario:

Clients connect directly to the NLB, which routes traffic according to the following:

  1. Selects a target from the target group for the default rule using a flow hash algorithm. It bases the algorithm on:
    • The protocol
    • The source IP address and source port
    • The destination IP address and destination port
    • The TCP sequence number
  2. Routes each individual TCP connection to a single target for the life of the connection. The TCP connections from a client have different source ports and sequence numbers, and can be routed to different targets.

https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html
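
To make that concrete, here is a toy sketch of how a flow hash maps each connection's tuple to a single target for the life of that connection. This is not AWS's actual implementation, and the backend names are made up purely for illustration:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flowHashTarget is a toy stand-in for the NLB's flow hash: the protocol,
// source/destination address and port, and TCP sequence number are hashed,
// and the result picks exactly one backend for the whole connection.
func flowHashTarget(proto, srcAddr, dstAddr string, srcPort, dstPort, tcpSeq uint32, targets []string) string {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s|%s:%d|%s:%d|%d", proto, srcAddr, srcPort, dstAddr, dstPort, tcpSeq)
	return targets[h.Sum32()%uint32(len(targets))]
}

func main() {
	auths := []string{"auth-1:3025", "auth-2:3025", "auth-3:3025"}
	// Two connections from the same client differ only in source port and
	// sequence number, yet can land on different Auth servers.
	fmt.Println(flowHashTarget("tcp", "10.0.0.5", "10.0.0.100", 50001, 3025, 1111, auths))
	fmt.Println(flowHashTarget("tcp", "10.0.0.5", "10.0.0.100", 50002, 3025, 2222, auths))
}
```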

Since gRPC reuses TCP connections after they have been established, clients are essentially pinned to a particular Auth server. When an Auth server is restarted in this scenario, the NLB eventually detects that the restarted server is not healthy and reroutes traffic according to the same rules above. The clients then become pinned to a new Auth server. By the time the restarted Auth server is healthy again, it is unlikely to receive much traffic, since the clients have already established connections with the other servers.
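
For context on the pinning itself, the sketch below shows how a single grpc.ClientConn, dialed once through the NLB, carries every subsequent RPC over the same TCP connection. The nlb.example.com address is a placeholder, the gRPC health service stands in for Teleport's real API, and insecure credentials are used purely for brevity:

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Dial the NLB once; gRPC keeps the resulting TCP connection open and
	// multiplexes every RPC over it, so the client stays pinned to whichever
	// Auth server the NLB picked when this connection was established.
	conn, err := grpc.Dial(
		"nlb.example.com:3025", // placeholder NLB address
		grpc.WithTransportCredentials(insecure.NewCredentials()), // real deployments use mTLS
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := healthpb.NewHealthClient(conn) // any gRPC service client behaves the same way
	for i := 0; i < 10; i++ {
		// Each call reuses the same connection; the NLB never re-hashes it.
		_, _ = client.Check(context.Background(), &healthpb.HealthCheckRequest{})
	}
}
```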

In order to redistribute traffic evenly across all Auth servers, there are a few options:

  1. Leverage gRPC client-side load balancing instead of relying on an NLB
  2. Make Auth servers cognizant of load and deny or drop connections accordingly
  3. Handle back pressure better and add the ability for a client to potentially close and reconnect after a period of time

Utilizing client-side load balancing would be a substantial change and wouldn't fully address the request to automatically spread out load after an Auth server restarts. Adding a heuristic to Auth to shed or deny connections in an attempt to spread load could also lead to usability issues.

The Auth client should have some way to detect a continuous stream of errors and attempt to reconnect to another Auth server. This would allow clients to proactively load balance themselves without relying on the load balancer to detect an outage. In addition to this, we could add the ability for a random group of Auth clients to disconnect after some period of time, so that long-lived connections don't keep load pinned to a subset of Auth servers.
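
As a rough, hypothetical illustration of that first idea (this is not Teleport's actual client code; the type, threshold, and address are invented for the sketch), a client could track consecutive failures and re-dial through the load balancer once a threshold is crossed:

```go
package authclient

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// reconnectingClient is a hypothetical wrapper: it tracks consecutive RPC
// failures and, once a threshold is hit, drops the pinned connection and
// re-dials through the load balancer so a different Auth server may be chosen.
type reconnectingClient struct {
	addr        string // load balancer address, e.g. "nlb.example.com:3025"
	conn        *grpc.ClientConn
	consecutive int
	threshold   int
}

// observe is called with the result of every RPC made over c.conn.
func (c *reconnectingClient) observe(err error) error {
	if err == nil {
		c.consecutive = 0 // healthy again, reset the counter
		return nil
	}
	c.consecutive++
	if c.consecutive >= c.threshold {
		// A sustained stream of errors: close the existing connection and dial
		// again, giving the NLB a chance to route us to a healthy Auth server.
		_ = c.conn.Close()
		if newConn, dialErr := grpc.Dial(c.addr,
			grpc.WithTransportCredentials(insecure.NewCredentials())); dialErr == nil {
			c.conn = newConn
			c.consecutive = 0
		}
	}
	return err
}
```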

rosstimothy commented 2 years ago

After doing some testing, it does appear that reconnecting gRPC connections spreads the load out. However, doing so in a manner that doesn't impact user experience may not be achievable. A gRPC server can only forcibly close a client connection if it can somehow get a handle on the underlying TCP connection, or via MaxConnectionAge in its gRPC keepalive settings (https://pkg.go.dev/google.golang.org/grpc/keepalive?utm_source=godoc#ServerParameters). MaxConnectionAge is a global setting on the gRPC server, though, and it would limit all connections. Clients randomly reconnecting on a timed interval does work, but has issues of its own.
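
For reference, this is roughly what that keepalive option looks like on a grpc-go server (the listen address and durations are illustrative); because it is set once on the server, it ages out every client connection, which is the limitation noted above:

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":3025") // placeholder listen address
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Forcibly close every client connection after this age so clients
			// re-dial (and can be re-balanced by the NLB).
			MaxConnectionAge: 30 * time.Minute, // illustrative value
			// Grace period to let in-flight RPCs finish before the close.
			MaxConnectionAgeGrace: 5 * time.Minute,
		}),
	)

	// ...register services here...

	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```

When the age limit is reached the server sends a GOAWAY, so well-behaved clients finish in-flight RPCs during the grace period and then re-dial, giving the NLB a chance to pick a different target; the trade-off is that this churns all connections, not just the ones pinned to an overloaded server.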