k3s-io / k3s


Embedded load-balancer behavior is flakey and hard to understand #11334

Open brandond opened 5 days ago

brandond commented 5 days ago

The loadbalancer server list is a bit of a mess. Its behavior has been tinkered with a lot over the last year, but it's still hard to reason about. This has caused a spate of issues.

From a code perspective, the loadbalancer state is directly accessed by a number of functions that all poke at various index vars, current and default server name vars, a list of server addresses, another RANDOM list of server addresses, and a map of addresses to structs that hold state: https://github.com/k3s-io/k3s/blob/cd4ddedbc9782cbe9b5dcc411df2addae7b2f3b4/pkg/agent/loadbalancer/loadbalancer.go#L43-L53

The DialContext function is called whenever a new connection comes in. It holds a read lock while iterating (possibly twice) over the random server list, and servers may be added or removed at any time. The code is VERY hard to read and understand, given the number of variables involved: https://github.com/k3s-io/k3s/blob/cd4ddedbc9782cbe9b5dcc411df2addae7b2f3b4/pkg/agent/loadbalancer/loadbalancer.go#L162-L208
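For readers who don't want to open the link, here is a rough sketch of the pattern described above. It is not the k3s code: `lbState`, `dialAny`, the `healthy` map, and the "healthy servers first, then every server" meaning of the two passes are all assumptions made for illustration, and the dial step is a stub so the example runs without a network.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAllServersFailed = errors.New("all servers failed")

// dialFunc stands in for a real network dial so the sketch runs offline.
// A real dialer would also take a context and return a net.Conn.
type dialFunc func(address string) error

// lbState is a hypothetical, cut-down stand-in for the load-balancer state.
type lbState struct {
	mutex         sync.RWMutex
	randomServers []string        // shuffled server addresses
	healthy       map[string]bool // assumed per-server health flag
	dial          dialFunc
}

// dialAny holds the read lock for the whole search and makes up to two
// passes over the list: healthy servers first (assumption), then every server.
func (s *lbState) dialAny() (string, error) {
	s.mutex.RLock()
	defer s.mutex.RUnlock()

	for pass := 0; pass < 2; pass++ {
		for _, addr := range s.randomServers {
			if pass == 0 && !s.healthy[addr] {
				continue // first pass skips servers marked unhealthy
			}
			if err := s.dial(addr); err == nil {
				return addr, nil
			}
		}
	}
	return "", errAllServersFailed
}

func main() {
	s := &lbState{
		randomServers: []string{"10.0.0.2:6443", "10.0.0.1:6443"},
		healthy:       map[string]bool{"10.0.0.1:6443": true},
		dial: func(address string) error {
			if address == "10.0.0.1:6443" {
				return nil
			}
			return errors.New("connection refused")
		},
	}
	addr, err := s.dialAny()
	fmt.Println(addr, err)
}
```

Even in this reduced form, anything that wants to add or remove a server has to wait for every in-flight dial to release the read lock, and a server that fails in the first pass is dialed again in the second.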

We should simplify the load-balancer behavior so that it functions more reliably, and its functionality is easier to understand and explain.
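As one possible direction (a sketch only, not a proposed patch), the state could live behind a single type whose methods are the only way to touch it. The names `serverList`, `set`, `snapshot`, and `dialFirst` are hypothetical, and the dial step is again a stub rather than a real network call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNoServer = errors.New("no server reachable")

// serverList is a hypothetical type that owns all server state; callers
// only use its methods and never touch indexes or slices directly.
type serverList struct {
	mu      sync.Mutex
	servers []string
}

// set replaces the list, dropping duplicates while keeping order.
func (l *serverList) set(addresses []string) {
	seen := make(map[string]bool, len(addresses))
	next := make([]string, 0, len(addresses))
	for _, a := range addresses {
		if !seen[a] {
			seen[a] = true
			next = append(next, a)
		}
	}
	l.mu.Lock()
	l.servers = next
	l.mu.Unlock()
}

// snapshot returns a copy, so callers can iterate without holding the lock.
func (l *serverList) snapshot() []string {
	l.mu.Lock()
	defer l.mu.Unlock()
	return append([]string(nil), l.servers...)
}

// dialFirst tries each server in the snapshot once and returns the first
// address that the supplied dial function accepts.
func (l *serverList) dialFirst(dial func(address string) error) (string, error) {
	for _, addr := range l.snapshot() {
		if dial(addr) == nil {
			return addr, nil
		}
	}
	return "", errNoServer
}

func main() {
	var l serverList
	l.set([]string{"10.0.0.1:6443", "10.0.0.2:6443", "10.0.0.1:6443"})
	addr, err := l.dialFirst(func(address string) error {
		if address == "10.0.0.2:6443" {
			return nil
		}
		return errors.New("connection refused")
	})
	fmt.Println(addr, err)
}
```

Because the dial loop works on a copy, the lock is held only long enough to copy the slice, and list updates never wait on a slow connection attempt. Health checks and connection tracking are left out here on purpose.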