timebertt / thesis-controller-sharding

Towards Horizontally Scalable Kubernetes Controllers (Study Project)

Additional reference #1

Closed evankanderson closed 8 months ago

evankanderson commented 1 year ago

Hey, someone pointed this thesis out to me because I'd been talking about Knative's controller HA mechanism. This looks like it shares some similarities but also has some differences (in particular, the label-limited watches and the sharder component).

One additional validation mechanism you might be interested in from the Knative implementation is the "chaos duck" component, which periodically terminates the elected leader for a shard. Used during an active loadtest, it has helped to find a few interesting edge cases.

mattmoor commented 1 year ago

There are also two implementations in Knative:

  1. Leader-election-based, where replicas lease shards, and
  2. StatefulSet-based, where replicas own the shard corresponding to their StatefulSet ordinal.

These have different trade-offs. The former is able to fail over fairly quickly, but because leasing is racy and everything could still end up on a single shard, you can't bound the worst-case downtime of a replica failing. In the StatefulSet form, you don't get the fast failover, but you do get guarantees about even key distribution over replicas, which allows you to bound the worst-case availability hit of losing a replica to 1/N vs. a worst case of asymptotically 100%. Lease-based leader election is also extremely chatty with the API server, and our StatefulSet-based Knative Serving distro exhibits ~10+x the API server load at rest vs. some of the stock eventing controllers we run (last time I checked).
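
The 1/N vs. ~100% bound can be illustrated with a small calculation. This is only a sketch: `worstCaseLoss` is a hypothetical helper, and the two ownership slices are made-up examples (one even assignment, one where a single replica happened to win every lease), not measurements.

```go
package main

import "fmt"

// worstCaseLoss returns the largest fraction of shards that becomes
// unavailable when a single replica fails. owner[i] is the replica
// currently responsible for shard i.
func worstCaseLoss(owner []int, replicas int) float64 {
	if len(owner) == 0 || replicas <= 0 {
		return 0
	}
	counts := make([]int, replicas)
	for _, r := range owner {
		if r >= 0 && r < replicas {
			counts[r]++
		}
	}
	worst := 0
	for _, c := range counts {
		if c > worst {
			worst = c
		}
	}
	return float64(worst) / float64(len(owner))
}

func main() {
	// StatefulSet-style: shard i is owned by ordinal i, so losing one
	// replica costs exactly 1/N of the shards.
	even := []int{0, 1, 2, 3}
	fmt.Printf("%.2f\n", worstCaseLoss(even, 4)) // prints 0.25

	// Lease-style worst case: replica 0 won every race, so losing it
	// takes out all shards until the leases are re-acquired.
	skewed := []int{0, 0, 0, 0}
	fmt.Printf("%.2f\n", worstCaseLoss(skewed, 4)) // prints 1.00
}
```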

Knative's implementation also managed to ~entirely hide the mechanics of this from the user-written reconciliation logic using our code-generator. When we rolled this out across Knative's (and Tekton's) controllers a few years ago, the only code changes needed were in a handful of unit test cases, which made me really happy.

Anecdotally, I'm fairly certain that there are some large installations of Tekton that are leveraging the StatefulSet-based leader election, and personally the StatefulSet mechanism is my bias whenever I write new controllers (or repackage existing ones 😉 ).

timebertt commented 8 months ago

Thanks for the hints! I considered them in my Master's thesis that continues the work from this study project: https://github.com/timebertt/masters-thesis-controller-sharding