Over time, the ClusterCacheTracker evolved significantly. While it generally improved, we have a few systematic issues that we should solve:
The current locking and client creation behavior is problematic: whenever a reconciler calls GetClient, it can be stuck for 10 seconds if the workload cluster is unreachable (because GetClient might then try to create a client and time out after 10s). While one reconciler tries to create a client, other reconcilers will keep requeuing (currently with requeueAfter 1m). See also https://github.com/kubernetes-sigs/cluster-api/issues/10819
I think in general there is huge potential to make the locking smarter
The current health checking code starts 1 goroutine for every cluster (so 1k clusters => 1k health check goroutines)
ClusterCacheTracker also requires adding the ClusterCache reconciler to the manager. Because the ClusterCache reconciler is a separate component, this is easily forgotten
I probably missed a few :)
We also have some additional requirements:
We want to expose some information about the health checking state so that controllers can tell when the connection to the workload cluster broke and, e.g., set the RemoteConnectionProbe condition accordingly (xref: https://github.com/kubernetes-sigs/cluster-api/pull/10897)
Tasks:
Follow-up Tasks:
Backlog: