spiffe / helm-charts-hardened

Apache License 2.0
move spire-controller-manager to a separate pod #341

Open drewwells opened 2 months ago

drewwells commented 2 months ago

For background (skip if you know this): ingress controllers and k8s Services only send traffic to pods marked ready. If any container in a pod is not ready, the pod is not ready and no traffic is sent to it. This is what allows zero-downtime rollouts of pods in ReplicaSets.
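
The readiness rule described above can be sketched as a small model. This is an illustration only, not Kubernetes code: the `Container`, `Pod`, `pod_is_ready`, and `service_endpoints` names are hypothetical, and real clusters have extra knobs (readiness gates, `publishNotReadyAddresses`) that are ignored here.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    ready: bool


@dataclass
class Pod:
    name: str
    containers: list[Container]


def pod_is_ready(pod: Pod) -> bool:
    # A pod counts as ready only when every one of its containers is ready.
    return all(c.ready for c in pod.containers)


def service_endpoints(pods: list[Pod]) -> list[str]:
    # A Service only routes to pods that are ready.
    return [p.name for p in pods if pod_is_ready(p)]


# Hypothetical pods: web-1 has one container that is not ready.
pods = [
    Pod("web-0", [Container("app", True), Container("sidecar", True)]),
    Pod("web-1", [Container("app", True), Container("sidecar", False)]),
]

print(service_endpoints(pods))  # only web-0 is listed
```

A single unready sidecar is enough to remove the whole pod from the Service, even though its main container is healthy.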

The spire-server and spire-controller-manager have different roles in SPIRE. spire-server is responsible for the API and serving requests. If it's down, especially in the StatefulSet deployment, SPIRE eventually stops working entirely. spire-controller-manager, however, is responsible for managing CRs in the cluster. If it's down, the impact is more nuanced.

Since these two containers are stuck in the same pod, when either of them is down, the SPIRE backend workload API is down. This will eventually take down the spire-agents' ability to service requests. The controller needs to be moved to a separate pod so its outages do not impact SPIRE itself. I'm facing a problem where a federated endpoint lost its SSL cert. The controller manager is restarting, which causes outages to SPIRE as a whole (not just federation problems):

2024-04-30T12:59:33Z    ERROR   setup   problem running manager {"error": "failed to wait for clusterspiffeid caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.ClusterSPIFFEID"}
2024-04-30T12:59:33Z    DEBUG   events  spire-server-0_84b56424-9ab4-495f-bf58-c2efca64d303 stopped leading     {"type": "Normal", "object": {"kind":"Lease","namespace":"spire-server","name":"8aa27f40.spiffe.io","uid":"5deab3b1-5f1d-4855-8cc5-f15bcdcbbee0","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1272111663"}, "reason": "LeaderElection"}
2024-04-30T12:59:33Z    ERROR   error received after stop sequence was engaged  {"error": "leader election lost"}
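
To make the proposal concrete, here is a rough sketch comparing the current layout with the proposed one while the controller manager is restarting. It is a toy model rather than chart or Kubernetes code: `ready_pods` is a hypothetical helper, and the pod name `spire-controller-manager-0` is made up for illustration (only `spire-server-0` appears in the logs above).

```python
def ready_pods(layout: dict[str, dict[str, bool]]) -> set[str]:
    # layout maps pod name -> {container name: is the container ready?}
    # A pod is ready only if all of its containers are ready.
    return {pod for pod, containers in layout.items() if all(containers.values())}


# Current layout: both containers share one pod, and the controller manager
# is restarting (not ready).
combined = {
    "spire-server-0": {
        "spire-server": True,
        "spire-controller-manager": False,
    },
}

# Proposed layout: the controller manager runs in its own (hypothetically named) pod.
split = {
    "spire-server-0": {"spire-server": True},
    "spire-controller-manager-0": {"spire-controller-manager": False},
}

print(ready_pods(combined))  # no pod is ready, so the server gets no traffic
print(ready_pods(split))     # spire-server-0 stays ready and keeps serving
```

In the split layout a controller manager outage only affects CR reconciliation, while spire-server keeps receiving traffic.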

faisal-memon commented 2 months ago

Created an issue on the controller manager to see if there is interest in supporting this deployment mode. https://github.com/spiffe/spire-controller-manager/issues/363