Open GrahamDumpleton opened 1 year ago
BTW, the gke-autopilot Kubernetes cluster is not recommended for use with Educates. Its automatic scaling of nodes can cause nodes to be evacuated and shut down when they are deemed not to be using enough resources and the control plane decides that it can fit their workloads on a different node. This constant rebalancing done by gke-autopilot would interrupt active workshop sessions, breaking a user's session.
For Educates, clusters using autoscaling, and in particular any that aggressively scale down the number of nodes on the assumption that pods can be interrupted and moved at any time, should thus be avoided.
Need to also explain in the documentation that autoscaling clusters like this should be avoided because of the possible impact on active workshop sessions.
Have added connect and server timeouts in https://github.com/vmware-tanzu-labs/educates-training-platform/commit/285f63692a3f3c90c1ddceb8fca256ffce23039f and will see whether this helps. Will keep the issue open while monitoring.
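For anyone following along, a minimal sketch of what setting connect and server timeouts on a kopf operator can look like. The values and the `apply_watch_timeouts` helper name are assumptions for illustration and are not taken from the commit above; the `settings.watching.connect_timeout` and `settings.watching.server_timeout` fields are the ones described in the kopf configuration docs.

```python
from types import SimpleNamespace

# Hypothetical values; the commit referenced above may use different ones.
CONNECT_TIMEOUT = 60   # seconds allowed to establish the connection to the API server
SERVER_TIMEOUT = 300   # seconds after which the API server is asked to end a watch


def apply_watch_timeouts(settings):
    """Set watch-stream timeouts on a kopf.OperatorSettings-like object."""
    settings.watching.connect_timeout = CONNECT_TIMEOUT
    settings.watching.server_timeout = SERVER_TIMEOUT
    return settings


# In an operator this would be called from a startup handler, for example:
#
#   import kopf
#
#   @kopf.on.startup()
#   def configure(settings: kopf.OperatorSettings, **_):
#       apply_watch_timeouts(settings)

if __name__ == "__main__":
    # Stand-in object so the sketch can run without kopf installed.
    demo = SimpleNamespace(watching=SimpleNamespace())
    apply_watch_timeouts(demo)
    print(demo.watching)
```

With a server timeout set, watch streams get re-established periodically rather than being held open indefinitely, which is presumably why this might help with stalled connections.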
Describe the bug
The autoscaling of nodes on gke-autopilot is triggering an issue with kopf-based operators whereby they don't handle the connection resets and connection stalls that occur when GKE is adding new nodes to the Kubernetes control plane. This has been observed causing the secrets manager to stop working. No issues have been seen with the session manager and training portal at this stage, so need to check whether liveness checks are actually implemented correctly for the secrets manager. Even if liveness checks are working, having to restart operator pods to recover is not ideal.
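As a rough illustration of what a liveness check would need in order to catch a stalled operator rather than only a dead process, here is a sketch of a heartbeat tracker. The `Heartbeat` class and its parameters are hypothetical and are not part of kopf or of the Educates operators; it only shows the idea of reporting unhealthy when no progress has been recorded for too long.

```python
import time


class Heartbeat:
    """Tracks when the operator last made progress (hypothetical helper)."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Record progress, e.g. after a watch event is handled successfully."""
        self._last_beat = self._clock()

    def is_alive(self):
        """Return False once no progress has been recorded within max_age_seconds."""
        return (self._clock() - self._last_beat) <= self._max_age


if __name__ == "__main__":
    # Fake clock so the behaviour can be shown without waiting.
    now = [0.0]
    hb = Heartbeat(max_age_seconds=30, clock=lambda: now[0])
    print(hb.is_alive())  # fresh heartbeat
    now[0] = 31.0
    print(hb.is_alive())  # stale: a liveness endpoint would report failure here
```

How such a check would be exposed to the kubelet's liveness probe is left out here. An operator that sees no events at all would also need a periodic beat from a timer, otherwise an idle but healthy operator would be reported as stalled.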
Possibly related kopf issues:
Additional information
No response