vmware-tanzu-labs / educates-training-platform

A platform for hosting interactive workshop environments in Kubernetes, or on top of a local container runtime.
https://docs.educates.dev
Apache License 2.0

Kubernetes REST API issues for kopf on gke-autopilot. #202

Open GrahamDumpleton opened 1 year ago

GrahamDumpleton commented 1 year ago

Describe the bug

The auto scaling of nodes on gke-autopilot is triggering an issue with kopf-based operators: they aren't handling the connection resets and connection stalls that occur when GKE is adding new nodes to the Kubernetes control plane. This has been observed causing the secrets manager to stop working. No issues have been seen with the session manager or training portal at this stage, so need to check whether liveness checks are actually implemented correctly for the secrets manager. Even if the liveness checks are working, having to restart operator pods to recover is not ideal.

Possibly related kopf issues:

Additional information

No response

GrahamDumpleton commented 1 year ago

BTW, the gke-autopilot Kubernetes cluster is not recommended for use with Educates, as the automatic scaling of nodes can cause nodes to be evacuated and shut down when they are deemed not to be using enough resources and the control plane decides it can fit their workloads on a different node. This constant rebalancing by gke-autopilot would interrupt active workshop sessions, breaking a user's session.

For Educates, clusters using autoscaling should thus be avoided, in particular any that scale down the number of nodes aggressively on the assumption that pods can be interrupted and moved at any time.

GrahamDumpleton commented 1 year ago

Need to also explain in the documentation that auto scaling clusters like this should be avoided because of the possible impact on active workshop sessions.

GrahamDumpleton commented 1 year ago

Have added connect and server timeouts in https://github.com/vmware-tanzu-labs/educates-training-platform/commit/285f63692a3f3c90c1ddceb8fca256ffce23039f and will see whether this helps. Will keep the issue open while monitoring.
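
For anyone following along, here is a minimal sketch of what setting connect and server timeouts on a kopf operator's watch streams can look like. It is not copied from the linked commit: the helper name `apply_watch_timeouts` and the timeout values are hypothetical and chosen only for illustration. It assumes kopf's `settings.watching.connect_timeout` and `settings.watching.server_timeout` options, and uses a stand-in settings object so it runs without kopf installed.

```python
from types import SimpleNamespace

# Illustrative values only (assumptions); the values used in the commit
# are not quoted in this issue.
CONNECT_TIMEOUT_SECONDS = 60   # give up on establishing a connection after this long
SERVER_TIMEOUT_SECONDS = 600   # ask the API server to end each watch after this long


def apply_watch_timeouts(settings,
                         connect_timeout=CONNECT_TIMEOUT_SECONDS,
                         server_timeout=SERVER_TIMEOUT_SECONDS):
    """Set watch-stream timeouts on a kopf-style settings object.

    Hypothetical helper: `settings` is expected to expose a `watching`
    attribute, as kopf.OperatorSettings does.
    """
    settings.watching.connect_timeout = connect_timeout
    settings.watching.server_timeout = server_timeout
    return settings


# In a real operator this would be called from a kopf startup handler:
#
#   import kopf
#
#   @kopf.on.startup()
#   def configure(settings: kopf.OperatorSettings, **_):
#       apply_watch_timeouts(settings)

if __name__ == "__main__":
    # Stand-in for kopf.OperatorSettings so the sketch runs on its own.
    demo = SimpleNamespace(watching=SimpleNamespace())
    apply_watch_timeouts(demo)
    print(demo.watching.connect_timeout, demo.watching.server_timeout)
```

The idea is that a finite server timeout makes the operator re-establish its watch periodically, so a silently stalled connection is eventually replaced rather than left hanging until the pod is restarted.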