temporalio / helm-charts

Temporal Helm charts
MIT License

[Bug] Persistent DNS and Resource Allocation Issues During Temporal Deployment on Kubernetes #416

Closed LukaGiorgadze closed 11 months ago

LukaGiorgadze commented 1 year ago

What are you really trying to do?

I am deploying Temporal on a GKE Autopilot cluster. During deployment I hit DNS resolution issues and node resource allocation problems that prevent some pods, such as temporaltest-frontend, from running. I have tried diagnosing this with various Kubernetes commands and have considered creating a new cluster. I am looking for guidance on how to configure and troubleshoot the deployment so that Temporal runs stably in my Kubernetes environment.

Describe the bug

When I deploy Temporal on my Kubernetes cluster, several pods never reach the Running state. DNS resolution fails, so services such as temporaltest-cassandra cannot be found. In addition, pods such as temporaltest-frontend cannot be scheduled onto nodes because of insufficient memory and CPU. The problems persist despite my attempts to diagnose and fix them with various Kubernetes commands, leaving the deployment unstable.

❯ helm install temporaltest . --timeout 900s --create-namespace --namespace temporaltest
W0823 16:53:15.841008   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-web: defaulted unspecified resources for containers [temporal-web] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:15.874608   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-prometheus-pushgateway: defaulted unspecified resources for containers [prometheus-pushgateway] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:16.084537   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-kube-state-metrics: defaulted unspecified resources for containers [kube-state-metrics] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:16.123693   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-prometheus-server: defaulted unspecified resources for containers [prometheus-server-configmap-reload, prometheus-server] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:16.213583   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-grafana: defaulted unspecified resources for containers [download-dashboards, grafana] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:16.428421   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-admintools: defaulted unspecified resources for containers [admin-tools] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:16.656661   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-frontend: defaulted unspecified resources for containers [check-cassandra-service, check-cassandra, check-cassandra-temporal-schema, check-cassandra-visibility-schema, check-elasticsearch-index, temporal-frontend] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:16.772396   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-prometheus-alertmanager: defaulted unspecified resources for containers [prometheus-alertmanager, prometheus-alertmanager-configmap-reload] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:17.166050   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-worker: defaulted unspecified resources for containers [check-cassandra-service, check-cassandra, check-cassandra-temporal-schema, check-cassandra-visibility-schema, check-elasticsearch-index, temporal-worker] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:17.954024   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-history: defaulted unspecified resources for containers [check-cassandra-service, check-cassandra, check-cassandra-temporal-schema, check-cassandra-visibility-schema, check-elasticsearch-index, temporal-history] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:17.976409   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment temporaltest/temporaltest-matching: defaulted unspecified resources for containers [check-cassandra-service, check-cassandra, check-cassandra-temporal-schema, check-cassandra-visibility-schema, check-elasticsearch-index, temporal-matching] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:18.680914   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated StatefulSet temporaltest/temporaltest-cassandra: defaulted unspecified resources for containers [temporaltest-cassandra] (see http://g.co/gke/autopilot-defaults)
W0823 16:53:19.079877   61298 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated StatefulSet temporaltest/elasticsearch-master: defaulted unspecified resources for containers [configure-sysctl] (see http://g.co/gke/autopilot-defaults), and adjusted resources to meet requirements for containers [elasticsearch] (see http://g.co/gke/autopilot-resources)
Error: INSTALLATION FAILED: 1 error occurred:
    * admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-disallow-privilege]":["container configure-sysctl is privileged; not allowed in Autopilot"]}
Requested by user: 'luka@pavebank.com', groups: 'system:authenticated'.
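
The rejection above is triggered by the privileged `configure-sysctl` init container that the bundled Elasticsearch StatefulSet adds. One possible workaround, sketched here, is to switch that init container off with a values override. The key `elasticsearch.sysctlInitContainer.enabled` is an assumption taken from the upstream Elastic Elasticsearch chart, and the file name `autopilot-values.yaml` is only illustrative; check both against the chart version you have checked out.

```shell
# Write a values override that disables the privileged init container.
# Assumption: the Elasticsearch subchart exposes sysctlInitContainer.enabled.
cat > autopilot-values.yaml <<'EOF'
elasticsearch:
  sysctlInitContainer:
    enabled: false
EOF

# Then pass the override to the same install command (not run here):
# helm install temporaltest . -f autopilot-values.yaml \
#   --timeout 900s --create-namespace --namespace temporaltest
```

With the init container disabled, nothing raises `vm.max_map_count` on the node, so Elasticsearch may fail its own bootstrap check if the node default is below what it requires.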

Minimal Reproduction

  1. Create GKE Autopilot cluster
  2. git clone https://github.com/temporalio/helm-charts
  3. helm dependencies update
  4. helm install temporaltest . --timeout 900s --create-namespace --namespace temporaltest

Environment/Versions

Additional Details

kubectl describe pod -n temporaltest temporaltest-cassandra-0
...
Events:
  Type     Reason            Age                    From                                   Message
  ----     ------            ----                   ----                                   -------
  Warning  FailedScheduling  81m (x2 over 81m)      gke.io/optimize-utilization-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling  80m                    gke.io/optimize-utilization-scheduler  0/7 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 5 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/7 nodes are available: 2 No preemption victims found for incoming pod, 5 Preemption is not helpful for scheduling..
  Normal   Scheduled         80m                    gke.io/optimize-utilization-scheduler  Successfully assigned temporaltest/temporaltest-cassandra-0 to gk3-temporal-cluster-1-pool-2-9a34a308-klkn
  Normal   Pulling           79m                    kubelet                                Pulling image "cassandra:3.11.3"
  Normal   Pulled            79m                    kubelet                                Successfully pulled image "cassandra:3.11.3" in 12.373198163s (12.373421297s including waiting)
  Normal   Created           79m                    kubelet                                Created container temporaltest-cassandra
  Normal   Started           79m                    kubelet                                Started container temporaltest-cassandra
  Warning  Unhealthy         69m (x12 over 77m)     kubelet                                Readiness probe errored: command "/bin/sh -c nodetool status | grep -E \"^UN\\s+${POD_IP}\"" timed out
  Warning  Unhealthy         4m36s (x115 over 77m)  kubelet                                Liveness probe errored: command "/bin/sh -c nodetool status" timed out
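
The events above show the Cassandra pod eventually being scheduled, after which its `nodetool status` probes keep timing out. That pattern can point to a container that is short on CPU or memory, since Autopilot applied default resources to it (see the install warnings). A minimal sketch of an override that sets explicit requests and longer probe timeouts follows. The keys under `cassandra:` are assumed from the Cassandra subchart, and every number is a placeholder, not a tested recommendation.

```shell
# Write a values override for the Cassandra subchart.
# Assumption: the subchart accepts resources, livenessProbe and
# readinessProbe keys; the values below are illustrative only.
cat > cassandra-values.yaml <<'EOF'
cassandra:
  resources:
    requests:
      cpu: "1"
      memory: 4Gi
  livenessProbe:
    timeoutSeconds: 30
  readinessProbe:
    timeoutSeconds: 30
EOF

# Apply it to the existing release (not run here):
# helm upgrade temporaltest . -f cassandra-values.yaml --namespace temporaltest
```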
LukaGiorgadze commented 11 months ago

Closing. The cause was insufficient resources for Elasticsearch. I increased the nodes and added more resources, and it now works.
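
For anyone hitting the same problem, the fix described above could be expressed as a values override along these lines. This is only a sketch: the thread does not say which amounts were used, so the numbers are placeholders, and the `elasticsearch.resources` key is assumed from the upstream Elastic Elasticsearch chart.

```shell
# Write a values override that gives Elasticsearch explicit resources.
# Assumption: the Elasticsearch subchart reads a top-level resources map;
# the amounts are placeholders, not the values used in this issue.
cat > elasticsearch-values.yaml <<'EOF'
elasticsearch:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 2Gi
EOF

# Apply it to the existing release (not run here):
# helm upgrade temporaltest . -f elasticsearch-values.yaml --namespace temporaltest
```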