temporalio / helm-charts

Temporal Helm charts

[Bug] Pods unable to reach Cassandra during Init #373

Closed guptamridul1809 closed 4 months ago

guptamridul1809 commented 1 year ago

What are you really trying to do?

I'm trying to run Temporal using the Helm chart:

helm install --set server.replicaCount=1 --set cassandra.config.cluster_size=1 --set prometheus.enabled=false --set grafana.enabled=false --set elasticsearch.enabled=false temporaltest . --timeout 150m

Describe the bug

Several pods are stuck in the Init state:

(screenshot of pod status)

I let it run for over an hour, but it made no progress.

On further inspection, I found that all the pods are waiting for the Cassandra nslookup to succeed:

(screenshot of the init container wait)
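For context, the stuck init containers run a DNS check along these lines (a sketch only; the exact command comes from the chart's templates, and the service name temporaltest-cassandra is assumed from the release name):

# keep looping until the Cassandra service name resolves inside the cluster
until nslookup temporaltest-cassandra; do
  echo "waiting for cassandra service"
  sleep 1
done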

Minimal Reproduction

Just running

helm install --set server.replicaCount=1 --set cassandra.config.cluster_size=1 --set prometheus.enabled=false --set grafana.enabled=false --set elasticsearch.enabled=false temporaltest . --timeout 150m

gives this error. I tried with the latest master and with release 1.12 to cover an older setup; both setups run into the same issue.

Environment/Versions

OS: (screenshot)

Docker: (screenshot)

Minikube: (screenshot)

Helm: (screenshot)

Additional context

mindaugasrukas commented 1 year ago

Investigating. I see similar behavior with one difference: it's not stuck waiting for the Cassandra nslookup to succeed, but for the schema to be populated.

chaychoong commented 1 year ago

Based on this line, the job that performs the schema setup will only start after everything else is loaded, which can't happen because the rest of the server components depend on the schema already being set up.

Simply removing that line solves everything
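For illustration, the schema-setup Job with the hook annotation looks roughly like this (a sketch only; the real template, hook type, and names in the chart may differ):

apiVersion: batch/v1
kind: Job
metadata:
  name: temporaltest-schema-setup    # hypothetical name, for illustration only
  annotations:
    "helm.sh/hook": post-install     # deleting this annotation makes it a plain Job,
                                     # loaded together with the other chart resources
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: schema-setup
          image: temporalio/admin-tools
          # assumed command; the chart's real schema-setup invocation may differ
          command: ["temporal-cassandra-tool", "setup-schema", "-v", "0.0"]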

@mindaugasrukas

mindaugasrukas commented 1 year ago

I think those hooks are correct. According to this document: https://helm.sh/docs/topics/charts_hooks/#hooks-and-the-release-lifecycle, "loaded" doesn't mean it's blocking.

1. User runs helm install foo
2. The Helm library install API is called
3. CRDs in the crds/ directory are installed
4. After some verification, the library renders the foo templates
5. The library prepares to execute the pre-install hooks (loading hook resources into Kubernetes)
6. The library sorts hooks by weight (assigning a weight of 0 by default), by resource kind and
   finally by name in ascending order.
7. The library then loads the hook with the lowest weight first (negative to positive)
8. The library waits until the hook is "Ready" (except for CRDs)
9. The library loads the resulting resources into Kubernetes. Note that if the --wait flag is set,
   the library will wait until all resources are in a ready state and will not run the post-install hook
   until they are ready.
10. The library executes the post-install hook (loading hook resources)
11. The library waits until the hook is "Ready"
12. The library returns the release object (and other data) to the client
13. The client exits

What does it mean to wait until a hook is ready? This depends on the resource declared in the hook.
If the resource is a Job or Pod kind, Helm will wait until it successfully runs to completion. And if the
hook fails, the release will fail. This is a blocking operation, so the Helm client will pause while the
Job is run.

For all other kinds, as soon as Kubernetes marks the resource as loaded (added or updated),
the resource is considered "Ready".

But I see your point about making it a non-hook job and loading it together with all the other resources.

mindaugasrukas commented 1 year ago

@guptamridul1809, @chaychoong, could you try adding --set debug=true to the helm command and paste the schema-setup and schema-update Job logs? For me, they report success, but the schema is actually not loaded. I want to make sure you are hitting the same issue. Also, this issue is flaky on my side, so I'm unsure whether removing the helm.sh/hook annotation is related here.
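For reference, that would look roughly like this (the Job names below are assumed from the temporaltest release name and may differ on your side):

# reinstall with debug enabled, then collect the schema Job logs
helm install --set server.replicaCount=1 --set cassandra.config.cluster_size=1 \
  --set prometheus.enabled=false --set grafana.enabled=false \
  --set elasticsearch.enabled=false --set debug=true \
  temporaltest . --timeout 150m

kubectl logs job/temporaltest-schema-setup
kubectl logs job/temporaltest-schema-update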

Also, could you paste the DB content for:

kubectl exec service/temporal-admintools -- cqlsh temporal-cassandra 9042 -k temporal -e "SELECT * FROM schema_update_history"

kubectl exec service/temporal-admintools -- cqlsh temporal-cassandra 9042 -k temporal -e "SELECT curr_version FROM schema_version"

kubectl exec service/temporal-admintools -- cqlsh temporal-cassandra 9042 -k temporal_visibility -e "SELECT * FROM schema_update_history"

kubectl exec service/temporal-admintools -- cqlsh temporal-cassandra 9042 -k temporal_visibility -e "SELECT curr_version FROM schema_version"

robholland commented 4 months ago

Closing due to lack of feedback. Please re-open if this issue persists with the current chart version.