temporalio / helm-charts

Temporal Helm charts
MIT License

Pods Stuck in CrashLoopBackoff on Fresh Deployment to Fresh Kubernetes Cluster #470

Closed marc-wilson closed 3 months ago

marc-wilson commented 7 months ago

What are you really trying to do?

Install Temporal locally on Docker Desktop with Kubernetes enabled.

I ran through the steps outlined in the README:

helm dependencies update

Then the minimal setup:

helm install \
    --set server.replicaCount=1 \
    --set cassandra.config.cluster_size=1 \
    --set prometheus.enabled=false \
    --set grafana.enabled=false \
    --set elasticsearch.enabled=false \
    temporaltest . --timeout 15m

This results in four pods stuck in the CrashLoopBackOff state:

NAME                                      READY   STATUS             RESTARTS        AGE
temporaltest-admintools-75bfc76c5-w5lf5   1/1     Running            0               23m
temporaltest-frontend-69fc9f6b79-96blk    0/1     CrashLoopBackOff   5 (2m40s ago)   5m44s
temporaltest-history-66f85dd854-c2hnk     0/1     CrashLoopBackOff   9 (42s ago)     23m
temporaltest-matching-85f879f4c4-qlkmz    0/1     CrashLoopBackOff   4 (43s ago)     2m24s
temporaltest-web-776786467b-8m97g         1/1     Running            0               23m
temporaltest-worker-655c7c6467-zjjgm      0/1     CrashLoopBackOff   4 (63s ago)     2m33s
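
A quick way to pull just the crash-looping pods out of a listing like this is a small filter over the kubectl get pods table. The sketch below is a hypothetical helper (the name crashlooping_pods is not part of the chart) and assumes the default column layout shown above, with STATUS in the third column.

```shell
#!/bin/sh
# Hypothetical helper: print the names of pods whose STATUS column is
# CrashLoopBackOff. Expects the default `kubectl get pods` table on stdin.
crashlooping_pods() {
  # NR > 1 skips the header row; $3 is the STATUS column, $1 the pod name.
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Example usage (assumes kubectl points at the cluster running the release):
#   kubectl get pods | crashlooping_pods | while read -r pod; do
#     kubectl logs "$pod" --previous --tail=20
#   done
```

The --previous flag asks for the logs of the last terminated container, which is usually the one that holds the startup error for a pod that keeps restarting.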

Describe the bug

There seems to be a step missing from the README. Looking at the logs for temporaltest-frontend, I get the error below. It seems I need to configure some sort of data source? I do see a Cassandra instance running...

[Fx] RUN supply: stub([]temporal.ServerOption)
[Fx] RUN provide: go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] Error returned: received non-nil error from function "go.temporal.io/server/temporal".ServerOptionsProvider /home/builder/temporal/temporal/fx.go:173: config validation error: persistence config: datastore "visibility": must provide config for one and only one datastore: elasticsearch, cassandra, sql or custom store
[Fx] ERROR Failed to initialize custom logger: could not build arguments for function "go.uber.org/fx".(*module).constructCustomLogger.func2 /go/pkg/mod/go.uber.org/fx@v1.20.0/module.go:251: failed to build fxevent.Logger: could not build arguments for function "go.temporal.io/server/temporal".glob..func8 /home/builder/temporal/temporal/fx.go:1037: failed to build log.Logger: received non-nil error from function "go.temporal.io/server/temporal".ServerOptionsProvider /home/builder/temporal/temporal/fx.go:173: config validation error: persistence config: datastore "visibility": must provide config for one and only one datastore: elasticsearch, cassandra, sql or custom store
Unable to create server. Error: could not build arguments for function "go.uber.org/fx".(*module).constructCustomLogger.func2 (/go/pkg/mod/go.uber.org/fx@v1.20.0/module.go:251): failed to build fxevent.Logger: could not build arguments for function "go.temporal.io/server/temporal".glob..func8 (/home/builder/temporal/temporal/fx.go:1037): failed to build log.Logger: received non-nil error from function "go.temporal.io/server/temporal".ServerOptionsProvider (/home/builder/temporal/temporal/fx.go:173): config validation error: persistence config: datastore "visibility": must provide config for one and only one datastore: elasticsearch, cassandra, sql or custom store.

marc-wilson commented 7 months ago

Seems to be the same issue as this: https://community.temporal.io/t/errors-while-setting-up-temporal-on-local-environment-using-helm-chart/3088/4

myst3k commented 7 months ago

Just ran into this. You can run this to get up and running:

git reset --hard 1e5ac0c442ba45c2a5aead316aebdad3301c6d81

yourbuddyconner commented 5 months ago

Thanks for the tip, this got the chart working for me @myst3k

armenr commented 4 months ago

I have run into the same exact problem. What should the fix be?

temporal/kartes-test-temporal-history-6648df985d-sh876[temporal-history]: config validation error: persistence config: datastore "visibility": must provide config for one and only one datastore: elasticsearch, cassandra, sql or custom store
temporal/kartes-test-temporal-history-6648df985d-sh876[temporal-history]: Unable to create server. Error: could not build arguments for function "go.uber.org/fx".(*module).constructCustomLogger.func2 (/home/runner/go/pkg/mod/go.uber.org/fx@v1.21.1/module.go:292): failed to build fxevent.Logger: could not build arguments for function "go.temporal.io/server/temporal".glob..func8 (/home/runner/work/docker-builds/docker-builds/temporal/temporal/fx.go:1029): failed to build log.Logger: received non-nil error from function "go.temporal.io/server/temporal".ServerOptionsProvider (/home/runner/work/docker-builds/docker-builds/temporal/temporal/fx.go:180): config validation error: persistence config: datastore "visibility": must provide config for one and only one datastore: elasticsearch, cassandra, sql or custom store.

armenr commented 4 months ago

It looks to me like there's a problem with the configmaps that get generated...

Command we use to install temporal

helm install \
--set server.replicaCount=1 \
--set cassandra.config.cluster_size=1 \
--set prometheus.enabled=false \
--set grafana.enabled=false \
--set elasticsearch.enabled=false \
my-test . --timeout 15m --namespace temporal

The error

config validation error: persistence config: datastore "visibility": must provide config for one and only one datastore: elasticsearch, cassandra, sql or custom store.

When I look at the configmaps...

---
# Source: temporal/templates/server-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "my-test-temporal-worker-config"
  labels:
    app.kubernetes.io/name: temporal
    helm.sh/chart: temporal-0.38.1
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: my-test
    app.kubernetes.io/version: 1.23.1
    app.kubernetes.io/part-of: temporal
data:
  config_template.yaml: |-
    log:
      stdout: true
      level: "debug,info"

    persistence:
      defaultStore: default
      visibilityStore: visibility
      numHistoryShards: 512
      datastores:
        default:
          cassandra:
            hosts: "my-test-cassandra.temporal.svc.cluster.local,"
            port: 9042
            password: "{{ .Env.TEMPORAL_STORE_PASSWORD }}"
            consistency:
              default:
                consistency: local_quorum
                serialConsistency: local_serial
            keyspace: temporal
            replicationFactor: 1
            user: user
        visibility:

    global:
      membership:
        name: temporal
        maxJoinDuration: 30s
        broadcastAddress: {{ default .Env.POD_IP "0.0.0.0" }}

      pprof:
        port: 7936

      metrics:
        tags:
          type: worker
        prometheus:
          timerType: histogram
          listenAddress: "0.0.0.0:9090"

    services:
      frontend:
        rpc:
          grpcPort: 7233
          membershipPort: 6933
          bindOnIP: "0.0.0.0"

      history:
        rpc:
          grpcPort: 7234
          membershipPort: 6934
          bindOnIP: "0.0.0.0"

      matching:
        rpc:
          grpcPort: 7235
          membershipPort: 6935
          bindOnIP: "0.0.0.0"

      worker:
        rpc:
          grpcPort: 7239
          membershipPort: 6939
          bindOnIP: "0.0.0.0"
    clusterMetadata:
      enableGlobalDomain: false
      failoverVersionIncrement: 10
      masterClusterName: "active"
      currentClusterName: "active"
      clusterInformation:
        active:
          enabled: true
          initialFailoverVersion: 1
          rpcName: "temporal-frontend"
          rpcAddress: "127.0.0.1:7933"
    dcRedirectionPolicy:
      policy: "noop"
      toDC: ""
    archival:
      status: "disabled"

    publicClient:
      hostPort: "my-test-temporal-frontend:7233"

    dynamicConfigClient:
      filepath: "/etc/temporal/dynamic_config/dynamic_config.yaml"
      pollInterval: "10s"

Specifically, the problem appears to be here:

      datastores:
        default:
          cassandra:
            hosts: "my-test-cassandra.temporal.svc.cluster.local,"
            port: 9042
            password: "{{ .Env.TEMPORAL_STORE_PASSWORD }}"
            consistency:
              default:
                consistency: local_quorum
                serialConsistency: local_serial
            keyspace: temporal
            replicationFactor: 1
            user: user
        visibility:
        # ^^^ PROBLEM

There's a visibility: key, but nothing follows it.

I will have to look into the Helm chart more deeply to understand why this is the case, but it appears to be the real source of the problem.
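
The same check can be made before installing, by rendering the chart locally and inspecting the visibility datastore. The sketch below is a hypothetical helper (check_visibility_datastore is not part of the chart). It assumes the rendered config uses a bare visibility: key, as in the ConfigMap above, and it only looks at the first such key it finds.

```shell
#!/bin/sh
# Hypothetical helper: reads rendered YAML on stdin and reports whether the
# first bare "visibility:" key has nested settings beneath it.
# Prints "configured", "empty", or "missing".
check_visibility_datastore() {
  awk '
    # Record the indentation of the first bare "visibility:" key.
    !found && /^[ \t]*visibility:[ \t]*$/ {
      indent = match($0, /[^ \t]/) - 1
      found = 1
      next
    }
    # Skip blank lines and comments; the first real line after the key decides.
    found && !decided {
      if ($0 ~ /^[ \t]*(#.*)?$/) next
      child = match($0, /[^ \t]/) - 1
      result = (child > indent) ? "configured" : "empty"
      decided = 1
    }
    END {
      if (!found) print "missing"
      else if (!decided) print "empty"
      else print result
    }
  '
}

# Example usage, run from the chart directory (release name and values are
# the ones used earlier in this thread):
#   helm template my-test . \
#     --set elasticsearch.enabled=false \
#     --show-only templates/server-configmap.yaml | check_visibility_datastore
```

A result of empty corresponds to the situation above: the key is rendered, but no datastore is configured under it, which is what the server's config validation rejects.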

armenr commented 4 months ago

For anyone who cares, this seems to work around the issue: install with Elasticsearch enabled and a single Elasticsearch replica, and it works.

helm install \
  --set server.replicaCount=1 \
  --set cassandra.config.cluster_size=1 \
  --set prometheus.enabled=false \
  --set grafana.enabled=false \
  --set elasticsearch.enabled=true \
  --set elasticsearch.replicas=1 \
  my-test . --timeout 15m --namespace temporal

robholland commented 3 months ago

Cassandra is not a valid visibility backend. If you are using Cassandra for persistence, you currently have to enable Elasticsearch (ES) visibility. The chart does not yet support using SQL for visibility if you use Cassandra for persistence. I will file an issue to see if we can get the error message updated so that it no longer unhelpfully suggests that you can use Cassandra for visibility.
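
The rule described here can be written down as a small pre-install guard. This is only a sketch: the function name and its two arguments are illustrative and are not chart values, and it covers only the Cassandra case discussed in this thread.

```shell
#!/bin/sh
# Hypothetical pre-install guard; the arguments are illustrative, not chart keys.
#   $1: persistence backend in use (for example "cassandra")
#   $2: intended value of elasticsearch.enabled ("true" or "false")
visibility_choice_ok() {
  if [ "$1" = "cassandra" ] && [ "$2" != "true" ]; then
    echo "Cassandra persistence needs Elasticsearch visibility: set elasticsearch.enabled=true" >&2
    return 1
  fi
  # Other combinations are outside what this sketch checks.
  return 0
}

# Example usage:
#   visibility_choice_ok cassandra false || exit 1
```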

s-nilsson commented 3 months ago

Unfortunately, the project README does not provide any such information.

myst3k commented 3 months ago

These are the docs on the Helm chart. I just want to get up and running quickly with a dev setup. It sounds like the docs may just need to be updated to reflect the current requirements.

image

robholland commented 3 months ago

Added a PR to clarify.