tryretool / retool-helm

Error from workflow worker: "Failed to connect to Temporal server" #96

Closed by plumdog 9 months ago

plumdog commented 1 year ago

The pod for the workflow-worker deployment is failing with the following logs:

wait-for-it.sh: waiting for mydatabase.myregion.rds.amazonaws.com:5432 without a timeout
wait-for-it.sh: mydatabase.myregion.rds.amazonaws.com:5432 is available after 0 seconds
not untarring the bundle
{"message":"[process service types] WORKFLOW_TEMPORAL_WORKER","level":"info","timestamp":"2023-05-22T13:26:33.658Z"}

Warning: POSTGRES_SSL_REJECT_UNAUTHORIZED is currently set to 'false'. This will default to 'true' in a future version of Retool, which may break connections to databases with self-signed SSL/TLS certificates. To prepare for this change, either explicitly set POSTGRES_SSL_REJECT_UNAUTHORIZED=false or configure a custom certificate chain by setting POSTGRES_CUSTOM_SSL_CERT_PATH & POSTGRES_CUSTOM_SSL_CA_FILE_NAME (and optionally POSTGRES_CUSTOM_SSL_CERT_FILE_NAME & POSTGRES_CUSTOM_SSL_KEY_FILE_NAME) — see https://docs.retool.com/docs/environment-variables.

{"message":"Not configuring StatsD...","level":"info","timestamp":"2023-05-22T13:28:22.141Z"}
{"message":"Not configuring StatsD...","level":"info","timestamp":"2023-05-22T13:28:22.144Z"}
Setting http and https agent maxSockets to 25
(node:18) [DEP0148] DeprecationWarning: Use of deprecated folder mapping "./" in the "exports" field module resolution of the package at /snapshot/retool_development/node_modules/@tryretool/common/package.json.
Update this package.json to use a subpath pattern like "./*".
(Use `retool_backend --trace-deprecation ...` to show where the warning was created)
(node:18) [LRU_CACHE_UNBOUNDED] UnboundedCacheWarning: TTL caching without ttlAutopurge, max, or maxSize can result in unbounded memory consumption.
(node:18) [DEP0148] DeprecationWarning: Use of deprecated folder mapping "./" in the "exports" field module resolution of the package at /node_modules/@tryretool/workflowsBackend/package.json.
Update this package.json to use a subpath pattern like "./*".
(node:18) [DEP0148] DeprecationWarning: Use of deprecated folder mapping "./" in the "exports" field module resolution of the package at /node_modules/@tryretool/common/package.json.
Update this package.json to use a subpath pattern like "./*".
(node:18) [DEP0148] DeprecationWarning: Use of deprecated folder mapping "./" in the "exports" field module resolution of the package at /packages/common/package.json imported from /packages/common/build/workflows/types.js.
Update this package.json to use a subpath pattern like "./*".
(node:18) [LRU_CACHE_OPTION_maxAge] DeprecationWarning: The maxAge option is deprecated. Please use options.ttl instead.
(node:18) [DEP0148] DeprecationWarning: Use of deprecated folder mapping "./" in the "exports" field module resolution of the package at /packages/workflowsBackend/package.json imported from /packages/workflowsBackend/build/runBlocksLambdaHandler.js.
Update this package.json to use a subpath pattern like "./*".
(node:18) [DEP0111] DeprecationWarning: Access to process.binding('http_parser') is deprecated.
{"message":"Rechecking license status...","level":"info","timestamp":"2023-05-22T13:28:23.254Z"}
{"message":"license check http response code: 200","level":"info","timestamp":"2023-05-22T13:28:23.803Z"}
{"message":"License key feature flag overrides: {}","level":"info","timestamp":"2023-05-22T13:28:23.806Z"}
{"message":"Updated license status from licensing server","level":"info","timestamp":"2023-05-22T13:28:23.811Z"}
{"message":"installing temporal runtime","level":"info","timestamp":"2023-05-22T13:28:23.842Z"}
{"message":"creating temporal worker connection","level":"info","timestamp":"2023-05-22T13:28:23.844Z"}
{"message":"Scheduling UpdateTimedOutWorkflows","level":"info","timestamp":"2023-05-22T13:28:23.845Z"}
{"message":"creating temporal worker","level":"info","timestamp":"2023-05-22T13:28:23.962Z"}
{"label":"worker","level":"info","message":"Creating worker","timestamp":1684762103963,"options":{"namespace":"workflows","identity":"18@retool-workflows-test-retool-wf-workflow-worker-74948868d62k78z-workflows-workflows","shutdownGraceTime":"11 minute","maxConcurrentActivityTaskExecutions":10,"maxConcurrentLocalActivityExecutions":10,"enableNonLocalActivities":true,"maxConcurrentWorkflowTaskExecutions":10,"stickyQueueScheduleToStartTimeout":"10s","maxHeartbeatThrottleInterval":"60s","defaultHeartbeatThrottleInterval":"30s","isolateExecutionTimeout":"5s","workflowThreadPoolSize":8,"maxCachedWorkflows":200,"enableSDKTracing":false,"showStackTraceSources":false,"debugMode":false,"interceptors":{"activityInbound":[null,null,null],"workflowModules":["/snapshot/retool_development/node_modules/@temporalio/worker/lib/workflow-log-interceptor.js"]},"sinks":{"defaultWorkerLogger":{"trace":{},"debug":{},"info":{},"warn":{},"error":{}}},"bundlerOptions":{},"workflowBundle":{"codePath":"/snapshot/retool_development/backend/transpiled/temporal/workflowsExecutor/workflowsExecutor-workflows-bundle.js"},"activities":{},"taskQueue":"workflows","connection":{"nativeClient":{},"referenceHolders":{}},"shutdownGraceTimeMs":660000,"stickyQueueScheduleToStartTimeoutMs":10000,"isolateExecutionTimeoutMs":5000,"maxHeartbeatThrottleIntervalMs":60000,"defaultHeartbeatThrottleIntervalMs":30000,"loadedDataConverter":{"payloadConverter":{"converterByEncoding":{},"converters":[{"encodingType":"binary/null"},{"encodingType":"binary/plain"},{"encodingType":"json/plain"}]},"failureConverter":{"options":{"encodeCommonAttributes":false}},"payloadCodecs":[]}}}
{"label":"worker","level":"warn","message":"Ignoring WorkerOptions.bundlerOptions because WorkerOptions.workflowBundle is set","timestamp":1684762103963}
{"message":"Ran into error when scheduling UpdateTimedOutWorkflows ServiceError: Failed to connect to Temporal server","level":"error","timestamp":"2023-05-22T13:28:38.280Z"}
/snapshot/retool_development/node_modules/@temporalio/client/lib/connection.js:167
                            throw new errors_1.ServiceError('Failed to connect to Temporal server', { cause: err });
                                  ^

ServiceError: Failed to connect to Temporal server
    at /snapshot/retool_development/node_modules/@temporalio/client/lib/connection.js:167:35
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at runNextTicks (node:internal/process/task_queues:65:3)
    at processTimers (node:internal/timers:499:9)
    at async getTemporalClient (/snapshot/retool_development/backend/transpiled/temporal/common/client.js)
    at async startUpdateTimedOutWorkflows (/snapshot/retool_development/backend/transpiled/temporal/workflowsExecutor/client.js)
    at async Object.main (/snapshot/retool_development/backend/transpiled/temporal/workflowsExecutor/main.js) {
  cause: Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
      at Object.callErrorFromStatus (/snapshot/retool_development/node_modules/@temporalio/client/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
      at Object.onReceiveStatus (/snapshot/retool_development/node_modules/@temporalio/client/node_modules/@grpc/grpc-js/build/src/client.js:195:52)
      at /snapshot/retool_development/node_modules/@temporalio/client/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35
      at onReceiveStatus (/snapshot/retool_development/node_modules/@temporalio/client/lib/grpc-retry.js:134:25)
      at Object.onReceiveStatus (/snapshot/retool_development/node_modules/@temporalio/client/lib/grpc-retry.js:137:17)
      at InterceptingListenerImpl.onReceiveStatus (/snapshot/retool_development/node_modules/@temporalio/client/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)
      at Object.onReceiveStatus (/snapshot/retool_development/node_modules/@temporalio/client/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
      at Object.onReceiveStatus (/snapshot/retool_development/node_modules/@temporalio/client/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
      at /snapshot/retool_development/node_modules/@temporalio/client/node_modules/@grpc/grpc-js/build/src/call-stream.js:188:78
      at processTicksAndRejections (node:internal/process/task_queues:78:11)
      at runNextTicks (node:internal/process/task_queues:65:3)
      at processTimers (node:internal/timers:499:9)
  for call at
      at ServiceClientImpl.makeUnaryRequest (/snapshot/retool_development/node_modules/@temporalio/client/node_modules/@grpc/grpc-js/build/src/client.js:163:34)
      at Service.rpcImpl (/snapshot/retool_development/node_modules/@temporalio/client/lib/connection.js:206:27)
      at Service.rpcCall (/snapshot/retool_development/node_modules/@temporalio/proto/node_modules/protobufjs/src/rpc/service.js:94:21)
      at executor (/snapshot/retool_development/node_modules/@protobufjs/aspromise/index.js:44:16)
      at new Promise (<anonymous>)
      at Object.asPromise (/snapshot/retool_development/node_modules/@protobufjs/aspromise/index.js:28:12)
      at Service.rpcCall (/snapshot/retool_development/node_modules/@temporalio/proto/node_modules/protobufjs/src/rpc/service.js:86:21)
      at Service.getSystemInfo (eval at Codegen (/snapshot/retool_development/node_modules/@protobufjs/codegen/index.js:50:33), <anonymous>:4:15)
      at /snapshot/retool_development/node_modules/@temporalio/client/lib/connection.js:161:82
      at AsyncLocalStorage.run (node:async_hooks:322:14)
      at Connection.withDeadline (/snapshot/retool_development/node_modules/@temporalio/client/lib/connection.js:216:46)
      at /snapshot/retool_development/node_modules/@temporalio/client/lib/connection.js:161:32
      at processTicksAndRejections (node:internal/process/task_queues:96:5)
      at async getTemporalClient (/snapshot/retool_development/backend/transpiled/temporal/common/client.js)
      at async startUpdateTimedOutWorkflows (/snapshot/retool_development/backend/transpiled/temporal/workflowsExecutor/client.js)
      at async Object.main (/snapshot/retool_development/backend/transpiled/temporal/workflowsExecutor/main.js) {
    code: 4,
    details: 'Deadline exceeded',
    metadata: Metadata { internalRepr: Map(0) {}, options: {} }
  }
}

The pod then restarts as the command has exited with exit code 1, eventually going into CrashLoopBackOff.
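The log above mixes JSON entries with plain-text warnings and stack frames, which makes the one relevant error easy to miss. As a rough sketch (the function name `error_entries` is made up for illustration, and the sample lines are copied from the log above), the error-level JSON entries can be pulled out like this:

```python
import json


def error_entries(log_text):
    """Return the parsed JSON log entries whose level is "error"."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text lines: deprecation warnings, stack frames, etc.
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # a line that starts with "{" but is not a complete JSON object
        if isinstance(entry, dict) and entry.get("level") == "error":
            entries.append(entry)
    return entries


# Three lines taken from the worker log in this issue.
sample = "\n".join([
    '{"message":"creating temporal worker connection","level":"info","timestamp":"2023-05-22T13:28:23.844Z"}',
    "Setting http and https agent maxSockets to 25",
    '{"message":"Ran into error when scheduling UpdateTimedOutWorkflows ServiceError: Failed to connect to Temporal server","level":"error","timestamp":"2023-05-22T13:28:38.280Z"}',
])

for entry in error_entries(sample):
    print(entry["timestamp"], entry["message"])
```

In the log above, the error is reported about 14 seconds after "creating temporal worker connection", which fits a connection attempt that timed out (`DEADLINE_EXCEEDED`) rather than one that was refused outright.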

This suggests that the workflow-worker pod cannot connect to Temporal. However, if I port-forward to Temporal from my local machine, I can talk to it using tctl, so I think Temporal itself is working. I suspect this means something is wrong in the config being passed to the workflow-worker pod.
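A port-forward from a local machine goes through the Kubernetes API server, so it does not show whether pods inside the cluster can reach Temporal. One way to narrow this down is a plain TCP check run from a pod in the same namespace. The sketch below assumes Python is available in that pod; `can_connect` is a hypothetical helper, and the hostname in the comment is a placeholder, not the chart's real service name. Port 7233 is the Temporal frontend's default gRPC port.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (placeholder hostname; substitute the Temporal frontend
# service that the workflow-worker is configured to use):
#   can_connect("temporal-frontend.example.svc.cluster.local", 7233)
```

A `True` result only shows that the port is reachable; it does not prove that the gRPC handshake succeeds. A `False` result, though, would point at service naming, DNS, or network policy rather than at the Retool image.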

Relevant parts of my values passed to helm:

config:
  encryptionKey: ...
  jwtSecret: ...
  licenseKey: ...
  postgresql:
    db: retool_workflows_test
    host: mydatabase.myregion.rds.amazonaws.com
    password: ...
    port: 5432
    ssl_enabled: true
    user: retool_workflows_test
image:
  tag: 2.117.9
ingress:
  enabled: false
livenessProbe:
  enabled: true
  initialDelaySeconds: 200
postgresql:
  enabled: false
readinessProbe:
  enabled: true
resources:
  limits:
    cpu: 1000m
    memory: 2000Mi
  requests:
    cpu: 1000m
    memory: 2000Mi
retool-temporal-services-helm:
  server:
    config:
      persistence:
        default:
          sql:
            database: retool_workflows_test
            host: mydatabase.myregion.rds.amazonaws.com
            password: ...
            port: 5432
            tls:
              enabled: true
            user: retool_workflows_test
        visibility:
          sql:
            database: retool_workflows_test
            host: mydatabase.myregion.rds.amazonaws.com
            password: ...
            port: 5432
            tls:
              enabled: true
            user: retool_workflows_test
service:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: ...
    service.beta.kubernetes.io/aws-load-balancer-security-groups: ...
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: ...
  externalPort: 443
  internalPort: 3000
  type: LoadBalancer
workflows:
  enabled: true

And that's deployed with chart https://charts.retool.com/retool-wf-4.11.12.tgz on Kubernetes 1.24 (AWS EKS).

All pods other than the workflow-worker are ready, and are not restarting.

What can I do to debug and fix this?

plumdog commented 1 year ago

This appears to have been fixed by changing image.tag from 2.117.3 to 2.119.3.

I also tried the latest tag in the 2.117 series, which was 2.117.9, and this also failed, so I think something is broken with workflows in 2.117 that is fixed in 2.119. However, https://github.com/tryretool/retool-workflows-helm#usage says "The minimum supported image for Retool Workflows is 2.108.4", so I would expect 2.117.[latest] to work.

From some more checking:

| tag | works with workflows |
| --- | --- |
| 2.116.12 | :heavy_check_mark: |
| 2.117.3 | :x: |
| 2.117.9 | :x: |
| 2.119.1 | :heavy_check_mark: |
| 2.119.3 | :heavy_check_mark: |

(it appears there are no tags in the semver range between 2.117.9 and 2.119.1 on Docker Hub:

$ curl -s "https://hub.docker.com/v2/namespaces/tryretool/repositories/backend/tags?page_size=100" | jq -r '.results[] | .name' | grep -v -e 'latest$' -e '-enterprise$' | sort -rV | grep '^2\.11[7-9]'
2.119.3
2.119.2
2.119.1
2.117.9
2.117.8
2.117.7
2.117.6
2.117.5
2.117.4
2.117.3
2.117.2

)
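To make the reasoning about the failing range explicit, here is a small sketch that combines the test results with the tag listing above. The helper names (`version_key`, `failing_window`, `published_between`) are made up for illustration; the data is copied from this comment.

```python
def version_key(tag):
    """Turn a tag such as "2.117.9" into (2, 117, 9) so tags sort numerically."""
    return tuple(int(part) for part in tag.split("."))


# Results reported in this comment: True means workflows worked with that tag.
results = {
    "2.116.12": True,
    "2.117.3": False,
    "2.117.9": False,
    "2.119.1": True,
    "2.119.3": True,
}

# Tags returned by the Docker Hub query above.
published = [
    "2.119.3", "2.119.2", "2.119.1", "2.117.9", "2.117.8", "2.117.7",
    "2.117.6", "2.117.5", "2.117.4", "2.117.3", "2.117.2",
]


def failing_window(results):
    """Return (last good before, first bad, last bad, first good after), or None."""
    ordered = sorted(results, key=version_key)
    bad = [t for t in ordered if not results[t]]
    if not bad:
        return None
    first_bad, last_bad = bad[0], bad[-1]
    before = [t for t in ordered
              if results[t] and version_key(t) < version_key(first_bad)]
    after = [t for t in ordered
             if results[t] and version_key(t) > version_key(last_bad)]
    return (before[-1] if before else None, first_bad, last_bad,
            after[0] if after else None)


def published_between(published, low, high):
    """Return published tags strictly between low and high, oldest first."""
    lo, hi = version_key(low), version_key(high)
    return sorted((t for t in published if lo < version_key(t) < hi),
                  key=version_key)


window = failing_window(results)
print(window)  # ('2.116.12', '2.117.3', '2.117.9', '2.119.1')
print(published_between(published, window[2], window[3]))  # []
```

The empty list matches the observation that nothing was published between 2.117.9 and 2.119.1. At the lower end, `published_between(published, "2.116.12", "2.117.3")` returns `['2.117.2']`, a tag that was not tested, so the break may have started there. The listing was filtered to 2.117 through 2.119, so any later 2.116 tags would not appear in it.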

So perhaps this fix needs to be backported to the 2.117 release line. However, judging from https://updates.retool.com/en/whats-new-in-self_hosted-retool-2119-45487736, none of the "fixed" items appear to have anything to do with this, so it is hard to tell what is going on.