Open mindaugasrukas opened 1 year ago
The same issue has been reported for Temporal CLI version 0.5.0 (server 1.20.0, UI 2.10.3):
{"level":"error","ts":"2023-02-22T08:56:41.887-0800","msg":"Operation failed with internal error.","error":"ListNamespaces operation failed. Failed to get namespace rows. Error: SQL logic error: no such table: namespaces (1)","operation":"ListNamespaces","logging-call-at":"persistenceMetricClients.go:1171","stacktrace":"go.temporal.io/server/common/log.(zapLogger).Error\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/log/zap_logger.go:150\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceMetricClients.go:1171\ngo.temporal.io/server/common/persistence.(metricEmitter).recordRequestMetrics\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceMetricClients.go:1148\ngo.temporal.io/server/common/persistence.(metadataPersistenceClient).ListNamespaces.func1\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceMetricClients.go:683\ngo.temporal.io/server/common/persistence.(metadataPersistenceClient).ListNamespaces\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceMetricClients.go:685\ngo.temporal.io/server/common/persistence.(metadataRetryablePersistenceClient).ListNamespaces.func1\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceRetryableClients.go:887\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/backoff/retry.go:199\ngo.temporal.io/server/common/persistence.(metadataRetryablePersistenceClient).ListNamespaces\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceRetryableClients.go:891\ngo.temporal.io/server/common/namespace.(registry).refreshNamespaces\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/namespace/registry.go:386\ngo.temporal.io/server/common/namespace.(registry).refreshLoop\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/namespace/registry.go:357\ngo.temporal.io/server/internal/goro.(Handle).Go.func1\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/internal/goro/goro.go:64"}
{"level":"error","ts":"2023-02-22T08:56:41.892-0800","msg":"Operation failed with internal error.","error":"GetTaskQueue operation failed. Failed to check if task queue default-worker-tq of type Workflow existed. Error: SQL logic error: no such table: task_queues (1)","operation":"GetTaskQueue","logging-call-at":"persistenceMetricClients.go:1171","stacktrace":"go.temporal.io/server/common/log.(zapLogger).Error\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/log/zap_logger.go:150\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceMetricClients.go:1171\ngo.temporal.io/server/common/persistence.(metricEmitter).recordRequestMetrics\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceMetricClients.go:1148\ngo.temporal.io/server/common/persistence.(taskPersistenceClient).GetTaskQueue.func1\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceMetricClients.go:567\ngo.temporal.io/server/common/persistence.(taskPersistenceClient).GetTaskQueue\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/persistence/persistenceMetricClients.go:569\ngo.temporal.io/server/service/matching.(taskQueueDB).takeOverTaskQueueLocked\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/service/matching/db.go:123\ngo.temporal.io/server/service/matching.(taskQueueDB).RenewLease\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/service/matching/db.go:109\ngo.temporal.io/server/service/matching.(taskWriter).renewLeaseWithRetry.func1\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/service/matching/taskWriter.go:302\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/common/backoff/retry.go:199\ngo.temporal.io/server/service/matching.(taskWriter).renewLeaseWithRetry\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/service/matching/taskWriter.go:306\ngo.temporal.io/server/service/matching.(taskWriter).initReadWriteState\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/service/matching/taskWriter.go:131\ngo.temporal.io/server/service/matching.(taskWriter).taskWriterLoop\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/service/matching/taskWriter.go:221\ngo.temporal.io/server/internal/goro.(Handle).Go.func1\n\tgo.temporal.io/server@v1.18.1-0.20230207023301-52c3a9eefb06/internal/goro/goro.go:64"}
Posting some context from @yiminc:
Note that if the last database connection in the pool closes, the in-memory database is deleted. Make sure the max idle connection limit is > 0, and the connection lifetime is infinite.
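To make that concrete, here is a minimal Go sketch of what those pool settings look like with database/sql. This is not Temporal's actual code; the SQLite driver import and the DSN are assumptions, but the three pool knobs are the ones referred to above.

```go
package main

import (
	"database/sql"
	"log"

	_ "modernc.org/sqlite" // driver choice is an assumption; any SQLite driver behaves similarly
)

func main() {
	// A shared in-memory SQLite database lives only as long as at least one
	// connection to it is open; when the last connection closes, every table is gone.
	db, err := sql.Open("sqlite", "file::memory:?cache=shared")
	if err != nil {
		log.Fatal(err)
	}

	// Keep the pool from ever dropping to zero open connections.
	db.SetMaxIdleConns(1)    // must be > 0, otherwise idle connections are closed
	db.SetConnMaxLifetime(0) // 0 = connections are never closed due to age
	db.SetConnMaxIdleTime(0) // 0 = connections are never closed for being idle

	if _, err := db.Exec(`CREATE TABLE namespaces (id TEXT PRIMARY KEY)`); err != nil {
		log.Fatal(err)
	}
	// With the settings above, later queries keep seeing the table. If the pool
	// were allowed to drain, they could instead fail with "no such table: namespaces".
}
```

The specific driver doesn't matter much; the point is that as long as at least one pooled connection stays open, the shared in-memory database and its tables survive.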
I have a similar issue:
error while fetching cluster metadata: operation GetClusterMetadata encountered table cluster_metadata_info does not exist
Linking for visibility: https://github.com/temporalio/cli/issues/124
I've been observing multiple flakes of this error message in the TS SDK's integration tests recently: 11 times in the last 3 weeks, versus none before that (as far as I can see in the GHA logs).
In the context of those CI jobs, it only happens with the CLI Dev Server started at the GHA job level (i.e. not with Dev Server instances started using the SDK's built-in TestWorkflowEnvironment), using CLI 0.12.0 and 0.13.2. Interestingly, 9 times out of 11, the "error" started at almost the same place during the tests, in the "Worker Lifecycle" tests.
I have modified the CI workflow to retain the server's logs on failure, so hopefully I'll be able to provide more data on this soon.
Yesterday I launched the single process for development: temporal server start-dev. Usually I keep that running for a couple of days without any issues, but today I got an HTTP 503 response on the web UI, so I had to restart the process. I'm still trying to figure out how to reproduce this, or whether it's a real issue, so I'm leaving it here for the record in case it repeats or we can better understand the problem.
Some log snippets:
Expected Behavior
No issues.
Actual Behavior
A single process failed due to a missing DB table.
Steps to Reproduce the Problem
Unknown. I was not able to construct reproducible steps.
What I did initially:
% temporal server start-dev
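For reference, here is a hypothetical Go sketch (not the dev server's actual code; the driver, DSN, and table name are just for illustration) of the SQLite behavior @yiminc describes above, which would produce exactly this kind of "no such table" error:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "modernc.org/sqlite" // driver choice is an assumption
)

func main() {
	db, err := sql.Open("sqlite", "file::memory:?cache=shared")
	if err != nil {
		panic(err)
	}
	db.SetMaxIdleConns(0) // allow the pool to close every connection once idle

	if _, err := db.Exec(`CREATE TABLE namespaces (id TEXT)`); err != nil {
		panic(err)
	}
	// After Exec returns, its connection is released and, with no idle slots,
	// closed. That was the last open connection, so the shared in-memory
	// database (and the namespaces table with it) is deleted.

	var n int
	err = db.QueryRow(`SELECT count(*) FROM namespaces`).Scan(&n)
	fmt.Println(err) // typically: SQL logic error: no such table: namespaces
}
```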
Specifications