Open · raghudeshu opened this issue 2 years ago
@raghudeshu, when Cassandra is overloaded, you might sometimes encounter this problem. If you lower your load, the Temporal server should be able to function even if Cassandra loses one node. Also, please upgrade to the latest version for your load test; there have been many improvements and bugfixes since 1.13.
I've had the exact same problem. Temporal 1.18.5.
Also, when all three nodes came back up, we still see the following errors in the Frontend servers. We've had these errors for days.
2023/05/02 10:29:51 gocql: unable to dial control conn x.x.x.133:9042: dial tcp x.x.x.165:9042: i/o timeout
2023/05/02 10:29:51 gocql: unable to dial control conn x.x.x.133:9042: dial tcp x.x.x.133:9042: i/o timeout
2023/05/02 10:29:51 gocql: unable to dial control conn x.x.x.133:9042: dial tcp x.x.x.197:9042: i/o timeout
More specifically:
gocql: unable to create session: unable to connect to initial hosts: dial tcp x.x.x.165:9042: i/o timeout
{"level":"error","ts":"2023-05-02T10:32:17.328Z","msg":"gocql wrapper: unable to refresh gocql session","error":"gocql: unable to create session: unable to connect to initial hosts: dial tcp x.x.x.165:9042: i/o timeout","logging-call-at":"session.go:99","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence/nosql/nosqlplugin/cassandra/gocql.(*session).refresh\n\t/home/builder/temporal/common/persistence/nosql/nosqlplugin/cassandra/gocql/session.go:99\ngo.temporal.io/server/common/persistence/nosql/nosqlplugin/cassandra/gocql.(*session).handleError\n\t/home/builder/temporal/common/persistence/nosql/nosqlplugin/cassandra/gocql/session.go:191\ngo.temporal.io/server/common/persistence/nosql/nosqlplugin/cassandra/gocql.(*iter).Close.func1\n\t/home/builder/temporal/common/persistence/nosql/nosqlplugin/cassandra/gocql/iter.go:56\ngo.temporal.io/server/common/persistence/nosql/nosqlplugin/cassandra/gocql.(*iter).Close\n\t/home/builder/temporal/common/persistence/nosql/nosqlplugin/cassandra/gocql/iter.go:58\ngo.temporal.io/server/common/persistence/cassandra.(*ClusterMetadataStore).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/cassandra/cluster_metadata_store.go:120\ngo.temporal.io/server/common/persistence.(*clusterMetadataManagerImpl).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/clusterMetadataStore.go:127\ngo.temporal.io/server/common/persistence.(*clusterMetadataRateLimitedPersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceRateLimitedClients.go:993\ngo.temporal.io/server/common/persistence.(*clusterMetadataPersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1393\ngo.temporal.io/server/common/persistence.(*clusterMetadataRetryablePersistenceClient).ListClusterMetadata.func1\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:972\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/persistence.(*clusterMetadataRetryablePersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:976\ngo.temporal.io/server/common/cluster.(*metadataImpl).listAllClusterMetadataFromDB.func1\n\t/home/builder/temporal/common/cluster/metadata.go:517\ngo.temporal.io/server/common/collection.(*PagingIteratorImpl[...]).getNextPage\n\t/home/builder/temporal/common/collection/pagingIterator.go:116\ngo.temporal.io/server/common/collection.NewPagingIterator[...]\n\t/home/builder/temporal/common/collection/pagingIterator.go:52\ngo.temporal.io/server/common/cluster.(*metadataImpl).listAllClusterMetadataFromDB\n\t/home/builder/temporal/common/cluster/metadata.go:534\ngo.temporal.io/server/common/cluster.(*metadataImpl).refreshClusterMetadata\n\t/home/builder/temporal/common/cluster/metadata.go:404\ngo.temporal.io/server/common/cluster.(*metadataImpl).refreshLoop\n\t/home/builder/temporal/common/cluster/metadata.go:391\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\t/home/builder/temporal/internal/goro/goro.go:64"}
And the operation ListClusterMetadata encountered gocql: no hosts available in the pool:
{"level":"error","ts":"2023-05-02T10:32:17.329Z","msg":"Operation failed with internal error.","error":"operation ListClusterMetadata encountered gocql: no hosts available in the pool","metric-scope":81,"logging-call-at":"persistenceMetricClients.go:1579","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1579\ngo.temporal.io/server/common/persistence.(*clusterMetadataPersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1397\ngo.temporal.io/server/common/persistence.(*clusterMetadataRetryablePersistenceClient).ListClusterMetadata.func1\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:972\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/persistence.(*clusterMetadataRetryablePersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:976\ngo.temporal.io/server/common/cluster.(*metadataImpl).listAllClusterMetadataFromDB.func1\n\t/home/builder/temporal/common/cluster/metadata.go:517\ngo.temporal.io/server/common/collection.(*PagingIteratorImpl[...]).getNextPage\n\t/home/builder/temporal/common/collection/pagingIterator.go:116\ngo.temporal.io/server/common/collection.NewPagingIterator[...]\n\t/home/builder/temporal/common/collection/pagingIterator.go:52\ngo.temporal.io/server/common/cluster.(*metadataImpl).listAllClusterMetadataFromDB\n\t/home/builder/temporal/common/cluster/metadata.go:534\ngo.temporal.io/server/common/cluster.(*metadataImpl).refreshClusterMetadata\n\t/home/builder/temporal/common/cluster/metadata.go:404\ngo.temporal.io/server/common/cluster.(*metadataImpl).refreshLoop\n\t/home/builder/temporal/common/cluster/metadata.go:391\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\t/home/builder/temporal/internal/goro/goro.go:64"}
We have hit the exact same problem. We run a 3-node Cassandra cluster with a pod disruption budget that allows 1 unavailable pod. During a cluster upgrade, only 2 of the 3 Cassandra nodes were available, and the Temporal server pods were restarted as part of the upgrade. The old pods were terminated and the new pods failed to start, without any error message in the logs. The server pods only started working once all the Cassandra nodes had been completely restarted.
This issue exists on the latest version (1.20.2) of the Temporal server as well.
It may be fixed by gocql 1.4.0 (https://github.com/gocql/gocql/releases/tag/v1.4.0), which is included in Temporal server v1.21.5.
We are connecting to an existing Cassandra setup and are still seeing this issue with the latest Temporal image, tag v1.21.5.
From the error message, it seems that even though the config provides multiple IP addresses as hosts, gocql still fails to connect when a single host is down. We log the error when we try to reconnect. From the message above, "gocql wrapper: unable to refresh gocql session","error":"gocql: unable to create session: unable to connect to initial hosts: dial tcp x.x.x.165:9042: i/o timeout", it seems that either only one host is configured or the gocql driver is not working as expected.
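For what it's worth, handing several contact points to gocql directly can help narrow down whether the driver or the Temporal config is at fault. Below is a minimal standalone sketch (hypothetical addresses standing in for the redacted x.x.x.* hosts; TLS and auth omitted): if the driver is behaving as expected, the session should still come up when one of the initial hosts is unreachable.

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical addresses standing in for the redacted x.x.x.* hosts.
	// All three contact points are passed to the driver; gocql only needs
	// one of them to be reachable to establish the control connection.
	cluster := gocql.NewCluster("10.0.0.133", "10.0.0.165", "10.0.0.197")
	cluster.Port = 9042
	cluster.Keyspace = "temporal"
	cluster.Consistency = gocql.LocalQuorum
	cluster.Timeout = 10 * time.Second
	cluster.ConnectTimeout = 10 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		// With one node down this should still succeed; if it fails with
		// "unable to connect to initial hosts", the driver (or the network)
		// is not falling back to the remaining contact points.
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()
	log.Println("session established")
}
```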
Expected Behavior
We have a 3-node Cassandra cluster. Even when one node is down, we expect Temporal to keep working by connecting to the other two nodes.
Actual Behavior
We are running a load test and observed that Temporal is not able to connect to Cassandra even when only ONE node is down.
Steps to Reproduce the Problem
Error: 2022/04/14 23:41:44 error: failed to connect to XX.XXX.XX.7:9042 due to error: write tcp XX.XXX.XX.79:44342->XX.XXX.XX.7:9042: write: connection reset by peer unable to dial control conn XX.XXX.XX.7:9042 gocql: no response received from cassandra within timeout period
Below is my Configuration:
```yaml
cassandra:
  hosts: ["XX.XXX.XX.7,XX.XXX.XX.9,XX.XXX.XX.10"]
  port: 9042
  keyspace: temporal
  user: "temporal"
  password: "XXXXXXX"
  existingSecret: ""
  replicationFactor: 3   # tried both 1 and 3
  consistency:
    default:
      consistency: "local_quorum"
      serialConsistency: "local_serial"
  tls:
    enabled: true
    enableHostVerification: false
```
Note: We are specifying the cluster info as comma-separated IPs. We also tried updating the replication factor to both 1 and 3; neither worked.
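As a side check, whether a single-node outage is survivable also depends on the keyspace replication: with a replication factor of 1 the data owned by the down node is simply unavailable at any consistency level, while with a factor of 3 a LOCAL_QUORUM read only needs 2 of the 3 replicas. A rough sketch of that check (hypothetical addresses, TLS and auth omitted, and the table name is only a guess based on the ListClusterMetadata calls in the stack trace), run while one node is stopped:

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder addresses copied from the config above; adjust to the real cluster.
	cluster := gocql.NewCluster("XX.XXX.XX.7", "XX.XXX.XX.9", "XX.XXX.XX.10")
	cluster.Keyspace = "temporal"
	cluster.Consistency = gocql.LocalQuorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()

	// With replication factor 3, this LOCAL_QUORUM read should succeed with one
	// node down (2 of 3 replicas answer). With replication factor 1, it can fail
	// for any partition whose single replica lives on the stopped node.
	// The table name is an assumption; substitute any table from the temporal keyspace.
	var count int
	if err := session.Query(`SELECT COUNT(*) FROM cluster_metadata_info`).Scan(&count); err != nil {
		log.Fatalf("LOCAL_QUORUM read failed: %v", err)
	}
	log.Printf("LOCAL_QUORUM read succeeded, rows: %d", count)
}
```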
Specifications
Cassandra Configuration: