temporalio / temporal

Temporal service
https://docs.temporal.io
MIT License

Temporal is not able to connect to Cassandra even when one node is down in a cluster #2729

Open raghudeshu opened 2 years ago

raghudeshu commented 2 years ago

Expected Behavior

We have a 3-node Cassandra cluster. Even when one node is down, we expect Temporal to keep working by connecting to the other two nodes.

Actual Behavior

We are running a load test and observed that Temporal is not able to connect to Cassandra even when only ONE node is down.

Steps to Reproduce the Problem

  1. In the Cassandra cluster, take one node down. The Temporal pods are then not able to connect to the remaining two nodes.

Error: 2022/04/14 23:41:44 error: failed to connect to XX.XXX.XX.7:9042 due to error: write tcp XX.XXX.XX.79:44342->XX.XXX.XX.7:9042: write: connection reset by peer unable to dial control conn XX.XXX.XX.7:9042 gocql: no response received from cassandra within timeout period

Below is my configuration:

cassandra:
  hosts: ["XX.XXX.XX.7,XX.XXX.XX.9,XX.XXX.XX.10"]
  port: 9042
  keyspace: temporal
  user: "temporal"
  password: "XXXXXXX"
  existingSecret: ""
  replicationFactor: 3   # tried both 1 and 3
  consistency:
    default:
      consistency: "local_quorum"
      serialConsistency: "local_serial"
  tls:
    enabled: true
    enableHostVerification: false

Note: We are specifying the cluster hosts as comma-separated IPs. We tried a replication factor of both 1 and 3; neither worked.
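
To help isolate whether this is a driver-level issue or a Temporal configuration issue, here is a minimal standalone sketch that connects with the gocql driver directly, using the same contact points, consistency, and TLS settings as the configuration above. This is not Temporal's own connection code, and the IPs, keyspace, and credentials are placeholders. With a replication factor of 3 and LOCAL_QUORUM, a session created this way should still come up while one node is down:

// Standalone connection check using the gocql driver directly (not Temporal code).
// IPs, credentials, and keyspace are placeholders mirroring the config above.
package main

import (
    "log"
    "time"

    "github.com/gocql/gocql"
)

func main() {
    // Each contact point is passed as its own string.
    cluster := gocql.NewCluster("XX.XXX.XX.7", "XX.XXX.XX.9", "XX.XXX.XX.10")
    cluster.Port = 9042
    cluster.Keyspace = "temporal"
    cluster.Authenticator = gocql.PasswordAuthenticator{Username: "temporal", Password: "XXXXXXX"}
    cluster.Consistency = gocql.LocalQuorum
    cluster.SerialConsistency = gocql.LocalSerial
    // TLS is enabled in the reported config, with host verification disabled.
    cluster.SslOpts = &gocql.SslOptions{EnableHostVerification: false}
    cluster.ConnectTimeout = 10 * time.Second
    cluster.Timeout = 10 * time.Second

    session, err := cluster.CreateSession()
    if err != nil {
        // With RF=3 and LOCAL_QUORUM, this should still succeed while one node is down.
        log.Fatalf("unable to create session: %v", err)
    }
    defer session.Close()

    var release string
    if err := session.Query("SELECT release_version FROM system.local").Scan(&release); err != nil {
        log.Fatalf("query failed: %v", err)
    }
    log.Printf("connected; cassandra release_version=%s", release)
}

If this sketch connects fine with one node down but the Temporal pods do not, that points at how the hosts/TLS settings reach the server rather than at Cassandra itself.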

Specifications

Cassandra Configuration:

(screenshot of the Cassandra configuration was attached here)

yiminc commented 2 years ago

@raghudeshu, when Cassandra is overloaded, you might sometimes encounter that problem. If you lower your load, the Temporal server should be able to function even if Cassandra loses one node. Also, please upgrade to the latest version for your load test; there have been many improvements and bugfixes since 1.13.

johanforssell commented 1 year ago

I've had the exact same problem. Temporal 1.18.5.

Also, when all three nodes came back up, we saw the following errors in the Frontend servers. We've had these errors for days.

2023/05/02 10:29:51 gocql: unable to dial control conn x.x.x.133:9042: dial tcp x.x.x.165:9042: i/o timeout
2023/05/02 10:29:51 gocql: unable to dial control conn x.x.x.133:9042: dial tcp x.x.x.133:9042: i/o timeout
2023/05/02 10:29:51 gocql: unable to dial control conn x.x.x.133:9042: dial tcp x.x.x.197:9042: i/o timeout

More specifically:

gocql: unable to create session: unable to connect to initial hosts: dial tcp x.x.x.165:9042: i/o timeout

{"level":"error","ts":"2023-05-02T10:32:17.328Z","msg":"gocql wrapper: unable to refresh gocql session","error":"gocql: unable to create session: unable to connect to initial hosts: dial tcp x.x.x.165:9042: i/o timeout","logging-call-at":"session.go:99","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence/nosql/nosqlplugin/cassandra/gocql.(*session).refresh\n\t/home/builder/temporal/common/persistence/nosql/nosqlplugin/cassandra/gocql/session.go:99\ngo.temporal.io/server/common/persistence/nosql/nosqlplugin/cassandra/gocql.(*session).handleError\n\t/home/builder/temporal/common/persistence/nosql/nosqlplugin/cassandra/gocql/session.go:191\ngo.temporal.io/server/common/persistence/nosql/nosqlplugin/cassandra/gocql.(*iter).Close.func1\n\t/home/builder/temporal/common/persistence/nosql/nosqlplugin/cassandra/gocql/iter.go:56\ngo.temporal.io/server/common/persistence/nosql/nosqlplugin/cassandra/gocql.(*iter).Close\n\t/home/builder/temporal/common/persistence/nosql/nosqlplugin/cassandra/gocql/iter.go:58\ngo.temporal.io/server/common/persistence/cassandra.(*ClusterMetadataStore).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/cassandra/cluster_metadata_store.go:120\ngo.temporal.io/server/common/persistence.(*clusterMetadataManagerImpl).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/clusterMetadataStore.go:127\ngo.temporal.io/server/common/persistence.(*clusterMetadataRateLimitedPersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceRateLimitedClients.go:993\ngo.temporal.io/server/common/persistence.(*clusterMetadataPersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1393\ngo.temporal.io/server/common/persistence.(*clusterMetadataRetryablePersistenceClient).ListClusterMetadata.func1\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:972\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/persistence.(*clusterMetadataRetryablePersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:976\ngo.temporal.io/server/common/cluster.(*metadataImpl).listAllClusterMetadataFromDB.func1\n\t/home/builder/temporal/common/cluster/metadata.go:517\ngo.temporal.io/server/common/collection.(*PagingIteratorImpl[...]).getNextPage\n\t/home/builder/temporal/common/collection/pagingIterator.go:116\ngo.temporal.io/server/common/collection.NewPagingIterator[...]\n\t/home/builder/temporal/common/collection/pagingIterator.go:52\ngo.temporal.io/server/common/cluster.(*metadataImpl).listAllClusterMetadataFromDB\n\t/home/builder/temporal/common/cluster/metadata.go:534\ngo.temporal.io/server/common/cluster.(*metadataImpl).refreshClusterMetadata\n\t/home/builder/temporal/common/cluster/metadata.go:404\ngo.temporal.io/server/common/cluster.(*metadataImpl).refreshLoop\n\t/home/builder/temporal/common/cluster/metadata.go:391\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\t/home/builder/temporal/internal/goro/goro.go:64"}

And operation ListClusterMetadata encountered gocql: no hosts available in the pool

{"level":"error","ts":"2023-05-02T10:32:17.329Z","msg":"Operation failed with internal error.","error":"operation ListClusterMetadata encountered gocql: no hosts available in the pool","metric-scope":81,"logging-call-at":"persistenceMetricClients.go:1579","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1579\ngo.temporal.io/server/common/persistence.(*clusterMetadataPersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1397\ngo.temporal.io/server/common/persistence.(*clusterMetadataRetryablePersistenceClient).ListClusterMetadata.func1\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:972\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/persistence.(*clusterMetadataRetryablePersistenceClient).ListClusterMetadata\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:976\ngo.temporal.io/server/common/cluster.(*metadataImpl).listAllClusterMetadataFromDB.func1\n\t/home/builder/temporal/common/cluster/metadata.go:517\ngo.temporal.io/server/common/collection.(*PagingIteratorImpl[...]).getNextPage\n\t/home/builder/temporal/common/collection/pagingIterator.go:116\ngo.temporal.io/server/common/collection.NewPagingIterator[...]\n\t/home/builder/temporal/common/collection/pagingIterator.go:52\ngo.temporal.io/server/common/cluster.(*metadataImpl).listAllClusterMetadataFromDB\n\t/home/builder/temporal/common/cluster/metadata.go:534\ngo.temporal.io/server/common/cluster.(*metadataImpl).refreshClusterMetadata\n\t/home/builder/temporal/common/cluster/metadata.go:404\ngo.temporal.io/server/common/cluster.(*metadataImpl).refreshLoop\n\t/home/builder/temporal/common/cluster/metadata.go:391\ngo.temporal.io/server/internal/goro.(*Handle).Go.func1\n\t/home/builder/temporal/internal/goro/goro.go:64"}

hema-kishore-gunda commented 1 year ago

We have hit the exact same problem. We have a 3-node Cassandra cluster with a pod disruption budget of 1. During a cluster upgrade, 2 out of the 3 Cassandra nodes were available and the Temporal server pods were restarted as part of the upgrade. The old pods were terminated, and the new pods failed to come up without any error message in the logs. The server pods only started working once all the Cassandra nodes had been completely restarted.

This issue exists on the latest version (1.20.2) of the Temporal server as well.

yiminc commented 1 year ago

It may be fixed by gocql v1.4.0 (https://github.com/gocql/gocql/releases/tag/v1.4.0), which is included in Temporal server v1.21.5.

mustaFAB53 commented 1 year ago

We are connecting to an existing Cassandra setup and are seeing this issue even with the latest Temporal image, tag v1.21.5.

yiminc commented 1 year ago

From the error message, it seems that even though the config provided multiple IP addresses as hosts, gocql still fails to connect when a single host is down. We log the error when we try to reconnect. From the message above, "gocql wrapper: unable to refresh gocql session","error":"gocql: unable to create session: unable to connect to initial hosts: dial tcp x.x.x.165:9042: i/o timeout", it seems that either only one host is configured or the gocql driver is not working as expected.
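
One way to narrow that down (purely illustrative, not Temporal's code): split the configured comma-separated host string and dial each node's CQL port individually from the frontend pod's network, to see whether all three contact points are actually reachable or effectively only one is being used. The host string below is a placeholder mirroring the reported configuration:

// Reachability check for each configured contact point (illustrative only).
package main

import (
    "log"
    "net"
    "strings"
    "time"
)

func main() {
    hosts := "XX.XXX.XX.7,XX.XXX.XX.9,XX.XXX.XX.10" // placeholder, mirrors the reported config
    for _, h := range strings.Split(hosts, ",") {
        addr := net.JoinHostPort(strings.TrimSpace(h), "9042")
        conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
        if err != nil {
            log.Printf("%s: unreachable: %v", addr, err)
            continue
        }
        conn.Close()
        log.Printf("%s: reachable", addr)
    }
}

If all three addresses are reachable from the pod but the driver still reports only a single initial host, that would point at the configuration or driver path rather than the network.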