uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
https://cadenceworkflow.io
MIT License

Error refreshing domain cache #4933

Open vivek0079 opened 1 year ago

vivek0079 commented 1 year ago

Version of Cadence server and client (which language)

This is very important to root-cause bugs.

Describe the bug

The Cadence server is not able to refresh the domain cache when the Cassandra domain changes.

To Reproduce

Is the issue reproducible?

Steps to reproduce the behaviour:

  1. Start the cadence server along with Cassandra DB
  2. Rotate the Cassandra pods [imagine any pod issue]
  3. Now cadence will throw {"level":"error","msg":"Error refreshing domain cache","service":"cadence-frontend","error":"gocql: no hosts available in the pool","logging-call-at":"domainCache.go:401","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshLoop\n\t/cadence/common/cache/domainCache.go:401"}
  4. After this, Cadence is not able to connect to the Cassandra pods, because the domain of the Cassandra pods changes when the pods are rotated

Expected behaviour

  1. When the Cassandra pods are rotated, the domain cache in Cadence should be updated

Screenshots

Logs:

{"level":"error","ts":"2022-08-09T10:06:26.332Z","msg":"Operation failed with internal error.","service":"cadence-frontend","error":"gocql: no hosts available in the pool","metric-scope":42,"logging-call-at":"persistenceMetricClients.go:812","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*metadataPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:812\ngithub.com/uber/cadence/common/persistence.(*metadataPersistenceClient).GetMetadata\n\t/cadence/common/persistence/persistenceMetricClients.go:790\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshDomainsLocked\n\t/cadence/common/cache/domainCache.go:425\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshDomains\n\t/cadence/common/cache/domainCache.go:412\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshLoop\n\t/cadence/common/cache/domainCache.go:396"}

{"level":"error","ts":"2022-08-09T10:06:26.332Z","msg":"Error refreshing domain cache","service":"cadence-frontend","error":"gocql: no hosts available in the pool","logging-call-at":"domainCache.go:401","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshLoop\n\t/cadence/common/cache/domainCache.go:401"}

Additional context

Add any other context about the problem here, e.g. stack trace, workflow history.

talha-naeem1 commented 2 months ago

I'm facing this issue:

{"level":"error","ts":"2024-04-19T07:05:05.519Z","msg":"Error refreshing domain cache","service":"cadence-matching","error":"ListDomains timed out. Failed to get domain rows. Error: context deadline exceeded","logging-call-at":"domainCache.go:425","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:131\ngithub.com/uber/cadence/common/cache.(*domainCache).refreshLoop\n\t/cadence/common/cache/domainCache.go:425"}

Did someone find anything related to this?

demirkayaender commented 2 months ago

Looks like the query is failing on the storage layer. How are the metrics for your Cassandra (or whatever storage you are using) looking? You might need to scale your storage up or out.

Apart from this, just to check whether your storage is running at all: are you able to run workflows?