Open emmercm opened 1 year ago
I was able to reproduce this locally with Docker Compose. The steps were:
temporalio/auto-setup:1.19.1 containers, both backed by MySQL
replication_tasks rows and ~39k executions rows (some of the workflows timed out)
executions and current_executions tables
replication_tasks rows
Then, to see what would happen with newly started workflows with both clusters running, I did:
Carried over from this community thread: https://community.temporal.io/t/what-is-the-correct-way-to-disable-re-enable-multi-cluster-replication/8216?u=emmercm
Expected Behavior
When two clusters are replicating to each other, and one is taken offline for an extended period of time (longer than namespace retention windows), then when the cluster is brought back online it should catch up on replication.
Actual Behavior
No workflow history replication is occurring, including workflows newly started after the secondary cluster was brought back online.
The trio of error logs that I see constantly coming from the primary cluster's history service are, in order:
The Persistent fetch operation Failure error seems to be the root problem. I would have expected shard.Context.GetWorkflowExecution() to return serviceerror.NotFound if the old workflows couldn't be found, though, so I'm confused by that.
Some other metrics, carried over from the linked community thread:
persistence_error_with_type{operation="getreplicationtasks"} with error_type="serviceerrorunavailable" is emitting at a fairly constant rate
replication_tasks_fetched is a flat zero
replication_tasks is only being INSERTed to, never DELETEd from. It has >2.3mil rows.
I'm happy to gather any other metrics that would help debug the issue.
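To make the suspected failure mode concrete, here is a minimal sketch of how an ordered range read can stall on its oldest rows. It runs the replication_tasks query quoted in this issue against an in-memory SQLite table instead of MySQL. The reader function `poll_once`, the `expired_below` cutoff, and the rule that a failed task leaves the cursor where it was are all assumptions made for illustration. They are not taken from the Temporal source.

```python
import sqlite3

# Query shape quoted in this issue, unchanged.
QUERY = (
    "SELECT task_id, data, data_encoding FROM replication_tasks "
    "WHERE shard_id = ? AND task_id >= ? AND task_id < ? "
    "ORDER BY task_id LIMIT ?"
)


def make_db(num_tasks):
    """Create an in-memory stand-in for the replication_tasks table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE replication_tasks ("
        "shard_id INTEGER, task_id INTEGER, data BLOB, data_encoding TEXT, "
        "PRIMARY KEY (shard_id, task_id))"
    )
    # 'example' is a placeholder encoding name, not a real Temporal value.
    conn.executemany(
        "INSERT INTO replication_tasks VALUES (1, ?, ?, 'example')",
        [(i, b"payload") for i in range(1, num_tasks + 1)],
    )
    return conn


def poll_once(conn, cursor, max_task_id, batch_size, expired_below):
    """Hypothetical reader: fetch one batch and process it in task_id order.

    ASSUMPTION: a task whose workflow was already removed by retention
    (task_id < expired_below) fails the whole poll, and the cursor is
    not advanced. Returns (new_cursor, tasks_fetched).
    """
    rows = conn.execute(QUERY, (1, cursor, max_task_id, batch_size)).fetchall()
    next_cursor = cursor
    fetched = 0
    for task_id, _data, _encoding in rows:
        if task_id < expired_below:
            # Stand-in for the failing fetch described in this issue.
            return cursor, 0
        fetched += 1
        next_cursor = task_id + 1
    return next_cursor, fetched


if __name__ == "__main__":
    conn = make_db(num_tasks=1000)
    cursor = 1
    for attempt in range(5):
        cursor, fetched = poll_once(conn, cursor, 1001, 100, expired_below=500)
        print(f"attempt={attempt} cursor={cursor} fetched={fetched}")
```

Under these assumptions every poll starts from the same oldest task, fails on it, and reports zero fetched tasks. Nothing is deleted, so new rows keep piling up behind the stuck ones. That would be consistent with the flat replication_tasks_fetched metric and the growing replication_tasks table, though the actual server logic may differ.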
Given the ORDER BY on SELECT task_id, data, data_encoding FROM replication_tasks WHERE shard_id = ? AND task_id >= ? AND task_id < ? ORDER BY task_id LIMIT ?, I don't think this will ever resolve on its own.
Steps to Reproduce the Problem
Specifications
temporalio/server:1.19.1, docker.io/temporalio/server@sha256:c8a5cdb7c78d26c9d611ce19abb62733dfe5480e02d40a39968bd9b2ab8b45c2