
Replication tasks referencing archived workflow executions can't be processed, blocking all replication #4348


emmercm commented 1 year ago

Carried over from this community thread: https://community.temporal.io/t/what-is-the-correct-way-to-disable-re-enable-multi-cluster-replication/8216?u=emmercm

Expected Behavior

When two clusters are replicating to each other and one is taken offline for an extended period (longer than the namespace retention window), the offline cluster should catch up on replication once it is brought back online.

Actual Behavior

No workflow history replication is occurring, including for workflows newly started after the secondary cluster was brought back online.

The trio of error logs that I see constantly coming from the primary cluster's history service are, in order:

The Persistent fetch operation Failure error seems to be the root problem. I would have expected shard.Context.GetWorkflowExecution() to return serviceerror.NotFound if the old workflows couldn't be found, though, so I'm confused by that.
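For illustration, a minimal Go sketch of the handling I expected around that call; handleFetchError is a hypothetical helper (not actual server code), but serviceerror.NotFound is the real type from go.temporal.io/api/serviceerror:

```go
package replication

import (
	"errors"
	"fmt"

	"go.temporal.io/api/serviceerror"
)

// handleFetchError is a hypothetical helper showing the behavior I expected:
// if the execution referenced by a replication task is already gone (e.g.
// deleted past retention), the task should be treated as skippable rather
// than retried forever as a persistence failure.
func handleFetchError(err error) error {
	var notFound *serviceerror.NotFound
	if errors.As(err, &notFound) {
		// Execution no longer exists; ack/drop the replication task.
		return nil
	}
	// Anything else is a genuine persistence failure worth retrying.
	return fmt.Errorf("persistent fetch operation failure: %w", err)
}
```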

Some other metrics, carried over from the linked community thread:

I'm happy to gather any other metrics that would help debug the issue.

Given that replication tasks are fetched strictly in ascending task_id order (SELECT task_id, data, data_encoding FROM replication_tasks WHERE shard_id = ? AND task_id >= ? AND task_id < ? ORDER BY task_id LIMIT ?), a task that can never be processed blocks every task queued behind it in the same shard, so I don't think this will ever resolve on its own.
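For anyone who wants to poke at the backlog directly, a minimal sketch (the DSN is a placeholder; it assumes read access to the primary cluster's default MySQL schema) that reports the oldest pending task per shard:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; point this at the primary cluster's temporal database.
	db, err := sql.Open("mysql", "temporal:temporal@tcp(localhost:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The lowest task_id per shard is the task blocking everything behind it.
	rows, err := db.Query(`SELECT shard_id, MIN(task_id), COUNT(*)
	                       FROM replication_tasks GROUP BY shard_id`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var shardID, minTaskID, pending int64
		if err := rows.Scan(&shardID, &minTaskID, &pending); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("shard %d: oldest pending task %d (%d queued)\n", shardID, minTaskID, pending)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```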

Steps to Reproduce the Problem

  1. Have two Temporal clusters:
    1. With 512 history shards
    2. In multi-cluster replication, and observe it is working as expected
  2. Have all namespaces with:
    1. A default 72h retention period
    2. The default 4 task queue partitions
  3. Have all namespaces active in the "primary" cluster, none active in the "secondary" cluster (a sketch for verifying this follows the list)
  4. Scale the secondary cluster down to zero replicas
  5. Wait an extended period of time, e.g. 2 weeks
    1. During this time, the primary cluster is still processing workflows, at a rate of ~240/hour for a total of ~140k completed while the secondary cluster is offline
  6. Scale the secondary cluster back above zero replicas
  7. Observe that no replication is occurring, based on the metrics above
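For reference, a minimal sketch (address and namespace are placeholders) for checking which cluster a namespace is active in, using the Go SDK's raw WorkflowService client:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder address; point this at either cluster's frontend.
	c, err := client.Dial(client.Options{HostPort: "localhost:7233"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	resp, err := c.WorkflowService().DescribeNamespace(context.Background(),
		&workflowservice.DescribeNamespaceRequest{Namespace: "default"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("active cluster: %s\n", resp.GetReplicationConfig().GetActiveClusterName())
	for _, cl := range resp.GetReplicationConfig().GetClusters() {
		fmt.Printf("replicating to: %s\n", cl.GetClusterName())
	}
}
```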

Specifications

emmercm commented 1 year ago

I was able to reproduce this locally with Docker Compose. The steps were:

  1. Have two temporalio/auto-setup:1.19.1 containers, both backed by MySQL
  2. Set up multi-cluster replication between the two
  3. Start a workflow worker application that is connected to the primary cluster only
  4. Start 100 workflows in the primary cluster, have the worker application process & complete them (a minimal starter/worker sketch follows this list)
  5. Observe the workflow history is replicated to the secondary cluster, via the secondary cluster's web UI
  6. Start triggering 60k workflows in the primary cluster
  7. Wait ~10sec, then observe that some of the workflows that have been started so far have been replicated to the secondary cluster
  8. Stop the secondary cluster's container
  9. Wait until every workflow has been completed or timed out, as observed in the primary cluster's web UI
  10. Stop the primary cluster's container (in order to purge any kind of in-memory caches)
  11. Observe that the primary cluster's MySQL has ~282k replication_tasks rows and ~39k executions rows (some of the workflows timed out)
  12. Delete every row in the primary cluster's executions and current_executions tables
  13. Re-start the primary cluster's container
  14. Re-start the secondary cluster's container
  15. After waiting >10min, continue to observe:
    1. The primary cluster's MySQL still has ~282k replication_tasks rows
    2. The primary cluster is emitting logs described in the original post
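For anyone reproducing this, the starter + worker can be as simple as this Go SDK sketch (NoopWorkflow is a stand-in for my test workflow; the address and task queue name are placeholders):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// NoopWorkflow is a stand-in for the test workflow used in the repro.
func NoopWorkflow(ctx workflow.Context) error {
	return workflow.Sleep(ctx, time.Second)
}

func main() {
	// Placeholder address; connect to the primary cluster only.
	c, err := client.Dial(client.Options{HostPort: "localhost:7233"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Start the batch of workflows (100 at first, 60k later); starting only
	// needs the frontend, the worker below completes them.
	for i := 0; i < 100; i++ {
		if _, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
			TaskQueue: "repro-task-queue",
		}, NoopWorkflow); err != nil {
			log.Fatal(err)
		}
	}

	// Run the worker until interrupted.
	w := worker.New(c, "repro-task-queue", worker.Options{})
	w.RegisterWorkflow(NoopWorkflow)
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatal(err)
	}
}
```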

Then, to see what would happen to newly started workflows while both clusters were running, I did:

  1. Start another 100 workflows in the primary cluster
  2. Keep both cluster containers running
  3. Observe that the workflows were all processed & completed by the worker application that was never stopped
  4. Observe that the workflows were never replicated to the secondary cluster (see the check below)
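The same absence can be checked programmatically; a minimal sketch (the secondary cluster's address is a placeholder) that lists what the secondary cluster's visibility store can see:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder address for the secondary cluster's frontend.
	c, err := client.Dial(client.Options{HostPort: "localhost:8233", Namespace: "default"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	resp, err := c.ListWorkflow(context.Background(), &workflowservice.ListWorkflowExecutionsRequest{
		Namespace: "default",
		PageSize:  10,
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("first page of executions visible on secondary: %d\n", len(resp.Executions))
}
```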