
Replication tasks referencing archived workflow executions can't be processed, blocking all replication #4348


emmercm commented 1 year ago

Carried over from this community thread: https://community.temporal.io/t/what-is-the-correct-way-to-disable-re-enable-multi-cluster-replication/8216?u=emmercm

Expected Behavior

When two clusters are replicating to each other and one is taken offline for an extended period (longer than the namespace retention window), the offline cluster should catch up on replication once it is brought back online.

Actual Behavior

No workflow history replication is occurring, including for workflows newly started after the secondary cluster was brought back online.

The trio of error logs that I see constantly coming from the primary cluster's history service are, in order:

The Persistent fetch operation Failure error seems to be the root problem. I would have expected shard.Context.GetWorkflowExecution() to return serviceerror.NotFound if the old workflows couldn't be found, though, so I'm confused by that.
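For illustration, a minimal Go sketch of the handling I expected around that call; handleFetchError is a hypothetical helper (not actual server code), but serviceerror.NotFound is the real type from go.temporal.io/api/serviceerror:

```go
package replication

import (
	"errors"
	"fmt"

	"go.temporal.io/api/serviceerror"
)

// handleFetchError is a hypothetical helper showing the behavior I expected:
// if the execution referenced by a replication task is already gone (e.g.
// deleted past retention), the task should be treated as skippable rather
// than retried forever as a persistence failure.
func handleFetchError(err error) error {
	var notFound *serviceerror.NotFound
	if errors.As(err, &notFound) {
		// Execution no longer exists; ack/drop the replication task.
		return nil
	}
	// Anything else is a genuine persistence failure worth retrying.
	return fmt.Errorf("persistent fetch operation failure: %w", err)
}
```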

Some other metrics, carried over from the linked community thread:

I'm happy to gather any other metrics that would help debug the issue.

Given that replication tasks are fetched strictly in ascending task_id order (SELECT task_id, data, data_encoding FROM replication_tasks WHERE shard_id = ? AND task_id >= ? AND task_id < ? ORDER BY task_id LIMIT ?), a task that can never be processed blocks every task queued behind it in the same shard, so I don't think this will ever resolve on its own.
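For anyone who wants to poke at the backlog directly, a minimal sketch (the DSN is a placeholder; it assumes read access to the primary cluster's default MySQL schema) that reports the oldest pending task per shard:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; point this at the primary cluster's temporal database.
	db, err := sql.Open("mysql", "temporal:temporal@tcp(localhost:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The lowest task_id per shard is the task blocking everything behind it.
	rows, err := db.Query(`SELECT shard_id, MIN(task_id), COUNT(*)
	                       FROM replication_tasks GROUP BY shard_id`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var shardID, minTaskID, pending int64
		if err := rows.Scan(&shardID, &minTaskID, &pending); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("shard %d: oldest pending task %d (%d queued)\n", shardID, minTaskID, pending)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```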

Steps to Reproduce the Problem

  1. Have two Temporal clusters:
    1. With 512 history shards
    2. In multi-cluster replication, and observe it is working as expected
  2. Have all namespaces with:
    1. A default 72h retention period
    2. The default 4 task queue partitions
  3. Have all namespaces active in the "primary" cluster, none active in the "secondary" cluster (a sketch for verifying this follows the list)
  4. Scale the secondary cluster down to zero replicas
  5. Wait an extended period of time, e.g. 2 weeks
    1. During this time, the primary cluster is still processing workflows, at a rate of ~240/hour for a total of ~140k completed while the secondary cluster is offline
  6. Scale the secondary cluster back above zero replicas
  7. Observe that no replication is occurring, based on the metrics above
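For reference, a minimal sketch (address and namespace are placeholders) for checking which cluster a namespace is active in, using the Go SDK's raw WorkflowService client:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder address; point this at either cluster's frontend.
	c, err := client.Dial(client.Options{HostPort: "localhost:7233"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	resp, err := c.WorkflowService().DescribeNamespace(context.Background(),
		&workflowservice.DescribeNamespaceRequest{Namespace: "default"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("active cluster: %s\n", resp.GetReplicationConfig().GetActiveClusterName())
	for _, cl := range resp.GetReplicationConfig().GetClusters() {
		fmt.Printf("replicating to: %s\n", cl.GetClusterName())
	}
}
```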

Specifications

emmercm commented 1 year ago

I was able to reproduce this locally with Docker Compose. The steps were:

  1. Have two temporalio/auto-setup:1.19.1 containers, both backed by MySQL
  2. Set up multi-cluster replication between the two
  3. Start a workflow worker application that is connected to the primary cluster only
  4. Start 100 workflows in the primary cluster, have the worker application process & complete them (a minimal starter/worker sketch follows this list)
  5. Observe the workflow history is replicated to the secondary cluster, via the secondary cluster's web UI
  6. Start triggering 60k workflows in the primary cluster
  7. Wait ~10sec, then observe that some of the workflows that have been started so far have been replicated to the secondary cluster
  8. Stop the secondary cluster's container
  9. Wait until every workflow has been completed or timed out, as observed in the primary cluster's web UI
  10. Stop the primary cluster's container (in order to purge any kind of in-memory caches)
  11. Observe that the primary cluster's MySQL has ~282k replication_tasks rows and ~39k executions rows (some of the workflows timed out)
  12. Delete every row in the primary cluster's executions and current_executions tables
  13. Re-start the primary cluster's container
  14. Re-start the secondary cluster's container
  15. After waiting >10min, continue to observe:
    1. The primary cluster's MySQL still has ~282k replication_tasks rows
    2. The primary cluster is emitting logs described in the original post
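For anyone reproducing this, the starter + worker can be as simple as this Go SDK sketch (NoopWorkflow is a stand-in for my test workflow; the address and task queue name are placeholders):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// NoopWorkflow is a stand-in for the test workflow used in the repro.
func NoopWorkflow(ctx workflow.Context) error {
	return workflow.Sleep(ctx, time.Second)
}

func main() {
	// Placeholder address; connect to the primary cluster only.
	c, err := client.Dial(client.Options{HostPort: "localhost:7233"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Start the batch of workflows (100 at first, 60k later); starting only
	// needs the frontend, the worker below completes them.
	for i := 0; i < 100; i++ {
		if _, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
			TaskQueue: "repro-task-queue",
		}, NoopWorkflow); err != nil {
			log.Fatal(err)
		}
	}

	// Run the worker until interrupted.
	w := worker.New(c, "repro-task-queue", worker.Options{})
	w.RegisterWorkflow(NoopWorkflow)
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatal(err)
	}
}
```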

Then, to see what would happen to newly started workflows while both clusters were running, I did:

  1. Start another 100 workflows in the primary cluster
  2. Keep both cluster containers running
  3. Observe that the workflows were all processed & completed by the worker application that was never stopped
  4. Observe that the workflows were never replicated to the secondary cluster (see the check below)
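The same absence can be checked programmatically; a minimal sketch (the secondary cluster's address is a placeholder) that lists what the secondary cluster's visibility store can see:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder address for the secondary cluster's frontend.
	c, err := client.Dial(client.Options{HostPort: "localhost:8233", Namespace: "default"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	resp, err := c.ListWorkflow(context.Background(), &workflowservice.ListWorkflowExecutionsRequest{
		Namespace: "default",
		PageSize:  10,
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("first page of executions visible on secondary: %d\n", len(resp.Executions))
}
```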