uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
https://cadenceworkflow.io
MIT License

Bugfix: replication messages dropped during host shutdown #6143

Closed: davidporter-id-au closed this 2 days ago

davidporter-id-au commented 1 week ago

What changed?

Internal details: CDNC-9597

A user reported a problem during a failover: a workflow was silently dropped during replication at a continue-as-new event, with no corresponding DLQ message. We traced the likely cause to a shard movement at that time, which triggers several unpleasant edge conditions through its interactions with the following:

How did you test it?

Tested locally and with unit tests. I was able to reproduce most of the sequence of events with unit tests.

coveralls commented 4 days ago

Pull Request Test Coverage Report for Build 01905832-5907-4782-b9bc-f369f8e4ddf4

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| :-- | --: | --: | --: |
| service/history/replication/task_processor.go | 14 | 15 | 93.33% |
| Total | 14 | 15 | 93.33% |

| Files with Coverage Reduction | New Missed Lines | % |
| :-- | --: | --: |
| common/task/weighted_round_robin_task_scheduler.go | 1 | 89.05% |
| service/history/replication/task_processor.go | 1 | 84.34% |
| service/history/task/transfer_standby_task_executor.go | 2 | 86.23% |
| common/mapq/types/policy_collection.go | 2 | 93.06% |
| common/cache/lru.go | 2 | 93.01% |
| service/frontend/api/handler.go | 2 | 75.62% |
| common/membership/hashring.go | 2 | 84.69% |
| service/matching/tasklist/matcher.go | 2 | 90.91% |
| common/persistence/historyManager.go | 2 | 66.67% |
| service/history/handler/handler.go | 3 | 96.23% |
| Total | 40 | |

| Totals | Coverage Status |
| :-- | --: |
| Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df | 0.0% |
| Covered Lines | 104689 |
| Relevant Lines | 146554 |

💛 - Coveralls