uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
https://cadenceworkflow.io
MIT License

Refactor/removing cross cluster feature #6121

Closed davidporter-id-au closed 4 days ago

davidporter-id-au commented 3 weeks ago

What changed? This *mostly* removes the cross-cluster feature.

Background

The Cross-cluster feature was the ability to launch and interact with child workflows in another domain. It included the ability to start child workflows and signal them. The feature allowed child workflows to be launched in the target domain even if it was active in another region.

Problems

- Very few of our customers appear to have needed the feature. Few were interested in launching child workflows in another cluster, and none were unable to simply use an activity to make an RPC call to the other domain, as one would with any normal workflow.
- The feature was quite resource-intensive. It was pull-based: it spun up a polling stack that polled the other cluster for work, similar to the replication stack. This polling behaviour made the latency characteristics fairly unpredictable and used considerable DB resources, to the point that we just turned it off. The Uber/Cadence team resolved that, were there sufficient demand for the feature in the future, a push-based mechanism would probably be significantly preferable.
- The feature added a nontrivial amount of complexity to the codebase in a few areas, such as task processing and domain error handling, which introduced difficult-to-understand bugs such as the child workflow dropping error: https://github.com/uber/cadence/pull/5919

Decision to deprecate and alternatives

As of the June 2024 releases, the feature will be removed. The Cadence team is not aware of any users of the feature outside Uber (it was broken until mid-2021 anyway), but as an FYI, it will cease to be available.

If this behaviour is desirable, an easy workaround is, as previously mentioned, to use an activity to launch or signal the workflows in the other domain and block as needed.
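As a rough illustration of that workaround, here is a minimal sketch of activities that start and signal a workflow in another domain. Everything in it is an assumption for illustration: `RemoteDomainClient`, `CrossDomainActivities`, and `fakeRemote` are hypothetical names, and the interface is only loosely modelled on the start/signal calls a Cadence client exposes; it is not the real client API. In a real service the interface would be backed by a client bound to the target domain.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// RemoteDomainClient is a hypothetical, minimal stand-in for a client bound
// to the *other* domain. It is NOT the real Cadence client interface.
type RemoteDomainClient interface {
	StartWorkflow(ctx context.Context, workflowID, workflowType string, input []byte) (runID string, err error)
	SignalWorkflow(ctx context.Context, workflowID, runID, signalName string, payload []byte) error
}

// CrossDomainActivities groups the activities that talk to the other domain.
type CrossDomainActivities struct {
	Remote RemoteDomainClient
}

// StartRemoteWorkflow is an activity body: it makes a plain RPC to the other
// domain instead of relying on a cross-cluster child workflow.
func (a *CrossDomainActivities) StartRemoteWorkflow(ctx context.Context, workflowID, workflowType string, input []byte) (string, error) {
	if workflowID == "" {
		return "", errors.New("workflowID must not be empty")
	}
	runID, err := a.Remote.StartWorkflow(ctx, workflowID, workflowType, input)
	if err != nil {
		return "", fmt.Errorf("start workflow %q in remote domain: %w", workflowID, err)
	}
	return runID, nil
}

// SignalRemoteWorkflow is an activity body that signals a workflow in the other domain.
func (a *CrossDomainActivities) SignalRemoteWorkflow(ctx context.Context, workflowID, runID, signalName string, payload []byte) error {
	if err := a.Remote.SignalWorkflow(ctx, workflowID, runID, signalName, payload); err != nil {
		return fmt.Errorf("signal %q to workflow %q in remote domain: %w", signalName, workflowID, err)
	}
	return nil
}

// fakeRemote is an in-memory fake used only to make this sketch runnable.
type fakeRemote struct {
	started map[string]string   // workflowID -> runID
	signals map[string][]string // workflowID -> received signal names
}

func (f *fakeRemote) StartWorkflow(_ context.Context, workflowID, _ string, _ []byte) (string, error) {
	runID := "run-" + workflowID
	f.started[workflowID] = runID
	return runID, nil
}

func (f *fakeRemote) SignalWorkflow(_ context.Context, workflowID, runID, signalName string, _ []byte) error {
	if f.started[workflowID] != runID {
		return fmt.Errorf("workflow %q (run %q) not found", workflowID, runID)
	}
	f.signals[workflowID] = append(f.signals[workflowID], signalName)
	return nil
}

// runDemo starts a workflow in the fake remote domain and then signals it.
func runDemo() (string, []string, error) {
	remote := &fakeRemote{started: map[string]string{}, signals: map[string][]string{}}
	acts := &CrossDomainActivities{Remote: remote}
	ctx := context.Background()

	runID, err := acts.StartRemoteWorkflow(ctx, "order-42", "ProcessOrder", nil)
	if err != nil {
		return "", nil, err
	}
	if err := acts.SignalRemoteWorkflow(ctx, "order-42", runID, "approve", nil); err != nil {
		return "", nil, err
	}
	return runID, remote.signals["order-42"], nil
}

func main() {
	runID, signals, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("started run:", runID, "signals delivered:", signals)
}
```

The parent workflow would schedule these as ordinary activities, so retries and timeouts come from the activity options rather than from a dedicated cross-cluster polling stack. If the parent needs the remote result, a further activity could block on it as needed.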

PR details

This is a fairly high-risk refactor so it'll take some time to land. Broadly it:

Notable callouts

Testing

This is a pretty high-risk change and the bar for testing should be fairly high, so I'll update the manual testing in this table as it's done:

| test | status |
|---|---|
| checking a simple hello world workflow | passed |
| Simple parent/child workflow | passed |
| parent close policy - cancel child wf | fixed/passed |
| parent close policy - terminate child wf | fixed/passed |
| parent close policy - abandon child wf | fixed/passed |
| child wf closing - completion | passed |
| child wf closing - term | passed |
| child wf closing - cancel | passed |

There are obviously a bunch more possibilities with continue-as-new here too, but at a certain point I'm going to have to rely on automation. There have been extremely few changes to the integration tests.

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 91.60305% with 11 lines in your changes missing coverage. Please review.

Project coverage is 72.10%. Comparing base (34cfbb3) to head (13bfc7e).

:exclamation: Current head 13bfc7e differs from pull request most recent head 2abd7e5

Please upload reports for the commit 2abd7e5 to get more accurate results.

Additional details and impacted files

| [Files](https://app.codecov.io/gh/uber/cadence/pull/6121?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber) | Coverage Δ | |
|---|---|---|
| [common/persistence/data\_manager\_interfaces.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=common%2Fpersistence%2Fdata_manager_interfaces.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL3BlcnNpc3RlbmNlL2RhdGFfbWFuYWdlcl9pbnRlcmZhY2VzLmdv) | `95.48% <100.00%> (-0.03%)` | :arrow_down: |
| [common/persistence/data\_store\_interfaces.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=common%2Fpersistence%2Fdata_store_interfaces.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL3BlcnNpc3RlbmNlL2RhdGFfc3RvcmVfaW50ZXJmYWNlcy5nbw==) | `100.00% <ø> (ø)` | |
| [common/persistence/execution\_manager.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=common%2Fpersistence%2Fexecution_manager.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL3BlcnNpc3RlbmNlL2V4ZWN1dGlvbl9tYW5hZ2VyLmdv) | `88.05% <ø> (-0.11%)` | :arrow_down: |
| [common/persistence/metered.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=common%2Fpersistence%2Fmetered.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL3BlcnNpc3RlbmNlL21ldGVyZWQuZ28=) | `0.00% <ø> (ø)` | |
| [...n/persistence/nosql/nosqlplugin/cassandra/shard.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=common%2Fpersistence%2Fnosql%2Fnosqlplugin%2Fcassandra%2Fshard.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL3BlcnNpc3RlbmNlL25vc3FsL25vc3FscGx1Z2luL2Nhc3NhbmRyYS9zaGFyZC5nbw==) | `100.00% <ø> (ø)` | |
| [common/persistence/serialization/getters.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=common%2Fpersistence%2Fserialization%2Fgetters.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL3BlcnNpc3RlbmNlL3NlcmlhbGl6YXRpb24vZ2V0dGVycy5nbw==) | `93.20% <ø> (-0.07%)` | :arrow_down: |
| [common/persistence/shard\_manager.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=common%2Fpersistence%2Fshard_manager.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL3BlcnNpc3RlbmNlL3NoYXJkX21hbmFnZXIuZ28=) | `90.19% <100.00%> (+2.92%)` | :arrow_up: |
| [common/persistence/sql/sql\_shard\_store.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=common%2Fpersistence%2Fsql%2Fsql_shard_store.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL3BlcnNpc3RlbmNlL3NxbC9zcWxfc2hhcmRfc3RvcmUuZ28=) | `96.12% <100.00%> (-0.30%)` | :arrow_down: |
| [common/persistence/statsComputer.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=common%2Fpersistence%2FstatsComputer.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL3BlcnNpc3RlbmNlL3N0YXRzQ29tcHV0ZXIuZ28=) | `95.02% <ø> (-0.12%)` | :arrow_down: |
| [...rvice/history/engine/engineimpl/describe\_queues.go](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree&filepath=service%2Fhistory%2Fengine%2Fengineimpl%2Fdescribe_queues.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-c2VydmljZS9oaXN0b3J5L2VuZ2luZS9lbmdpbmVpbXBsL2Rlc2NyaWJlX3F1ZXVlcy5nbw==) | `0.00% <ø> (ø)` | |
| ... and [12 more](https://app.codecov.io/gh/uber/cadence/pull/6121?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber) | | |

... and [42 files with indirect coverage changes](https://app.codecov.io/gh/uber/cadence/pull/6121/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber)

------

[Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/uber/cadence/pull/6121?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://app.codecov.io/gh/uber/cadence/pull/6121?dropdown=coverage&src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber). Last update [34cfbb3...2abd7e5](https://app.codecov.io/gh/uber/cadence/pull/6121?dropdown=coverage&src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber).

coveralls commented 3 weeks ago

Pull Request Test Coverage Report for Build 018fee66-47e6-45df-9b2d-f982d0e49d3a

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/persistence/persistence-tests/persistenceTestBase.go | 0 | 2 | 0.0% |
| service/history/handler/handler.go | 0 | 2 | 0.0% |
| service/history/task/transfer_active_task_executor.go | 36 | 38 | 94.74% |
| common/persistence/persistence-tests/shardPersistenceTest.go | 0 | 10 | 0.0% |
| Total: | 118 | 134 | 88.06% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/history/task/transfer_standby_task_executor.go | 2 | 86.94% |
| common/task/parallel_task_processor.go | 2 | 93.06% |
| service/history/replication/task_processor.go | 2 | 82.76% |
| common/util.go | 2 | 91.84% |
| service/matching/tasklist/task_writer.go | 2 | 82.21% |
| common/persistence/metered.go | 2 | 80.87% |
| service/history/task/fetcher.go | 2 | 83.13% |
| service/history/execution/mutable_state_builder.go | 3 | 78.26% |
| service/history/handler/handler.go | 4 | 96.43% |
| common/persistence/wrappers/errorinjectors/utils.go | 6 | 91.41% |
| Total: | 309 | |

| Totals | Coverage Status |
|---|---|
| Change from base Build 018fedb8-ecd7-4675-ba4a-3dd7f0818e3a: | -0.1% |
| Covered Lines: | 103817 |
| Relevant Lines: | 145939 |

💛 - Coveralls
coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 019009dd-83db-4f9b-9198-5b8b217ab621

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/persistence/persistence-tests/persistenceTestBase.go | 0 | 2 | 0.0% |
| service/history/handler/handler.go | 0 | 2 | 0.0% |
| service/history/task/transfer_active_task_executor.go | 36 | 38 | 94.74% |
| common/persistence/persistence-tests/shardPersistenceTest.go | 0 | 10 | 0.0% |
| Total: | 118 | 134 | 88.06% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| common/task/weighted_round_robin_task_scheduler.go | 1 | 88.06% |
| service/matching/tasklist/db.go | 2 | 73.23% |
| service/history/replication/task_processor.go | 2 | 82.76% |
| common/util.go | 2 | 91.84% |
| common/persistence/metered.go | 2 | 80.87% |
| common/persistence/historyManager.go | 2 | 66.67% |
| common/log/tag/tags.go | 3 | 50.46% |
| common/persistence/nosql/nosql_task_store.go | 3 | 85.52% |
| common/task/fifo_task_scheduler.go | 3 | 84.54% |
| service/history/execution/mutable_state_builder.go | 3 | 78.26% |
| Total: | 341 | |

| Totals | Coverage Status |
|---|---|
| Change from base Build 018fedb8-ecd7-4675-ba4a-3dd7f0818e3a: | -0.1% |
| Covered Lines: | 103640 |
| Relevant Lines: | 145776 |

💛 - Coveralls
davidporter-id-au commented 1 week ago

> I think it should be safe to delete. If there is still some dangling code related to the cross-cluster feature, it should be safe to clean up afterward.

Appreciate the review; it's annoyingly large. Yeah, this is roughly only half of the changes: I've not touched anything in persistence yet. I expect there'll be a few other bits dangling as well.

coveralls commented 5 days ago

Pull Request Test Coverage Report for Build 01905267-e947-48a2-a33f-1ee19581946d

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/persistence/persistence-tests/persistenceTestBase.go | 0 | 2 | 0.0% |
| service/history/handler/handler.go | 0 | 2 | 0.0% |
| service/history/task/transfer_active_task_executor.go | 36 | 38 | 94.74% |
| common/persistence/persistence-tests/shardPersistenceTest.go | 0 | 10 | 0.0% |
| Total: | 118 | 134 | 88.06% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/history/shard/context.go | 1 | 78.13% |
| common/task/weighted_round_robin_task_scheduler.go | 2 | 88.56% |
| common/task/parallel_task_processor.go | 2 | 93.06% |
| common/persistence/metered.go | 2 | 80.87% |
| service/history/queue/timer_queue_processor_base.go | 3 | 77.87% |
| service/history/execution/mutable_state_builder.go | 3 | 78.26% |
| service/history/task/transfer_standby_task_executor.go | 4 | 87.35% |
| service/history/handler/handler.go | 4 | 96.43% |
| common/task/fifo_task_scheduler.go | 4 | 83.51% |
| service/frontend/api/handler.go | 4 | 75.68% |
| Total: | 317 | |

| Totals | Coverage Status |
|---|---|
| Change from base Build 01903cd7-c1ac-49f3-a7a4-fe9da6c16ce7: | -0.1% |
| Covered Lines: | 104302 |
| Relevant Lines: | 146057 |

💛 - Coveralls
coveralls commented 5 days ago

Pull Request Test Coverage Report for Build 019052cc-e584-4973-80fd-d356acfcec68

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/persistence/persistence-tests/persistenceTestBase.go | 0 | 2 | 0.0% |
| service/history/handler/handler.go | 0 | 2 | 0.0% |
| service/history/task/transfer_active_task_executor.go | 36 | 38 | 94.74% |
| common/persistence/persistence-tests/shardPersistenceTest.go | 0 | 10 | 0.0% |
| Total: | 118 | 134 | 88.06% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/history/shard/context.go | 1 | 78.93% |
| common/task/weighted_round_robin_task_scheduler.go | 2 | 89.05% |
| common/task/fifo_task_scheduler.go | 2 | 87.63% |
| common/persistence/metered.go | 2 | 80.87% |
| service/matching/tasklist/matcher.go | 2 | 90.18% |
| service/matching/tasklist/task_reader.go | 2 | 77.72% |
| service/history/execution/mutable_state_builder.go | 3 | 78.26% |
| common/persistence/statsComputer.go | 3 | 98.18% |
| service/history/task/transfer_standby_task_executor.go | 4 | 87.35% |
| common/archiver/filestore/historyArchiver.go | 4 | 80.95% |
| Total: | 327 | |

| Totals | Coverage Status |
|---|---|
| Change from base Build 01903cd7-c1ac-49f3-a7a4-fe9da6c16ce7: | -0.1% |
| Covered Lines: | 104314 |
| Relevant Lines: | 146061 |

💛 - Coveralls
coveralls commented 4 days ago

Pull Request Test Coverage Report for Build 019056c1-7321-41c4-9414-556ee6511194

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/persistence/persistence-tests/persistenceTestBase.go | 0 | 2 | 0.0% |
| service/history/handler/handler.go | 0 | 2 | 0.0% |
| service/history/task/transfer_active_task_executor.go | 36 | 38 | 94.74% |
| common/persistence/persistence-tests/shardPersistenceTest.go | 0 | 10 | 0.0% |
| Total: | 118 | 134 | 88.06% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/history/shard/context.go | 1 | 78.13% |
| common/task/fifo_task_scheduler.go | 2 | 84.54% |
| common/persistence/metered.go | 2 | 80.87% |
| service/matching/tasklist/matcher.go | 2 | 90.91% |
| service/matching/tasklist/task_reader.go | 2 | 77.72% |
| service/history/task/task.go | 3 | 84.81% |
| service/history/execution/mutable_state_builder.go | 3 | 78.39% |
| common/persistence/statsComputer.go | 3 | 98.18% |
| service/history/handler/handler.go | 4 | 96.43% |
| service/history/queue/timer_queue_processor_base.go | 4 | 77.66% |
| Total: | 340 | |

| Totals | Coverage Status |
|---|---|
| Change from base Build 01903cd7-c1ac-49f3-a7a4-fe9da6c16ce7: | -0.1% |
| Covered Lines: | 104279 |
| Relevant Lines: | 146061 |

💛 - Coveralls
coveralls commented 4 days ago

Pull Request Test Coverage Report for Build 01905720-e130-46fd-aa35-96725d16add5

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/persistence/persistence-tests/persistenceTestBase.go | 0 | 2 | 0.0% |
| service/history/handler/handler.go | 0 | 2 | 0.0% |
| service/history/task/transfer_active_task_executor.go | 36 | 38 | 94.74% |
| common/persistence/persistence-tests/shardPersistenceTest.go | 0 | 10 | 0.0% |
| Total: | 118 | 134 | 88.06% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/history/shard/context.go | 1 | 78.13% |
| service/history/task/transfer_standby_task_executor.go | 2 | 87.04% |
| common/task/weighted_round_robin_task_scheduler.go | 2 | 89.05% |
| service/matching/tasklist/task_list_manager.go | 2 | 76.65% |
| common/persistence/sql/sqlplugin/mysql/task.go | 2 | 73.68% |
| common/persistence/metered.go | 2 | 80.87% |
| common/membership/hashring.go | 2 | 84.69% |
| service/matching/tasklist/matcher.go | 2 | 90.91% |
| service/matching/tasklist/task_reader.go | 2 | 77.72% |
| common/persistence/sql/sqlplugin/mysql/db.go | 2 | 79.49% |
| Total: | 314 | |

| Totals | Coverage Status |
|---|---|
| Change from base Build 019056dc-98d1-4fa6-b475-a7aef51f4b90: | -0.1% |
| Covered Lines: | 104692 |
| Relevant Lines: | 146539 |

💛 - Coveralls