uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
https://cadenceworkflow.io
MIT License

Persist failover history in DomainInfo data #6139

Closed fimanishi closed 1 week ago

fimanishi commented 1 week ago

What changed? Added functionality to persist recent failover event data in DomainInfo whenever a valid failover is executed. Each FailoverEvent contains the failover timestamp, fromCluster, toCluster, and FailoverType ("Force"/"Grace"). FailoverHistory is the key in the DomainInfo data under which the events are stored: a slice of FailoverEvents serialized as a string, with its maximum size defined by dynamicconfig.FrontendFailoverHistoryMaxSize (default value 5, domain filter allowed). FailoverHistory always keeps the n most recent FailoverEvents, sorted by descending timestamp.
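For readers who want a concrete picture of the bounded history described above, here is a minimal sketch, not the PR's actual code. It assumes DomainInfo data is a `map[string]string`, that the slice is encoded as JSON, and that the timestamp is stored as Unix nanoseconds. The field names, JSON tags, and the helpers `readFailoverHistory` and `appendFailoverEvent` are hypothetical, and `maxSize` stands in for the value read from dynamicconfig.FrontendFailoverHistoryMaxSize.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// failoverHistoryKey is the key used inside the domain data map.
const failoverHistoryKey = "FailoverHistory"

// FailoverEvent is a hypothetical shape for one history entry; the field
// names, JSON tags, and timestamp encoding are assumptions for illustration.
type FailoverEvent struct {
	EventTime    int64  `json:"eventTime"` // Unix nanoseconds (assumed)
	FromCluster  string `json:"fromCluster"`
	ToCluster    string `json:"toCluster"`
	FailoverType string `json:"failoverType"` // "Force" or "Grace"
}

// readFailoverHistory decodes the stored string back into a slice.
// A missing or empty value yields an empty history.
func readFailoverHistory(data map[string]string) ([]FailoverEvent, error) {
	raw, ok := data[failoverHistoryKey]
	if !ok || raw == "" {
		return nil, nil
	}
	var history []FailoverEvent
	if err := json.Unmarshal([]byte(raw), &history); err != nil {
		return nil, fmt.Errorf("decode failover history: %w", err)
	}
	return history, nil
}

// appendFailoverEvent adds ev, sorts by descending timestamp, keeps only
// the maxSize most recent events, and stores the result as a string.
func appendFailoverEvent(data map[string]string, ev FailoverEvent, maxSize int) error {
	history, err := readFailoverHistory(data)
	if err != nil {
		return err
	}
	history = append(history, ev)
	sort.SliceStable(history, func(i, j int) bool {
		return history[i].EventTime > history[j].EventTime
	})
	if maxSize < 0 {
		maxSize = 0
	}
	if len(history) > maxSize {
		history = history[:maxSize]
	}
	encoded, err := json.Marshal(history)
	if err != nil {
		return fmt.Errorf("encode failover history: %w", err)
	}
	data[failoverHistoryKey] = string(encoded)
	return nil
}

func main() {
	data := map[string]string{}
	start := time.Date(2024, time.July, 1, 0, 0, 0, 0, time.UTC)
	// Record seven failovers with a limit of five; the two oldest are dropped.
	for i := 0; i < 7; i++ {
		ev := FailoverEvent{
			EventTime:    start.Add(time.Duration(i) * time.Hour).UnixNano(),
			FromCluster:  "cluster0",
			ToCluster:    "cluster1",
			FailoverType: "Force",
		}
		if err := appendFailoverEvent(data, ev, 5); err != nil {
			fmt.Println("warning:", err)
		}
	}
	fmt.Println(data[failoverHistoryKey])
}
```

Sorting after every append, rather than simply prepending, keeps the descending order even if an event arrives with an older timestamp; with a limit as small as 5 the cost is negligible.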

Why? Persisting failover information improves failover visibility for clients and the Cadence team.

How did you test it? Unit tests and integration tests. Also tested locally by triggering failovers in an environment with multiple Cadence clusters and replication enabled.

Potential risks: The change does not affect the logic of UpdateDomain; it only adds the failover info to the DomainInfo data. The main risk is introducing something that causes the code to panic. Errors while adding the FailoverHistory are logged as warnings and do not return from or interrupt the UpdateDomain action.
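The "log a warning and carry on" behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the handler's real code: the `domain` struct, the `warnLogger` interface, `memoryLogger`, `failingRecorder`, and this simplified `updateDomain` are all hypothetical stand-ins, and the real ordering of steps inside UpdateDomain may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// warnLogger is a hypothetical, minimal logging interface.
type warnLogger interface {
	Warn(msg string, err error)
}

// memoryLogger collects warnings in memory so they can be inspected.
type memoryLogger struct {
	warnings []string
}

func (l *memoryLogger) Warn(msg string, err error) {
	l.warnings = append(l.warnings, msg+": "+err.Error())
}

// domain is a simplified stand-in for the domain record being updated.
type domain struct {
	ActiveCluster string
	Data          map[string]string
}

// errHistory simulates a failure while recording failover history.
var errHistory = errors.New("simulated failover history failure")

// failingRecorder always fails, to exercise the warning path.
func failingRecorder(*domain) error { return errHistory }

// updateDomain performs the core update, then records history on a
// best-effort basis: a recording error is logged as a warning and never
// returned, so it cannot interrupt the update itself.
func updateDomain(d *domain, toCluster string, record func(*domain) error, logger warnLogger) error {
	d.ActiveCluster = toCluster
	if err := record(d); err != nil {
		logger.Warn("failed to add failover history", err)
	}
	return nil
}

func main() {
	logger := &memoryLogger{}
	d := &domain{ActiveCluster: "cluster0", Data: map[string]string{}}
	err := updateDomain(d, "cluster1", failingRecorder, logger)
	fmt.Println(err, d.ActiveCluster, logger.warnings)
}
```

Note that this pattern only covers returned errors; a panic inside the recording step would still propagate, which matches the risk called out above.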

Release notes

Documentation Changes

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 80.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 72.65%. Comparing base (91c09ef) to head (9040f45). Report is 2 commits behind head on master.

:exclamation: Current head 9040f45 differs from pull request most recent head 87ee3e4

Please upload reports for the commit 87ee3e4 to get more accurate results.

Additional details and impacted files

| [Files](https://app.codecov.io/gh/uber/cadence/pull/6139?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber) | Coverage Δ | |
|---|---|---|
| [service/frontend/config/config.go](https://app.codecov.io/gh/uber/cadence/pull/6139?src=pr&el=tree&filepath=service%2Ffrontend%2Fconfig%2Fconfig.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-c2VydmljZS9mcm9udGVuZC9jb25maWcvY29uZmlnLmdv) | `100.00% <100.00%> (ø)` | |
| [common/domain/handler.go](https://app.codecov.io/gh/uber/cadence/pull/6139?src=pr&el=tree&filepath=common%2Fdomain%2Fhandler.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-Y29tbW9uL2RvbWFpbi9oYW5kbGVyLmdv) | `93.41% <78.94%> (+35.92%)` | :arrow_up: |

... and [8 files with indirect coverage changes](https://app.codecov.io/gh/uber/cadence/pull/6139/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber)

------

[Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/uber/cadence/pull/6139?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://app.codecov.io/gh/uber/cadence/pull/6139?dropdown=coverage&src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber). Last update [70f11e4...87ee3e4](https://app.codecov.io/gh/uber/cadence/pull/6139?dropdown=coverage&src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber).
coveralls commented 1 week ago

Pull Request Test Coverage Report for Build 01903767-e856-411d-8370-2ff33037b8e2

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/domain/handler.go | 26 | 30 | 86.67% |
| common/constants.go | 0 | 8 | 0.0% |

<!-- Total: 27 39 69.23% -->

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| common/task/weighted_round_robin_task_scheduler.go | 2 | 89.05% |
| service/matching/tasklist/task_list_manager.go | 2 | 77.05% |
| common/persistence/historyManager.go | 2 | 66.67% |
| service/history/task/task.go | 3 | 84.81% |
| service/history/task/timer_standby_task_executor.go | 3 | 85.63% |
| service/history/task/transfer_active_task_executor.go | 4 | 72.77% |
| service/history/execution/cache.go | 6 | 74.61% |
| service/history/task/task_util.go | 24 | 69.43% |

<!-- Total: 46 -->

| Totals | Coverage Status |
|---|---|
| Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: | -0.001% |
| Covered Lines: | 106672 |
| Relevant Lines: | 149198 |

💛 - Coveralls
coveralls commented 1 week ago

Pull Request Test Coverage Report for Build 019037a0-7d0e-45d3-b574-308220d36860

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| common/domain/handler.go | 26 | 30 | 86.67% |
| common/constants.go | 0 | 8 | 0.0% |

<!-- Total: 27 39 69.23% -->

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/history/queue/timer_queue_processor_base.go | 1 | 78.28% |
| common/task/weighted_round_robin_task_scheduler.go | 2 | 89.05% |
| common/task/parallel_task_processor.go | 2 | 93.06% |
| service/matching/tasklist/task_list_manager.go | 2 | 76.65% |
| service/matching/tasklist/matcher.go | 2 | 90.91% |
| common/persistence/historyManager.go | 2 | 66.67% |
| common/persistence/statsComputer.go | 3 | 98.21% |
| service/history/task/fetcher.go | 3 | 86.6% |
| common/types/history.go | 4 | 45.35% |
| common/task/fifo_task_scheduler.go | 4 | 83.51% |

<!-- Total: 30 -->

| Totals | Coverage Status |
|---|---|
| Change from base Build 0190377e-f43d-435a-8a67-a08ecb9832b7: | 0.02% |
| Covered Lines: | 106707 |
| Relevant Lines: | 149198 |

💛 - Coveralls
davidporter-id-au commented 1 week ago

> It looks ok, but do we have a project on audit logging all domain changes? This is definitely a good enough solution to cover some simple questions about when the domain failed over, but I'm not sure if it is the only question that we have about domain changes.

I don't know if I responded to this question elsewhere, so adding here as well: I 100% agree, this is not at all a serious attempt to track and audit changes, configuration changes, security-relevant changes, and similar. I think there's a very good argument for creating such a stream - for things like this - for access logs, for long-term tracking, and similar.

Such an architecture should probably be different. I can't imagine it'd be something we really want to leave in CAAS; it's much better suited to being just another stream in Kafka that gets shoved into Hive for a while, or something like that.

However, such an event log is not currently the main focus of what we're working on. I think it's a good idea and that there's probably value in us doing it, particularly as we focus more on the authorization-by-default components.

This tiny user-facing log of events is just enough to get us over the line for the failover project, and it shouldn't preclude us from doing a more serious event log in the future.