uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
https://cadenceworkflow.io
MIT License

Global ratelimiter: everything else #6141

Closed Groxx closed 2 days ago

Groxx commented 1 week ago

After too many attempts to break this apart and build different portions in self-contained ways, and running into various inter-dependent roadblocks... I just gave up and did it all at once.

Rollout plan for people who don't want or need this system

Do nothing :)

As of this PR, you'll use "disabled" and that should be as close to "no changes at all" as possible. Soon, you'll get "local", and then you'll have some new metrics you can use (or ignore) but otherwise no behavior changes.

And that'll be it. The "global" load-balanced stuff is likely to remain opt-in.

Rollout plan for us

For deployment: any order is fine and should not behave (too) badly, even if "global" or either shadow mode is selected on the initial deploy. Frontends will have background RatelimitUpdate request failures until History is deployed, but that just means they continue to use the "local" internal fallback, which in practice behaves the same as "local" or "disabled", just slightly noisier.

The smoothest deployment is: deploy everything on "disabled" or "local" (the default(s), so no requests are sent until deploy is done), then switch to "local-shadow-global" to warm global limiters / check that it's working, then "global" to use the global behavior.

Rolling back is just the opposite. Ideally disable things first to stop the requests, but even if you don't it should be fine.
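To make the "local internal fallback" behavior concrete, here is a minimal sketch in Go. It is not the code from this PR: the `limiter` interface, `fixedBudget`, `fallbackLimiter`, and `errUnavailable` are all hypothetical names, and the rule "use the global limiter only while the most recent update succeeded" is an assumption chosen to keep the example small. It is also not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// limiter is a hypothetical minimal interface, not Cadence's real one.
type limiter interface {
	Allow() bool
}

// fixedBudget is a toy limiter that allows a fixed number of requests.
type fixedBudget struct{ remaining int }

func (f *fixedBudget) Allow() bool {
	if f.remaining <= 0 {
		return false
	}
	f.remaining--
	return true
}

// errUnavailable stands in for a failed RatelimitUpdate request,
// e.g. because History has not been deployed yet.
var errUnavailable = errors.New("ratelimit update failed")

// fallbackLimiter uses the global limiter only while the most recent
// update succeeded (an assumption); otherwise it uses the local one.
type fallbackLimiter struct {
	local, global limiter
	useGlobal     bool
}

// Update records the result of the latest background update request.
func (f *fallbackLimiter) Update(err error) { f.useGlobal = err == nil }

// Allow delegates to whichever limiter is currently trusted.
func (f *fallbackLimiter) Allow() bool {
	if f.useGlobal {
		return f.global.Allow()
	}
	return f.local.Allow()
}

func main() {
	l := &fallbackLimiter{
		local:  &fixedBudget{remaining: 2},
		global: &fixedBudget{remaining: 1},
	}
	l.Update(errUnavailable) // updates fail: local budget is used
	fmt.Println(l.Allow(), l.Allow(), l.Allow())
	l.Update(nil) // updates succeed: global budget is used
	fmt.Println(l.Allow(), l.Allow())
}
```

Because a failed update only changes which limiter answers, requests are still limited at the local rate during a partial deploy, which matches the "same behavior, just noisier" description above.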

In more detail:

  1. At merge time, this will set the "key mode" (frontend.globalRatelimiterMode) to "disabled", which gets as close as is reasonably possible to acting exactly like it did before this PR.
    • This is also effectively the panic button for the initial rollout.
  2. Once that proves to not immediately explode, switch to "local" for all keys. This will keep the current ratelimiter rates, but will start collecting and emitting ratelimiter-usage metrics, so we can make sure that doesn't explode either (and update dashboards, etc).
    • "local" will eventually become the new default, and I'll remove "disabled": the behavior is the same, but I think we'll want to keep the metrics.
  3. Probably switch everything over to "local-shadow-global" so we start using the global system and emitting its metrics too, so we can make sure it doesn't seem like it'll explode / be surprisingly worse / etc.
    • pprof it to make sure running costs are in expected bounds
  4. Start switching individual domains over to "global" and lowering their RPS back to where we intend, rather than their current artificially-raised-to-mitigate-load-imbalance values.
    • This is done by making frontend.globalRatelimiterMode return "global" for keys like .*:my-domain (to catch user:my-domain, worker:my-domain, etc).
    • In the built-in dynamic configs, this looks like: constraints: {ratelimitKey: "user:my-domain"}
  5. If all goes well, we'll probably switch everyone over to "global" soonish, and we can retain "local" for edge cases that we didn't expect, where the old behavior works better.
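The per-key rollout in steps 1 to 5 can be sketched as a small lookup in Go. This only illustrates the intent of "return global for keys like .*:my-domain"; it is not how Cadence's dynamic config evaluates constraints. The `rule` type, `modeFor`, and `exampleRules` are hypothetical, and using regular expressions here is an assumption made for the example. Only the four mode names come from the description above.

```go
package main

import (
	"fmt"
	"regexp"
)

// Mode names as given in the rollout plan.
const (
	ModeDisabled          = "disabled"
	ModeLocal             = "local"
	ModeLocalShadowGlobal = "local-shadow-global"
	ModeGlobal            = "global"
)

// rule is a hypothetical stand-in for one config entry: keys matching
// pattern get mode.
type rule struct {
	pattern *regexp.Regexp
	mode    string
}

// modeFor returns the mode of the first rule that matches key, or
// fallback when no rule matches.
func modeFor(rules []rule, key, fallback string) string {
	for _, r := range rules {
		if r.pattern.MatchString(key) {
			return r.mode
		}
	}
	return fallback
}

// exampleRules switches every key for "my-domain" to global, catching
// user:my-domain, worker:my-domain, and so on.
func exampleRules() []rule {
	return []rule{
		{pattern: regexp.MustCompile(`^.*:my-domain$`), mode: ModeGlobal},
	}
}

func main() {
	rules := exampleRules()
	keys := []string{"user:my-domain", "worker:my-domain", "user:other-domain"}
	for _, key := range keys {
		// Keys without a matching rule stay on the cluster-wide mode.
		fmt.Printf("%s -> %s\n", key, modeFor(rules, key, ModeLocalShadowGlobal))
	}
}
```

Anchoring the pattern with `^` and `$` keeps a rule for my-domain from also matching a longer domain name that merely starts with it.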

The changes in a nutshell

(... I guess it's a coconut, given the size)

This PR includes:

Testing

Aside from the unit tests here, I've locally run all this with the new development_instance2.yaml file, made some domains / sent some requests, watched where requests went / how weights changed / when GC occurred / etc. After some bug fixes and the "GC locally after 5 idle periods" change, it seems to be doing exactly what I want it to do, including adjusting as I start and stop the extra instance(s).
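As a rough illustration of the "GC locally after 5 idle periods" idea, the Go sketch below tracks per-key usage and drops a key once it has seen no requests for five consecutive update periods. The `usage` and `collection` types and their methods are hypothetical, and the exact way the real code counts idle periods is an assumption.

```go
package main

import "fmt"

// maxIdlePeriods mirrors the "5 idle periods" mentioned above.
const maxIdlePeriods = 5

// usage is hypothetical per-key bookkeeping.
type usage struct {
	requests int // requests seen in the current period
	idle     int // consecutive periods with zero requests
}

// collection holds usage for every key seen recently.
type collection struct {
	keys map[string]*usage
}

func newCollection() *collection {
	return &collection{keys: map[string]*usage{}}
}

// Record counts one request for key, creating its entry if needed.
func (c *collection) Record(key string) {
	u, ok := c.keys[key]
	if !ok {
		u = &usage{}
		c.keys[key] = u
	}
	u.requests++
}

// EndPeriod closes the current update period: it updates each key's
// idle streak, resets its request count, and deletes keys that have
// been idle for maxIdlePeriods periods in a row.
func (c *collection) EndPeriod() {
	for key, u := range c.keys {
		if u.requests > 0 {
			u.idle = 0
		} else {
			u.idle++
		}
		u.requests = 0
		if u.idle >= maxIdlePeriods {
			delete(c.keys, key) // deleting during range is safe in Go
		}
	}
}

func main() {
	c := newCollection()
	c.Record("user:my-domain")
	c.EndPeriod() // active period, idle streak stays at 0
	for i := 1; i <= maxIdlePeriods; i++ {
		c.EndPeriod()
		_, present := c.keys["user:my-domain"]
		fmt.Printf("idle period %d: present=%v\n", i, present)
	}
}
```

Any request resets the idle streak, so a key that is used even occasionally is never collected.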

I would like to build a multi-instance cluster test (or a docker-compose.yaml at the very least) for a variety of kinds of tests, but I wasn't able to find anything that looked promising to build off, and I didn't want to spend a week figuring one out from scratch :\ I'm open to trying if someone has concrete ideas though.

Future changes, roughly in priority order

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 68.67470% with 156 lines in your changes missing coverage. Please review.

Project coverage is 72.64%. Comparing base (03d9a2e) to head (2a3b361). Report is 2 commits behind head on master.

:exclamation: Current head 2a3b361 differs from pull request most recent head 1f37531

Please upload reports for the commit 1f37531 to get more accurate results.

Additional details and impacted files

| Files | Coverage Δ | |
|---|---|---|
| client/history/client.go | `79.65% <100.00%> (+7.25%)` | :arrow_up: |
| common/quotas/collection.go | `100.00% <ø> (ø)` | |
| ...ommon/quotas/global/collection/internal/limiter.go | `96.42% <100.00%> (-3.58%)` | :arrow_down: |
| common/quotas/multistageratelimiter.go | `88.23% <100.00%> (ø)` | |
| common/types/mapper/proto/history.go | `99.23% <100.00%> (+<0.01%)` | :arrow_up: |
| service/frontend/config/config.go | `100.00% <100.00%> (ø)` | |
| client/history/peer_resolver.go | `96.72% <92.59%> (-3.28%)` | :arrow_down: |
| common/dynamicconfig/filter.go | `46.47% <0.00%> (-2.06%)` | :arrow_down: |
| common/quotas/global/rpc/error.go | `25.00% <25.00%> (ø)` | |
| common/quotas/global/algorithm/requestweighted.go | `94.48% <60.00%> (-5.52%)` | :arrow_down: |
| ... and [10 more](https://app.codecov.io/gh/uber/cadence/pull/6141) | | |

... and [53 files with indirect coverage changes](https://app.codecov.io/gh/uber/cadence/pull/6141/indirect-changes)

------

[Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/uber/cadence/pull/6141).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://app.codecov.io/gh/uber/cadence/pull/6141). Last update 03d9a2e...1f37531. Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).

coveralls commented 1 week ago

Pull Request Test Coverage Report for Build 01902e4f-dc36-4239-92f4-af9b6ee1bc99

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
client/history/peer_resolver.go 37 39 94.87%
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 0 6 0.0%
common/quotas/global/algorithm/requestweighted.go 17 25 68.0%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 24 34 70.59%
Total: 605 796 76.01%
Files with Coverage Reduction New Missed Lines %
service/history/queue/timer_queue_processor_base.go 1 77.66%
service/history/shard/context.go 2 79.13%
common/task/parallel_task_processor.go 2 93.06%
common/peerprovider/ringpopprovider/config.go 2 81.58%
common/quotas/global/collection/internal/limiter.go 2 97.56%
common/task/fifo_task_scheduler.go 2 85.57%
service/frontend/api/handler.go 2 75.62%
service/history/task/fetcher.go 3 85.57%
common/archiver/filestore/historyArchiver.go 4 80.95%
service/history/task/transfer_active_task_executor.go 4 72.77%
Total: 142
Totals Coverage Status
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: -0.03%
Covered Lines: 107043
Relevant Lines: 149777

💛 - Coveralls
coveralls commented 1 week ago

Pull Request Test Coverage Report for Build 019037db-5969-4cb6-a83d-760851588f21

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
client/history/peer_resolver.go 38 44 86.36%
common/metrics/tags.go 0 6 0.0%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 24 34 70.59%
Total: 616 824 74.76%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 2 88.56%
common/quotas/global/collection/internal/limiter.go 2 97.56%
common/persistence/historyManager.go 2 66.67%
service/history/task/task.go 3 84.81%
common/task/fifo_task_scheduler.go 3 84.54%
service/history/task/timer_standby_task_executor.go 3 85.63%
service/history/task/transfer_active_task_executor.go 4 72.77%
service/history/execution/cache.go 6 74.61%
service/history/execution/mutable_state_decision_task_manager.go 8 89.18%
host/testcluster.go 16 68.73%
Total: 136
Totals Coverage Status
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: -0.02%
Covered Lines: 107082
Relevant Lines: 149803

💛 - Coveralls
coveralls commented 6 days ago

Pull Request Test Coverage Report for Build 01904c1d-8ac5-47d2-8f14-b71d02715363

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
client/history/peer_resolver.go 38 44 86.36%
common/metrics/tags.go 0 6 0.0%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 24 34 70.59%
Total: 689 824 83.62%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 2 88.56%
common/peerprovider/ringpopprovider/config.go 2 81.58%
service/matching/tasklist/task_list_manager.go 2 77.05%
common/quotas/global/collection/internal/limiter.go 2 97.56%
service/frontend/api/handler.go 2 75.62%
service/history/task/task.go 3 84.81%
service/history/task/timer_standby_task_executor.go 3 85.63%
tools/cli/admin_db_decode_thrift.go 3 69.23%
common/archiver/filestore/historyArchiver.go 4 80.95%
service/history/task/transfer_active_task_executor.go 4 72.77%
Total: 164
Totals Coverage Status
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: -0.001%
Covered Lines: 107105
Relevant Lines: 149803

💛 - Coveralls
coveralls commented 3 days ago

Pull Request Test Coverage Report for Build 01905ac0-75d9-412d-be38-63dbc29251ea

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/log/tag/tags.go 9 15 60.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
Total: 699 850 82.24%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
service/history/task/transfer_standby_task_executor.go 2 86.23%
common/cache/lru.go 2 93.01%
common/quotas/global/collection/internal/limiter.go 2 97.56%
common/task/fifo_task_scheduler.go 2 87.63%
service/frontend/api/handler.go 2 75.62%
common/membership/hashring.go 2 84.69%
service/history/handler/handler.go 3 95.65%
common/persistence/statsComputer.go 3 98.18%
Total: 33
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.06%
Covered Lines: 105261
Relevant Lines: 147236

💛 - Coveralls
coveralls commented 3 days ago

Pull Request Test Coverage Report for Build 01905aec-2f03-4488-81f1-7aff8cdf3c00

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
Total: 688 852 80.75%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
service/history/task/transfer_standby_task_executor.go 2 86.23%
common/mapq/types/policy_collection.go 2 93.06%
common/cache/lru.go 2 93.01%
common/quotas/global/collection/internal/limiter.go 2 97.56%
service/frontend/api/handler.go 2 75.74%
common/persistence/historyManager.go 2 66.67%
service/history/handler/handler.go 3 95.65%
service/history/task/transfer_active_task_executor.go 3 71.09%
Total: 43
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.06%
Covered Lines: 105261
Relevant Lines: 147238

💛 - Coveralls
Groxx commented 3 days ago

@davidporter-id-au For users who're not particularly interested in this problem, who'll not attempt to roll out flipr config for the global rate-limit feature:

  • Are there any meaningful changes they should know about?
  • Directionally, can they just do nothing and it'll remain as-is for them?

Added deployment steps near the top of the commit message. Look good?

coveralls commented 3 days ago

Pull Request Test Coverage Report for Build 01905b7c-fcd3-46a3-9e31-6318be207dbc

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
Total: 684 848 80.66%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
service/history/task/transfer_standby_task_executor.go 2 86.64%
common/cache/lru.go 2 93.01%
common/quotas/global/collection/internal/limiter.go 2 97.37%
common/task/fifo_task_scheduler.go 2 85.57%
service/frontend/api/handler.go 2 75.62%
service/history/task/transfer_active_task_executor.go 2 71.17%
common/persistence/statsComputer.go 3 98.18%
common/archiver/filestore/historyArchiver.go 4 80.95%
Total: 21
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.07%
Covered Lines: 105274
Relevant Lines: 147232

💛 - Coveralls
coveralls commented 3 days ago

Pull Request Test Coverage Report for Build 01905bc1-d3c5-43ef-b8b1-17df18d6a6da

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 20 30 66.67%
Total: 686 848 80.9%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
common/cache/lru.go 2 93.01%
service/matching/tasklist/task_list_manager.go 2 76.65%
common/quotas/global/collection/internal/limiter.go 2 97.37%
common/task/fifo_task_scheduler.go 2 85.57%
service/history/shard/context.go 9 78.13%
Total: 19
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.09%
Covered Lines: 105301
Relevant Lines: 147232

💛 - Coveralls
coveralls commented 2 days ago

Pull Request Test Coverage Report for Build 019060c6-2bde-406c-bf4d-55ee12eb19fe

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 20 30 66.67%
Total: 688 850 80.94%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
service/history/task/transfer_standby_task_executor.go 2 86.84%
common/cache/lru.go 2 93.01%
common/quotas/global/collection/internal/limiter.go 2 97.37%
service/matching/tasklist/task_list_manager.go 3 76.45%
common/task/fifo_task_scheduler.go 3 84.54%
common/persistence/statsComputer.go 3 98.18%
service/history/shard/context.go 9 78.13%
Total: 26
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.06%
Covered Lines: 105257
Relevant Lines: 147234

💛 - Coveralls