Closed Groxx closed 2 days ago
Attention: Patch coverage is 68.67470%
with 156 lines
in your changes missing coverage. Please review.
Project coverage is 72.64%. Comparing base (
03d9a2e
) to head (2a3b361
). Report is 2 commits behind head on master.:exclamation: Current head 2a3b361 differs from pull request most recent head 1f37531
Please upload reports for the commit 1f37531 to get more accurate results.
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % | ||
---|---|---|---|---|---|
client/history/peer_resolver.go | 37 | 39 | 94.87% | ||
common/dynamicconfig/config.go | 11 | 13 | 84.62% | ||
service/history/resource/resource.go | 13 | 15 | 86.67% | ||
common/quotas/global/rpc/error.go | 1 | 4 | 25.0% | ||
service/history/resource/resource_test_utils.go | 0 | 3 | 0.0% | ||
common/resource/resource_test_utils.go | 0 | 5 | 0.0% | ||
common/metrics/tags.go | 0 | 6 | 0.0% | ||
common/quotas/global/algorithm/requestweighted.go | 17 | 25 | 68.0% | ||
common/types/mapper/thrift/any.go | 17 | 25 | 68.0% | ||
service/history/handler/handler.go | 24 | 34 | 70.59% | ||
<!-- | Total: | 605 | 796 | 76.01% | --> |
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
service/history/queue/timer_queue_processor_base.go | 1 | 77.66% | ||
service/history/shard/context.go | 2 | 79.13% | ||
common/task/parallel_task_processor.go | 2 | 93.06% | ||
common/peerprovider/ringpopprovider/config.go | 2 | 81.58% | ||
common/quotas/global/collection/internal/limiter.go | 2 | 97.56% | ||
common/task/fifo_task_scheduler.go | 2 | 85.57% | ||
service/frontend/api/handler.go | 2 | 75.62% | ||
service/history/task/fetcher.go | 3 | 85.57% | ||
common/archiver/filestore/historyArchiver.go | 4 | 80.95% | ||
service/history/task/transfer_active_task_executor.go | 4 | 72.77% | ||
<!-- | Total: | 142 | --> |
Totals | |
---|---|
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: | -0.03% |
Covered Lines: | 107043 |
Relevant Lines: | 149777 |
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % | ||
---|---|---|---|---|---|
common/dynamicconfig/config.go | 11 | 13 | 84.62% | ||
service/history/resource/resource.go | 13 | 15 | 86.67% | ||
common/quotas/global/rpc/error.go | 1 | 4 | 25.0% | ||
service/history/resource/resource_test_utils.go | 0 | 3 | 0.0% | ||
common/resource/resource_test_utils.go | 0 | 5 | 0.0% | ||
client/history/peer_resolver.go | 38 | 44 | 86.36% | ||
common/metrics/tags.go | 0 | 6 | 0.0% | ||
common/quotas/global/algorithm/requestweighted.go | 22 | 30 | 73.33% | ||
common/types/mapper/thrift/any.go | 17 | 25 | 68.0% | ||
service/history/handler/handler.go | 24 | 34 | 70.59% | ||
<!-- | Total: | 616 | 824 | 74.76% | --> |
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
common/task/weighted_round_robin_task_scheduler.go | 2 | 88.56% | ||
common/quotas/global/collection/internal/limiter.go | 2 | 97.56% | ||
common/persistence/historyManager.go | 2 | 66.67% | ||
service/history/task/task.go | 3 | 84.81% | ||
common/task/fifo_task_scheduler.go | 3 | 84.54% | ||
service/history/task/timer_standby_task_executor.go | 3 | 85.63% | ||
service/history/task/transfer_active_task_executor.go | 4 | 72.77% | ||
service/history/execution/cache.go | 6 | 74.61% | ||
service/history/execution/mutable_state_decision_task_manager.go | 8 | 89.18% | ||
host/testcluster.go | 16 | 68.73% | ||
<!-- | Total: | 136 | --> |
Totals | |
---|---|
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: | -0.02% |
Covered Lines: | 107082 |
Relevant Lines: | 149803 |
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % | ||
---|---|---|---|---|---|
common/dynamicconfig/config.go | 11 | 13 | 84.62% | ||
service/history/resource/resource.go | 13 | 15 | 86.67% | ||
common/quotas/global/rpc/error.go | 1 | 4 | 25.0% | ||
service/history/resource/resource_test_utils.go | 0 | 3 | 0.0% | ||
common/resource/resource_test_utils.go | 0 | 5 | 0.0% | ||
client/history/peer_resolver.go | 38 | 44 | 86.36% | ||
common/metrics/tags.go | 0 | 6 | 0.0% | ||
common/quotas/global/algorithm/requestweighted.go | 22 | 30 | 73.33% | ||
common/types/mapper/thrift/any.go | 17 | 25 | 68.0% | ||
service/history/handler/handler.go | 24 | 34 | 70.59% | ||
<!-- | Total: | 689 | 824 | 83.62% | --> |
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
common/task/weighted_round_robin_task_scheduler.go | 2 | 88.56% | ||
common/peerprovider/ringpopprovider/config.go | 2 | 81.58% | ||
service/matching/tasklist/task_list_manager.go | 2 | 77.05% | ||
common/quotas/global/collection/internal/limiter.go | 2 | 97.56% | ||
service/frontend/api/handler.go | 2 | 75.62% | ||
service/history/task/task.go | 3 | 84.81% | ||
service/history/task/timer_standby_task_executor.go | 3 | 85.63% | ||
tools/cli/admin_db_decode_thrift.go | 3 | 69.23% | ||
common/archiver/filestore/historyArchiver.go | 4 | 80.95% | ||
service/history/task/transfer_active_task_executor.go | 4 | 72.77% | ||
<!-- | Total: | 164 | --> |
Totals | |
---|---|
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: | -0.001% |
Covered Lines: | 107105 |
Relevant Lines: | 149803 |
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % | ||
---|---|---|---|---|---|
common/dynamicconfig/config.go | 11 | 13 | 84.62% | ||
service/history/resource/resource.go | 13 | 15 | 86.67% | ||
common/quotas/global/rpc/error.go | 1 | 4 | 25.0% | ||
service/history/resource/resource_test_utils.go | 0 | 3 | 0.0% | ||
common/resource/resource_test_utils.go | 0 | 5 | 0.0% | ||
common/log/tag/tags.go | 9 | 15 | 60.0% | ||
common/metrics/tags.go | 3 | 9 | 33.33% | ||
common/quotas/global/shared/keymapper.go | 8 | 14 | 57.14% | ||
client/history/peer_resolver.go | 38 | 46 | 82.61% | ||
common/quotas/global/algorithm/requestweighted.go | 22 | 30 | 73.33% | ||
<!-- | Total: | 699 | 850 | 82.24% | --> |
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
common/task/weighted_round_robin_task_scheduler.go | 1 | 89.05% | ||
client/history/peer_resolver.go | 1 | 89.89% | ||
service/history/task/transfer_standby_task_executor.go | 2 | 86.23% | ||
common/cache/lru.go | 2 | 93.01% | ||
common/quotas/global/collection/internal/limiter.go | 2 | 97.56% | ||
common/task/fifo_task_scheduler.go | 2 | 87.63% | ||
service/frontend/api/handler.go | 2 | 75.62% | ||
common/membership/hashring.go | 2 | 84.69% | ||
service/history/handler/handler.go | 3 | 95.65% | ||
common/persistence/statsComputer.go | 3 | 98.18% | ||
<!-- | Total: | 33 | --> |
Totals | |
---|---|
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: | 0.06% |
Covered Lines: | 105261 |
Relevant Lines: | 147236 |
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % | ||
---|---|---|---|---|---|
common/dynamicconfig/config.go | 11 | 13 | 84.62% | ||
service/history/resource/resource.go | 13 | 15 | 86.67% | ||
common/quotas/global/rpc/error.go | 1 | 4 | 25.0% | ||
service/history/resource/resource_test_utils.go | 0 | 3 | 0.0% | ||
common/resource/resource_test_utils.go | 0 | 5 | 0.0% | ||
common/metrics/tags.go | 3 | 9 | 33.33% | ||
common/quotas/global/shared/keymapper.go | 8 | 14 | 57.14% | ||
client/history/peer_resolver.go | 38 | 46 | 82.61% | ||
common/quotas/global/algorithm/requestweighted.go | 22 | 30 | 73.33% | ||
common/types/mapper/thrift/any.go | 17 | 25 | 68.0% | ||
<!-- | Total: | 688 | 852 | 80.75% | --> |
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
common/task/weighted_round_robin_task_scheduler.go | 1 | 89.05% | ||
client/history/peer_resolver.go | 1 | 89.89% | ||
service/history/task/transfer_standby_task_executor.go | 2 | 86.23% | ||
common/mapq/types/policy_collection.go | 2 | 93.06% | ||
common/cache/lru.go | 2 | 93.01% | ||
common/quotas/global/collection/internal/limiter.go | 2 | 97.56% | ||
service/frontend/api/handler.go | 2 | 75.74% | ||
common/persistence/historyManager.go | 2 | 66.67% | ||
service/history/handler/handler.go | 3 | 95.65% | ||
service/history/task/transfer_active_task_executor.go | 3 | 71.09% | ||
<!-- | Total: | 43 | --> |
Totals | |
---|---|
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: | 0.06% |
Covered Lines: | 105261 |
Relevant Lines: | 147238 |
@davidporter-id-au For users who're not particularly interested in this problem, who'll not attempt to roll out flipr config for the global rate-limit feature:
- Are there any meaningful changes they should know about
- Directionally, can they just do nothing and it'll remain-as-is for them?
Added deployment steps near the top of the commit message. Look good?
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % | ||
---|---|---|---|---|---|
common/dynamicconfig/config.go | 11 | 13 | 84.62% | ||
service/history/resource/resource.go | 13 | 15 | 86.67% | ||
common/quotas/global/rpc/error.go | 1 | 4 | 25.0% | ||
service/history/resource/resource_test_utils.go | 0 | 3 | 0.0% | ||
common/resource/resource_test_utils.go | 0 | 5 | 0.0% | ||
common/metrics/tags.go | 3 | 9 | 33.33% | ||
common/quotas/global/shared/keymapper.go | 8 | 14 | 57.14% | ||
client/history/peer_resolver.go | 38 | 46 | 82.61% | ||
common/quotas/global/algorithm/requestweighted.go | 22 | 30 | 73.33% | ||
common/types/mapper/thrift/any.go | 17 | 25 | 68.0% | ||
<!-- | Total: | 684 | 848 | 80.66% | --> |
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
common/task/weighted_round_robin_task_scheduler.go | 1 | 89.05% | ||
client/history/peer_resolver.go | 1 | 89.89% | ||
service/history/task/transfer_standby_task_executor.go | 2 | 86.64% | ||
common/cache/lru.go | 2 | 93.01% | ||
common/quotas/global/collection/internal/limiter.go | 2 | 97.37% | ||
common/task/fifo_task_scheduler.go | 2 | 85.57% | ||
service/frontend/api/handler.go | 2 | 75.62% | ||
service/history/task/transfer_active_task_executor.go | 2 | 71.17% | ||
common/persistence/statsComputer.go | 3 | 98.18% | ||
common/archiver/filestore/historyArchiver.go | 4 | 80.95% | ||
<!-- | Total: | 21 | --> |
Totals | |
---|---|
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: | 0.07% |
Covered Lines: | 105274 |
Relevant Lines: | 147232 |
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % | ||
---|---|---|---|---|---|
service/history/resource/resource.go | 13 | 15 | 86.67% | ||
common/quotas/global/rpc/error.go | 1 | 4 | 25.0% | ||
service/history/resource/resource_test_utils.go | 0 | 3 | 0.0% | ||
common/resource/resource_test_utils.go | 0 | 5 | 0.0% | ||
common/metrics/tags.go | 3 | 9 | 33.33% | ||
common/quotas/global/shared/keymapper.go | 8 | 14 | 57.14% | ||
client/history/peer_resolver.go | 38 | 46 | 82.61% | ||
common/quotas/global/algorithm/requestweighted.go | 22 | 30 | 73.33% | ||
common/types/mapper/thrift/any.go | 17 | 25 | 68.0% | ||
service/history/handler/handler.go | 20 | 30 | 66.67% | ||
<!-- | Total: | 686 | 848 | 80.9% | --> |
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
common/task/weighted_round_robin_task_scheduler.go | 1 | 89.05% | ||
client/history/peer_resolver.go | 1 | 89.89% | ||
common/cache/lru.go | 2 | 93.01% | ||
service/matching/tasklist/task_list_manager.go | 2 | 76.65% | ||
common/quotas/global/collection/internal/limiter.go | 2 | 97.37% | ||
common/task/fifo_task_scheduler.go | 2 | 85.57% | ||
service/history/shard/context.go | 9 | 78.13% | ||
<!-- | Total: | 19 | --> |
Totals | |
---|---|
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: | 0.09% |
Covered Lines: | 105301 |
Relevant Lines: | 147232 |
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % | ||
---|---|---|---|---|---|
service/history/resource/resource.go | 13 | 15 | 86.67% | ||
common/quotas/global/rpc/error.go | 1 | 4 | 25.0% | ||
service/history/resource/resource_test_utils.go | 0 | 3 | 0.0% | ||
common/resource/resource_test_utils.go | 0 | 5 | 0.0% | ||
common/metrics/tags.go | 3 | 9 | 33.33% | ||
common/quotas/global/shared/keymapper.go | 8 | 14 | 57.14% | ||
client/history/peer_resolver.go | 38 | 46 | 82.61% | ||
common/quotas/global/algorithm/requestweighted.go | 22 | 30 | 73.33% | ||
common/types/mapper/thrift/any.go | 17 | 25 | 68.0% | ||
service/history/handler/handler.go | 20 | 30 | 66.67% | ||
<!-- | Total: | 688 | 850 | 80.94% | --> |
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
common/task/weighted_round_robin_task_scheduler.go | 1 | 89.05% | ||
client/history/peer_resolver.go | 1 | 89.89% | ||
service/history/task/transfer_standby_task_executor.go | 2 | 86.84% | ||
common/cache/lru.go | 2 | 93.01% | ||
common/quotas/global/collection/internal/limiter.go | 2 | 97.37% | ||
service/matching/tasklist/task_list_manager.go | 3 | 76.45% | ||
common/task/fifo_task_scheduler.go | 3 | 84.54% | ||
common/persistence/statsComputer.go | 3 | 98.18% | ||
service/history/shard/context.go | 9 | 78.13% | ||
<!-- | Total: | 26 | --> |
Totals | |
---|---|
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: | 0.06% |
Covered Lines: | 105257 |
Relevant Lines: | 147234 |
After too many attempts to break this apart and build different portions in self-contained ways, and running into various inter-dependent roadblocks... I just gave up and did it all at once.
Rollout plan for people who don't want or need this system
Do nothing :)
As of this PR, you'll use "disabled" and that should be as close to "no changes at all" as possible. Soon, you'll get "local", and then you'll have some new metrics you can use (or ignore) but otherwise no behavior changes.
And that'll be it. The "global" load-balanced stuff is likely to remain opt-in.
Rollout plan for us
For deployment: any order is fine / should not behave (too) badly. Even if "global" or either shadow mode is selected on the initial deploy. Frontends will have background
RatelimitUpdate
request failures until History is deployed, but that'll just mean it continues to use the "local" internal fallback and that's in practice the same behavior as "local" or "disabled", just slightly noisier.The smoothest deployment is: deploy everything on "disabled" or "local" (the default(s), so no requests are sent until deploy is done), then switch to "local-shadow-global" to warm global limiters / check that it's working, then "global" to use the global behavior.
Rolling back is just the opposite. Ideally disable things first to stop the requests, but even if you don't it should be fine.
In more detail:
frontend.globalRatelimiterMode
) to "disabled", which gets as close as is reasonably possible to acting exactly like it did before this PR."global"
and lowering their RPS back to where we intend, rather than their current artificially-raised-to-mitigate-load-imbalance values.frontend.globalRatelimiterMode
return "global" for keys like.*:my-domain
(to catchuser:my-domain
,worker:my-domain
, etc).constraints: {ratelimitKey: "user:my-domain"}
The changes in a nutshell
(... I guess it's a coconut, given the size)
This PR includes:
quotas.Collection
usage (and are used as a drop-in replacement, though this needed a change to use interfaces).shared.GlobalKey
s are namespaced).disabled
(old code fallthrough),local
(old code with metrics),global
, orx-shadow-y
to use x while shadowing y.shared.GlobalKey
), so in practice we will see things likeuser:domain
<- this is the limiter for "user" requests for that domain, e.g. StartWorkflow RPS. This allows us to roll this out per domain (suffix matches or just compare against all 4 values) and/or per type (prefix matches), so we can adjust to surprises reasonably precisely.quotas
-related code to the newclock.Ratelimiter
APIs as much as possible, which allows some simple wrappers and sharing more logic with otherquotas
package code.rpc
package, and that drives some of this setup, but I'm pretty sure that doesn't make sense with a full plugin-friendly system. So this will almost certainly be moved later.Testing
Aside from the unit tests here, I've locally run all this with the new
development_instance2.yaml
file, made some domains / sent some requests, watched where requests went / how weights changed / when GC occurred / etc. After some bug fixes and the "GC locally after 5 idle periods" change, it seems to be doing exactly what I want it to do, including adjusting as I start and stop the extra instance(s).I would like to build a multi-instance cluster test (or a docker-compose.yaml at the very least) for a variety of kinds of tests, but I wasn't able to find anything that looked promising to build off, and I didn't want to spend a week figuring one out from scratch :\ I'm open to trying if someone has concrete ideas though.
Future changes, roughly in priority order
RatelimitUpdate
for a migrated key will receive all the weight, which is both unfair and may allow exceeding the target RPS. Having aggregators not return data until [update interval] or similar has passed since the first update may be enough to resolve this.golang.org/x/time/rate
needs a PR for its flawed locking.clock.Ratelimiter
likely should not change at all.x/time/rate
will likely still allow time-rewinding, and we'll still need to wrap it and controlreservation.CancelAt
calls / for mocking, and that'll need essentially everything that's currently there.github.com/jonboulle/clockwork
is tough to use with time.Tickers and contexts, and that seems fix-able.ticker.OneShotChan()
API would let us know when a "receive tick -> do something -> go back to waiting" cycle completes, rather than having to sleep and hope it's long enough. Currently we have no real way to work around this.clock.WithTimeout(ctx, dur)
and similar seems rather obviously needed in retrospect, LOTS of time-based stuff uses context timeouts. I have a prototype built but I'm not confident that it's "good enough" to serve as a precise replacement, and we'd need to do something to ensure prod costs are either low enough to accept, or start using build tags to exclude it from prod entirely.membership.Peer
arg tohistory/client.RatelimitUpdate
seems ideal, and is hopefully not too difficult.ValueType
", and some changes to therpc
package to make it generic.