This is to address an explosion of GetWorkflowExecutionHistory requests in one of our internal domains.
"Explosion" to the tune of: normally a couple hundred per second, but during this issue we saw up to ~100,000/s. Thankfully our server-side ratelimiters shed that just fine, and the client service didn't have a noticeable cpu increase either (to be fair, they were likely mostly waiting in back-off).
A larger description will come after I get some more sleep, but the quick and dirty summary is:
- they had many "live" workflows
- they started to build up a decision-schedule queue, slowing them down
- that overloaded caches, causing a lot of un-cached decisions...
- ... leading to a lot of history iterators in new workflows looping, trying to load history, and getting ratelimited...
- ... causing more to loop and try to load history...
- ... slowing things down further and making it worse.
Ultimately the root issue is that these history-loading requests are not limited at all except by the sticky cache size... which there are good reasons to keep as high as is practical. But doing so risks extreme request rates like this.
Decision tasks were regularly >10 minutes, just trying to load history.
So this is an attempt to prevent that from happening. It's not yet complete; it just contains the limiter I'm planning, plus tests.
My plan for mitigating / solving that explosion is to allow probably 10 history requests per poller, intentionally starving polls badly when they end up requesting a lot of history.
And then do something like `NewWeightedPoller(pollWeight: 10, historyWeight: 1, maxResources: 29)` to allow 9 history requests even when both polls are running, but let history stop polls if there are a lot of them.
The 10 / 29 / etc. values will likely need to be found experimentally though. That's just a blind guess that's sure to stop the problem, but not necessarily perform well.