uber-go / cadence-client

Framework for authoring workflows and activities running on top of the Cadence orchestration engine.
https://cadenceworkflow.io
MIT License

Prototype of a rate-limiter intended to favor workflows getting history over polling for new workflows. #1221

Open Groxx opened 1 year ago

Groxx commented 1 year ago

This is to address an explosion of GetWorkflowExecutionHistory requests in one of our internal domains.

"Explosion" to the tune of: normally a couple hundred per second, but during this issue we saw up to ~100,000/s. Thankfully our server-side ratelimiters shed that just fine, and the client service didn't have a noticeable CPU increase either (to be fair, the requests were likely mostly waiting in back-off).

A larger description will come after I get some more sleep, but the quick and dirty summary is:

Ultimately the root issue is that these history-loading requests are not limited at all except by the sticky cache size... which there are good reasons to keep as high as is practical. But doing so risks extreme request rates like this.

Decision tasks were regularly taking >10 minutes, just trying to load history.

So this is an attempt to prevent that from happening. It's not yet complete; it just contains the limiter I'm planning, plus tests.


My plan for mitigating / solving ^ that explosion is to allow probably 10 history requests per poller, intentionally and badly starving polls when they end up requesting a lot of history.

And then do something like NewWeightedPoller(pollWeight: 10, historyWeight: 1, maxResources: 29) to allow 9 history requests even when both polls are running, but let history requests block polls when there are a lot of them in flight.
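To make the arithmetic concrete, here is a minimal sketch of a weighted limiter with those numbers. It is not the code in this PR: `weightedLimiter`, `newWeightedLimiter` and the `Try*` / `Release*` methods are hypothetical names, and it uses a non-blocking try-acquire purely to keep the example short (a real limiter would more likely block on a context).

```go
package main

import "sync"

// weightedLimiter is a hypothetical illustration, not the PR's implementation.
// Polls and history requests draw from one shared budget, with different costs.
type weightedLimiter struct {
	mu            sync.Mutex
	pollWeight    int // cost of one in-flight poll
	historyWeight int // cost of one in-flight history request
	maxResources  int // total budget shared by both kinds of request
	used          int // budget currently held
}

func newWeightedLimiter(pollWeight, historyWeight, maxResources int) *weightedLimiter {
	return &weightedLimiter{
		pollWeight:    pollWeight,
		historyWeight: historyWeight,
		maxResources:  maxResources,
	}
}

// tryAcquire reserves weight units if they fit in the remaining budget.
func (l *weightedLimiter) tryAcquire(weight int) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.used+weight > l.maxResources {
		return false
	}
	l.used += weight
	return true
}

// release returns weight units to the budget, never going below zero.
func (l *weightedLimiter) release(weight int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.used -= weight
	if l.used < 0 {
		l.used = 0
	}
}

func (l *weightedLimiter) TryPoll() bool    { return l.tryAcquire(l.pollWeight) }
func (l *weightedLimiter) ReleasePoll()     { l.release(l.pollWeight) }
func (l *weightedLimiter) TryHistory() bool { return l.tryAcquire(l.historyWeight) }
func (l *weightedLimiter) ReleaseHistory()  { l.release(l.historyWeight) }
```

With (10, 1, 29), two polls hold 20 units and leave room for exactly 9 history requests. Going the other way, 10 or more in-flight history requests leave room for only one poll, and 20 or more leave room for none, which is the "history stops polls" behavior.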

10 / 29 / etc will likely need to be found experimentally tho. That's just a blind guess that's sure to stop the problem, but not necessarily perform well.

davidporter-id-au commented 1 year ago

I would add a few brief lines here about why this code is client-side rather than server-side, similar to what we discussed offline.