uber-go / cadence-client

Framework for authoring workflows and activities running on top of the Cadence orchestration engine.
https://cadenceworkflow.io
MIT License
344 stars 130 forks source link

Prevent local-activity panics from taking down the worker process #1169

Closed Groxx closed 2 years ago

Groxx commented 2 years ago

Previously, a local activity that panicked would lead to a panic exiting the goroutine that performed the call, terminating the whole worker process immediately.

Obviously this is not good behavior. So now that's fixed.

Seems like it has been this way since local activities were introduced, in commit 79c072aea60cbf2ed9bfc47ca85e793a8c07d77b , so it's pretty surprising that it went unnoticed / unfixed for this long. Particularly since it stands a decent chance of killing every worker instance due to decision task retries.


I'm not quite confident that this is done, but I'm putting it up for early review / eyeballs / tests while I read more of the surrounding code.

In particular, a simple and non-breaking (because this was always broken) change that may be worth making is to not do retries on local activities that panic:

func (w *workflowExecutionContextImpl) retryLocalActivity(lar *localActivityResult) bool {
    // add IsPanicError to the list
    if lar.task.retryPolicy == nil || lar.err == nil || IsCanceledError(lar.err) || IsPanicError(lar.err) {
        return false
    }

Since panics are inherently a bit more dangerous / imply a more significant violation of expectations, perhaps it's worth treating them as fatal?

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 01818932-264a-4e65-bb34-c6c092e36756


Files with Coverage Reduction New Missed Lines %
internal/compatibility/thrift/history.go 11 57.65%
<!-- Total: 11 -->
Totals Coverage Status
Change from base Build 01816e39-350f-43ae-a79e-40eaad99261d: 0.1%
Covered Lines: 12404
Relevant Lines: 19429

💛 - Coveralls
Groxx commented 2 years ago

On second thought: nah, allow retries on panics. If they have retries configured, they'll have retries, and the onus is on them to include errRetryPanic in the non-retryable list.