uber-go / cadence-client

Framework for authoring workflows and activities running on top of the Cadence orchestration engine.
https://cadenceworkflow.io
MIT License
345 stars 131 forks source link

Clear workflow state when not cached and not complete #1111

Closed Groxx closed 3 years ago

Groxx commented 3 years ago

This resolves a bug a user encountered, where a full sticky cache + a query for a non-cached workflow would result in a goroutine leak due to the associated event handler never being shut down. Since this leak retains all in-workflow data, it can eventually lead to an out-of-memory crash, though it does not cause any logic errors (these abandoned goroutines are forever idle once abandoned). There are probably also other scenarios where this is possible.

Since the code paths leading to unlock are rather complex, I've included a DPanic log (panics in debug mode, logs at panic level elsewhere) and metric to check an assumption that I believe to be correct, but have not been able to be entirely confident about. Once this has run in production for a while on a few domains, and verified to not happen, we can likely remove that log and metric.


Separately: this function seems to be far too complex, and is almost certainly duplicating checks made elsewhere, which should not be duplicated like this. There have been multiple issues with state-clearing that have lead to adding conditions to this func, which is a clear sign of a code smell.

State / cache decisions like this should be made in exactly one place ever, and built up as obviously as possible, to ensure gaps like this never occur. In this case we'll likely need to invert the dependency flow somehow, so callers control when cache is cleared based on whether or not it is cached, rather than double-checking internally like this.

We should also probably add go.uber.org/goleak to basically all of our tests, to help ensure we do not have goroutine leaks. This may not have been caught by that, as the steps leading to it are a bit odd and rely on singleton config (sticky cache size), but it may find or prevent others. (we should really rip out these globals too)

coveralls commented 3 years ago

Pull Request Test Coverage Report for Build 7ba441d6-73cb-4354-bc1b-96e337cd686e


Totals Coverage Status
Change from base Build a7c13abd-4387-454a-9c22-60f832525ca2: 0.05%
Covered Lines: 11734
Relevant Lines: 16590

💛 - Coveralls
Groxx commented 3 years ago

As a sample for when this fails, this new test fails when I comment out the new clearState() call:

=== RUN   TestTaskHandlersTestSuite/TestRegression_QueriesDoNotLeakGoroutines
2021-07-09T20:26:35.644-0700    INFO    internal/internal_worker.go:226 No metrics scope configured for cadence worker. Use NoopScope as default.
2021-07-09T20:26:35.644-0700    INFO    internal/internal_worker.go:230 No DataConverter configured for cadence worker. Use default one.
    internal_task_handlers_test.go:1553: 
            Error Trace:    internal_task_handlers_test.go:1553
            Error:          Not equal: 
                            expected: 17
                            actual  : 18
            Test:           TestTaskHandlersTestSuite/TestRegression_QueriesDoNotLeakGoroutines
            Messages:       expected the query to leak no new goroutines.  before query:
                            found unexpected goroutines:
                            [Goroutine 179 in state chan receive, with go.uber.org/cadence/internal.(*coroutineState).initialYield on top of the stack:
                            goroutine 179 [chan receive]:
                            go.uber.org/cadence/internal.(*coroutineState).initialYield(0xc000450ac0, 0x3, 0xc000624000, 0x22)
                                /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:795 +0xe5
                            go.uber.org/cadence/internal.(*coroutineState).yield(0xc000450ac0, 0xc000624000, 0x22)
                                /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:805 +0xb0
                            go.uber.org/cadence/internal.(*channelImpl).Receive(0xc00044e240, 0x23088a8, 0xc000130480, 0x0, 0x0, 0xc000077150)
                                /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:620 +0x4fb
                            go.uber.org/cadence/internal.TestPanic.func1(0x23088a8, 0xc000130480)
                                /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_coroutines_test.go:681 +0x39a
                            go.uber.org/cadence/internal.(*dispatcherImpl).newNamedCoroutine.func1(0x29243c8, 0x1, 0xc000077130, 0x23088a8, 0xc000130480, 0xc000450ac0)
                                /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:883 +0x14d
                            created by go.uber.org/cadence/internal.(*dispatcherImpl).newNamedCoroutine
                                /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:874 +0x196

                             Goroutine 249 in state select, with go.uber.org/cadence/internal.(*cadenceInvoker).heartbeatAndScheduleNextRun.func1 on top of the stack:
                             ......
                             you get the idea
                             ......
    internal_task_handlers_test.go:1487: failed to clean up goroutines.
        Original state:
        found unexpected goroutines:
        [Goroutine 179 in state chan receive, with go.uber.org/cadence/internal.(*coroutineState).initialYield on top of the stack:
        goroutine 179 [chan receive]:
        go.uber.org/cadence/internal.(*coroutineState).initialYield(0xc000450ac0, 0x3, 0xc000624000, 0x22)
            /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:795 +0xe5
        go.uber.org/cadence/internal.(*coroutineState).yield(0xc000450ac0, 0xc000624000, 0x22)
            /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:805 +0xb0
        go.uber.org/cadence/internal.(*channelImpl).Receive(0xc00044e240, 0x23088a8, 0xc000130480, 0x0, 0x0, 0xc000077150)
            /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:620 +0x4fb
        go.uber.org/cadence/internal.TestPanic.func1(0x23088a8, 0xc000130480)
            /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_coroutines_test.go:681 +0x39a
        go.uber.org/cadence/internal.(*dispatcherImpl).newNamedCoroutine.func1(0x29243c8, 0x1, 0xc000077130, 0x23088a8, 0xc000130480, 0xc000450ac0)
            /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:883 +0x14d
        created by go.uber.org/cadence/internal.(*dispatcherImpl).newNamedCoroutine
            /Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:874 +0x196

        ... like before, lots of other stacks

Unfortunately, because there's no structure (these are just error strings), there's no quick way to filter this list down to just the new stack trace. I'm not too concerned though - if we were not leaking goroutines elsewhere in tests, this would be a list of 1 turning into a list of 2, which is trivial to check by hand.