Closed Groxx closed 3 years ago
| Totals | |
|---|---|
| Change from base Build a7c13abd-4387-454a-9c22-60f832525ca2: | 0.05% |
| Covered Lines: | 11734 |
| Relevant Lines: | 16590 |
As a sample of what a failure looks like, this new test fails when I comment out the new `clearState()` call:
```
=== RUN TestTaskHandlersTestSuite/TestRegression_QueriesDoNotLeakGoroutines
2021-07-09T20:26:35.644-0700 INFO internal/internal_worker.go:226 No metrics scope configured for cadence worker. Use NoopScope as default.
2021-07-09T20:26:35.644-0700 INFO internal/internal_worker.go:230 No DataConverter configured for cadence worker. Use default one.
internal_task_handlers_test.go:1553:
    Error Trace: internal_task_handlers_test.go:1553
    Error:       Not equal:
                 expected: 17
                 actual  : 18
    Test:        TestTaskHandlersTestSuite/TestRegression_QueriesDoNotLeakGoroutines
    Messages:    expected the query to leak no new goroutines. before query:
                 found unexpected goroutines:
                 [Goroutine 179 in state chan receive, with go.uber.org/cadence/internal.(*coroutineState).initialYield on top of the stack:
                 goroutine 179 [chan receive]:
                 go.uber.org/cadence/internal.(*coroutineState).initialYield(0xc000450ac0, 0x3, 0xc000624000, 0x22)
                 	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:795 +0xe5
                 go.uber.org/cadence/internal.(*coroutineState).yield(0xc000450ac0, 0xc000624000, 0x22)
                 	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:805 +0xb0
                 go.uber.org/cadence/internal.(*channelImpl).Receive(0xc00044e240, 0x23088a8, 0xc000130480, 0x0, 0x0, 0xc000077150)
                 	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:620 +0x4fb
                 go.uber.org/cadence/internal.TestPanic.func1(0x23088a8, 0xc000130480)
                 	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_coroutines_test.go:681 +0x39a
                 go.uber.org/cadence/internal.(*dispatcherImpl).newNamedCoroutine.func1(0x29243c8, 0x1, 0xc000077130, 0x23088a8, 0xc000130480, 0xc000450ac0)
                 	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:883 +0x14d
                 created by go.uber.org/cadence/internal.(*dispatcherImpl).newNamedCoroutine
                 	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:874 +0x196
                 Goroutine 249 in state select, with go.uber.org/cadence/internal.(*cadenceInvoker).heartbeatAndScheduleNextRun.func1 on top of the stack:
......
you get the idea
......
internal_task_handlers_test.go:1487: failed to clean up goroutines.
    Original state:
    found unexpected goroutines:
    [Goroutine 179 in state chan receive, with go.uber.org/cadence/internal.(*coroutineState).initialYield on top of the stack:
    goroutine 179 [chan receive]:
    go.uber.org/cadence/internal.(*coroutineState).initialYield(0xc000450ac0, 0x3, 0xc000624000, 0x22)
    	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:795 +0xe5
    go.uber.org/cadence/internal.(*coroutineState).yield(0xc000450ac0, 0xc000624000, 0x22)
    	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:805 +0xb0
    go.uber.org/cadence/internal.(*channelImpl).Receive(0xc00044e240, 0x23088a8, 0xc000130480, 0x0, 0x0, 0xc000077150)
    	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:620 +0x4fb
    go.uber.org/cadence/internal.TestPanic.func1(0x23088a8, 0xc000130480)
    	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_coroutines_test.go:681 +0x39a
    go.uber.org/cadence/internal.(*dispatcherImpl).newNamedCoroutine.func1(0x29243c8, 0x1, 0xc000077130, 0x23088a8, 0xc000130480, 0xc000450ac0)
    	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:883 +0x14d
    created by go.uber.org/cadence/internal.(*dispatcherImpl).newNamedCoroutine
    	/Users/stevenl/gocode/src/go.uber.org/cadence/internal/internal_workflow.go:874 +0x196
... like before, lots of other stacks
```
Unfortunately, because there's no structure here (these are just error strings), there's no quick way to filter the list down to just the new stack trace. I'm not too concerned, though: if we were not already leaking goroutines elsewhere in tests, this would be a list of one turning into a list of two, which is trivial to check by hand.
This resolves a bug a user encountered, where a full sticky cache plus a query for a non-cached workflow would leak a goroutine, because the associated event handler was never shut down. Since each leaked goroutine retains all of its in-workflow data, this can eventually lead to an out-of-memory crash, though it causes no logic errors (the goroutines are forever idle once abandoned). There are probably other scenarios where this is possible as well.
Since the code paths leading to unlock are rather complex, I've included a DPanic log (panics in debug mode, logs at panic level elsewhere) and a metric to check an assumption that I believe is correct, but have not been able to verify with complete confidence. Once this has run in production for a while on a few domains without firing, we can likely remove the log and metric.
Separately: this function seems far too complex, and is almost certainly duplicating checks made elsewhere, which should not be duplicated like this. There have been multiple state-clearing issues that have led to adding conditions to this func, which is a clear code smell.
State / cache decisions like this should be made in exactly one place, and built up as obviously as possible, to ensure gaps like this never occur. In this case we'll likely need to invert the dependency flow somehow, so that callers control when the cache is cleared based on whether or not the workflow is cached, rather than double-checking internally like this.
We should also probably add go.uber.org/goleak to essentially all of our tests, to help ensure we do not leak goroutines. It might not have caught this particular bug, since the steps leading to it are a bit odd and rely on singleton config (the sticky cache size), but it may find or prevent others. (We should really rip out those globals too.)
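A sketch of what adopting goleak could look like, assuming its standard `VerifyTestMain` / `VerifyNone` entry points; the package and test names here are illustrative:

```go
package internal

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain runs the whole package's tests and then fails the run if any
// unexpected goroutines are still alive afterwards.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// Individual tests can also check at their own boundary, which narrows
// a leak down to a single test rather than the whole package:
func TestNoLeaks(t *testing.T) {
	defer goleak.VerifyNone(t)
	// ... test body ...
}
```

Known long-lived goroutines (e.g. from singletons like the sticky cache) would need explicit ignore options, which is also a useful forcing function for finding those globals.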