uber-go / cadence-client

Framework for authoring workflows and activities running on top of the Cadence orchestration engine.
https://cadenceworkflow.io
MIT License
345 stars 131 forks source link

Add full locking to workflow.Context to work around shutdown races #1139

Closed Groxx closed 2 years ago

Groxx commented 2 years ago

Regrettable but pretty simple to add. When goroutines are shut down correctly, these locks can (and should) be removed.

.... and then I discovered that Close has synchronous side effects. I'm not super confident that this is stable, I'm not familiar enough with all the code to know for sure :|


Unfortunately it looks like #1124 is not sufficient in practice, there are other races in our users' workflows :\ This is a (nearly) direct mimicry of go's context.go locking in 1.13 (our oldest supported version), but beyond that I have not really thought about the details. It fixes the new test's new race though, at least.

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 957cbed2-c422-42c6-bf41-543d48cd82e2


Changes Missing Coverage Covered Lines Changed/Added Lines %
internal/context.go 13 14 92.86%
<!-- Total: 13 14 92.86% -->
Files with Coverage Reduction New Missed Lines %
internal/internal_task_pollers.go 2 80.67%
<!-- Total: 2 -->
Totals Coverage Status
Change from base Build fc712f9a-e71b-4295-8d09-ed2a531ba0d8: 0.04%
Covered Lines: 12180
Relevant Lines: 19281

💛 - Coveralls
Groxx commented 2 years ago

:\ no, I think this may change execution order, given the order of propagation. gonna try something else.

demirkayaender commented 2 years ago

Were you able to repro the issue in your test?

Groxx commented 2 years ago

@demirkayaender not the one I was aiming for, unfortunately. I found yet another instead :|

❯ go test -test.run TestContext_RaceRegression_2 -count 1000 ./internal
fatal error: concurrent map writes

goroutine 648 [running]:
runtime.throw({0x1aa1d1f, 0x1000000017c9779})
    /usr/local/Cellar/go/1.17.1/libexec/src/runtime/panic.go:1198 +0x71 fp=0xc000071a58 sp=0xc000071a28 pc=0x1036e91
runtime.mapassign(0xc0005043c0, 0xc00048e230, 0x0)
    /usr/local/Cellar/go/1.17.1/libexec/src/runtime/map.go:585 +0x4d6 fp=0xc000071ad8 sp=0xc000071a58 pc=0x10105d6
go.uber.org/cadence/internal.propagateCancel({0x1bd7768, 0xc0005043c0}, {0x1bd0620, 0xc00048e230})
    /Users/stevenl/gocode/src/go.uber.org/cadence/internal/context.go:233 +0x11c fp=0xc000071b30 sp=0xc000071ad8 pc=0x169d1dc
go.uber.org/cadence/internal.WithCancel({0x1bd7768, 0xc0005043c0})
    /Users/stevenl/gocode/src/go.uber.org/cadence/internal/context.go:194 +0xbc fp=0xc000071b80 sp=0xc000071b30 pc=0x169ce5c
go.uber.org/cadence/internal.TestContext_RaceRegression_2.func1.1.1()
    /Users/stevenl/gocode/src/go.uber.org/cadence/internal/context_test.go:76 +0x2a fp=0xc000071bb0 sp=0xc000071b80 pc=0x17c0daa
runtime.deferCallSave(0xc000071c78, 0xc000071ef8)
    /usr/local/Cellar/go/1.17.1/libexec/src/runtime/panic.go:950 +0x82 fp=0xc000071bc0 sp=0xc000071bb0 pc=0x1036622
runtime.runOpenDeferFrame(0x100000001006e2d, 0xc00059ee10)
    /usr/local/Cellar/go/1.17.1/libexec/src/runtime/panic.go:889 +0x27b fp=0xc000071c40 sp=0xc000071bc0 pc=0x1035fbb
runtime.Goexit()

I haven't yet reproduced their mapiternext crash.