uber-go / cadence-client

Framework for authoring workflows and activities running on top of the Cadence orchestration engine.
https://cadenceworkflow.io
MIT License
344 stars 130 forks source link

Hopeful-mitigation for concurrent map writes on goroutine shutdown #1165

Closed Groxx closed 1 year ago

Groxx commented 2 years ago

Workflow goroutines are shut down concurrently, which can lead to race conditions and crashes. This is an attempt to make them less likely to terminate the hosting process, specifically due to concurrent map writes and reads.

We unambiguously need to fix the underlying issue, but that is substantially more complicated and we're unfortunately pretty busy.


1ms is likely overkill, but even 1us is almost certainly insufficient. There can be non-trivial behavior in deferred functions.

Still, this 1ms is a wild guess, it can be changed or made configurable and it'll likely be fine.


An example stacktrace that this hopefully helps with looks like:

fatal error: concurrent map read and map write

goroutine 4206461 [running]:
runtime.throw({0x266ffb4?, 0x0?})
    GOROOT/src/runtime/panic.go:992 +0x71 fp=0xc016c5af30 sp=0xc016c5af00 pc=0x43e8f1
runtime.mapaccess2(0xc016c5afa8?, 0x7f52ec1cdd20?, 0x8?)
    GOROOT/src/runtime/map.go:476 +0x205 fp=0xc016c5af70 sp=0xc016c5af30 pc=0x4172e5
go.uber.org/cadence/internal.(*decisionsHelper).getDecision(0xc0152c3440, {0x1f83800?, {0xc02295b9c4?, 0x2?}})
    external/org_uber_go_cadence/internal/internal_decision_state_machine.go:691 +0x5c fp=0xc016c5aff8 sp=0xc016c5af70 pc=0xbf6e9c
go.uber.org/cadence/internal.(*decisionsHelper).requestCancelActivityTask(0xc0212e3a01?, {0xc02295b9c4?, 0x1?})
    external/org_uber_go_cadence/internal/internal_decision_state_machine.go:720 +0x26 fp=0xc016c5b038 sp=0xc016c5aff8 pc=0xbf73c6
go.uber.org/cadence/internal.(*workflowEnvironmentImpl).RequestCancelActivity(0xc0152c4c60, {0xc02295b9c4, 0x3})
    external/org_uber_go_cadence/internal/internal_event_handlers.go:501 +0x4a fp=0xc016c5b0d8 sp=0xc016c5b038 pc=0xbfc2aa
go.uber.org/cadence/internal.(*workflowEnvironmentInterceptor).ExecuteActivity.func3({0x211dde0?, 0xc02b1488e8?}, 0xf0?)
    external/org_uber_go_cadence/internal/workflow.go:782 +0x70 fp=0xc016c5b110 sp=0xc016c5b0d8 pc=0xc41050
go.uber.org/cadence/internal.(*channelImpl).Close(0x28?)
    external/org_uber_go_cadence/internal/internal_workflow.go:770 +0x93 fp=0xc016c5b168 sp=0xc016c5b110 pc=0xc2c5d3
go.uber.org/cadence/internal.(*cancelCtx).cancel(0xc0171c0f00, 0x1, {0x2a03300?, 0xc0001978a0})
    external/org_uber_go_cadence/internal/context.go:351 +0xe7 fp=0xc016c5b1c8 sp=0xc016c5b168 pc=0xbefa67
go.uber.org/cadence/internal.NewDisconnectedContext.func1()
    external/org_uber_go_cadence/internal/context.go:209 +0x30 fp=0xc016c5b1f8 sp=0xc016c5b1c8 pc=0xbeef10
runtime.deferCallSave(0xc016c5b288, 0xc008252bb0?)
    GOROOT/src/runtime/panic.go:750 +0x82 fp=0xc016c5b208 sp=0xc016c5b1f8 pc=0x43e0a2
runtime.runOpenDeferFrame(0x10000000040e3ad?, 0xc01bdd5bd0)
    GOROOT/src/runtime/panic.go:723 +0x1a5 fp=0xc016c5b250 sp=0xc016c5b208 pc=0x43dec5
runtime.Goexit()
    GOROOT/src/runtime/panic.go:482 +0x177 fp=0xc016c5b2d0 sp=0xc016c5b250 pc=0x43d5b7
go.uber.org/cadence/internal.(*coroutineState).exit.func1({0xc01081f340?, 0x29fac30?}, 0xc0006ba201?)
    external/org_uber_go_cadence/internal/internal_workflow.go:850 +0x17 fp=0xc016c5b2e0 sp=0xc016c5b2d0 pc=0xc2cdb7
go.uber.org/cadence/internal.(*coroutineState).initialYield(0xc0164e5240, 0x3, {0xc0006ba2c0, 0x1b})
    external/org_uber_go_cadence/internal/internal_workflow.go:797 +0x89 fp=0xc016c5b310 sp=0xc016c5b2e0 pc=0xc2c969
go.uber.org/cadence/internal.(*coroutineState).yield(...)
    external/org_uber_go_cadence/internal/internal_workflow.go:806
go.uber.org/cadence/internal.(*channelImpl).Receive(0xc01b549710, {0x2a29c48, 0xc0171c0f00}, {0x0, 0x0})
    external/org_uber_go_cadence/internal/internal_workflow.go:621 +0x245 fp=0xc016c5b398 sp=0xc016c5b310 pc=0xc2b925
go.uber.org/cadence/internal.(*decodeFutureImpl).Get(0xc0237c4480, {0x2a29c48, 0xc0171c0f00}, {0x0?, 0x0})

We have seen both panics on that decisions map and panics in cancelCtx's children map. There are likely others though, and this can also occur in user's code as this violates a core piece of our execution model.

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build f8cb57f3-baf8-45f9-956c-c7767cc8256c


Files with Coverage Reduction New Missed Lines %
internal/internal_task_pollers.go 2 81.64%
<!-- Total: 2 -->
Totals Coverage Status
Change from base Build 2db1895d-7a68-425d-b16f-1dd9cce48c62: 0.1%
Covered Lines: 12399
Relevant Lines: 19441

💛 - Coveralls
davidporter-id-au commented 2 years ago

I'm not sure this diff description matches? This looks like the workflow refresh code

Groxx commented 2 years ago

Hum. I wonder if I pushed the wrong branch or something.

Will fix.

Groxx commented 1 year ago

gonna just close this one - it's still reasonable, but it's also not hard to rebuild.