Demonstrate ExecuteChildWorkflow bug + prepare test for a fix

Groxx commented 2 years ago

A user noticed incorrect cancellation behavior on one of their workflows, which had workflow code somewhat like this:

// start a bunch of child workflows, add to `cfs`
var cfs []workflow.Future
for _, arg := range args {
  cfs = append(cfs, workflow.ExecuteChildWorkflow(ctx, "stuff", arg)
}

// wait for them to complete
for _, f := range cfs {
  f.Get(...)
}

// run a final child workflow to do the final report
workflow.ExecuteChildWorkflow(ctx, "final", ...).Get(ctx, nil)

When they canceled their parent workflow while "stuff" was still running, it would wait for all the "stuff" to cancel and return (as expected)... ... and then it would start the "final" child, which would never actually finish because the previous "stuff" was canceled, not completed.

For cancellation prior to calling ExecuteChildWorkflow, this can be worked around by checking ctx.Err() == nil, and only executing if true.

For cancellation between ExecuteChildWorkflow and the child being scheduled, there may not be a viable workaround. This time window is thankfully usually very small, so most workflows should not have to worry about it.

The cause appears to be that this cancellation check in ExecuteChildWorkflow depends on childWorkflowExecution being non-nil (since that sends the cancellation event): https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L905-L917 but that variable is only set when the child workflow's "execution" future completes (i.e. it has been scheduled successfully): https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L886-L897

If cancellation occurs prior to that point, the cancellation is ignored for this child. Unfortunately it will also not "detect" this "lost" cancellation later in any way, so the child workflow acts as if it was run with a workflow.NewDisconnectedContext, though it was not.

... unfortunately, fixing this can cause non-deterministic replay errors for anyone who had previously executed the child. For some users this is probably fine (just reset), but not for everyone.

On a fundamental level, we need a way to safely make semantic changes (due to bugfixes or just improvements) in the client, and we do not seem to have an established way to do that yet. Fixing this safely may require us to come up with a strategy, build that, and make use of it.

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 1b0d8058-d708-49f6-bfdc-30fa95821ab5

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.01%) to 63.17%

Totals
Change from base Build 23625b2f-f895-4e62-be1e-e45768586dbe:	0.01%
Covered Lines:	12176
Relevant Lines:	19275

💛 - Coveralls

Groxx commented 2 years ago

Gonna close this, as the real fix will be in some other PR (possibly #1144), and they'll likely be merged together.

uber-go / cadence-client

Demonstrate ExecuteChildWorkflow bug + prepare test for a fix #1138

Pull Request Test Coverage Report for Build 1b0d8058-d708-49f6-bfdc-30fa95821ab5

💛 - Coveralls