uber-go / cadence-client

Framework for authoring workflows and activities running on top of the Cadence orchestration engine.
https://cadenceworkflow.io
MIT License

Address false positive non-determinism scenarios during replay #1281

Open taylanisikdemir opened 8 months ago

taylanisikdemir commented 8 months ago

What changed? Non-determinism checks during workflow task processing are fixed to replay a history properly and to catch non-determinism scenarios that were previously missed.

Why? In some known scenarios, replay/shadow tests currently fail to flag a non-deterministic change. These false-positive (falsely passing) test results mislead users into deploying non-deterministic workflow changes to prod, which causes workflow failures.

Fix Details The fix includes 2 main changes:

  1. When in replay mode, don't terminate the workflow task processing loop until the whole history is drained.
  2. When in replay mode, use a modified version of this filter that skips the final workflow complete/fail/cancel/continue-as-new events. The state machine has no corresponding replay for these events, and skipping them keeps the post-loop non-determinism comparison from failing on them.
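The two changes above can be illustrated with a minimal, self-contained sketch. It is not the client's actual code: `eventType`, `isTerminal`, `replayFilter`, and `checkDeterminism` are hypothetical names, and the history is reduced to a flat list of event-type strings. It only shows the idea of visiting every history event, dropping the closing events, and comparing what is left against what the current workflow code produces on replay.

```go
package main

import "fmt"

// eventType is a hypothetical, simplified stand-in for a history event type.
type eventType string

const (
	activityScheduled      eventType = "ActivityTaskScheduled"
	timerStarted           eventType = "TimerStarted"
	workflowCompleted      eventType = "WorkflowExecutionCompleted"
	workflowFailed         eventType = "WorkflowExecutionFailed"
	workflowCanceled       eventType = "WorkflowExecutionCanceled"
	workflowContinuedAsNew eventType = "WorkflowExecutionContinuedAsNew"
)

// isTerminal reports whether an event closes the workflow execution.
func isTerminal(e eventType) bool {
	switch e {
	case workflowCompleted, workflowFailed, workflowCanceled, workflowContinuedAsNew:
		return true
	}
	return false
}

// replayFilter walks the entire history (no early exit) and keeps every
// event except the terminal ones, which have no replay counterpart here.
func replayFilter(history []eventType) []eventType {
	var kept []eventType
	for _, e := range history {
		if isTerminal(e) {
			continue
		}
		kept = append(kept, e)
	}
	return kept
}

// checkDeterminism compares the filtered history with the events that the
// current workflow code produced during replay.
func checkDeterminism(history, replayed []eventType) error {
	expected := replayFilter(history)
	if len(expected) != len(replayed) {
		return fmt.Errorf("non-deterministic: history has %d comparable events, replay produced %d",
			len(expected), len(replayed))
	}
	for i := range expected {
		if expected[i] != replayed[i] {
			return fmt.Errorf("non-deterministic at index %d: history %q, replay %q",
				i, expected[i], replayed[i])
		}
	}
	return nil
}

func main() {
	history := []eventType{activityScheduled, timerStarted, workflowCompleted}

	// Unchanged workflow code: replay matches the recorded history.
	fmt.Println(checkDeterminism(history, []eventType{activityScheduled, timerStarted}))

	// Changed workflow code that no longer starts the timer: because the
	// whole history is drained, the leftover TimerStarted event is noticed.
	fmt.Println(checkDeterminism(history, []eventType{activityScheduled}))
}
```

In this sketch, a loop that stopped as soon as the replayed code finished would never reach the `TimerStarted` event, which is the kind of missed case the description refers to.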

The fix changes the conditions in this critical workflow task loop, so it catches these scenarios not only in replay test mode but also during actual replays in prod. A new worker option is introduced to turn off the new checks if needed.

How did you test it? Added more test cases to replay_test.go and to cover positive/negative cases. Existing test cases that covered the current broken behavior (no error when there should be one) are updated as well.

Potential risks The changes are expected to catch more non-determinism scenarios, which might surface existing issues in users' workflows. The new strict non-determinism checks can be disabled via the DisableStrictNonDeterminismCheck worker option if users need time to fix the problematic workflow code (via versioning).
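As a rough illustration of how such an opt-out might gate the stricter check, here is a self-contained sketch. Only the field name `DisableStrictNonDeterminismCheck` comes from this description; the `workerOptions` struct and `verifyReplay` function are hypothetical stand-ins, and the "legacy" branch is an assumption about the old behavior (leftover history events going unnoticed), not a copy of the client's logic.

```go
package main

import "fmt"

// workerOptions is a hypothetical local stand-in for the real worker options.
// Only the field name is taken from the PR description.
type workerOptions struct {
	DisableStrictNonDeterminismCheck bool
}

// verifyReplay compares recorded history events with the events produced by
// replaying the current workflow code. Terminal events are assumed to have
// been filtered out of history already.
func verifyReplay(opts workerOptions, history, replayed []string) error {
	if len(replayed) > len(history) {
		return fmt.Errorf("replay produced %d events, history has only %d",
			len(replayed), len(history))
	}
	for i, r := range replayed {
		if history[i] != r {
			return fmt.Errorf("mismatch at index %d: history %q, replay %q", i, history[i], r)
		}
	}
	// Strict mode (the default in this sketch): the whole history must be
	// consumed. With the check disabled, leftover events are ignored.
	if !opts.DisableStrictNonDeterminismCheck && len(replayed) < len(history) {
		return fmt.Errorf("%d history events were never replayed", len(history)-len(replayed))
	}
	return nil
}

func main() {
	history := []string{"ActivityTaskScheduled", "TimerStarted"}
	replayed := []string{"ActivityTaskScheduled"} // new code dropped the timer

	fmt.Println("strict:", verifyReplay(workerOptions{}, history, replayed))
	fmt.Println("opt-out:", verifyReplay(
		workerOptions{DisableStrictNonDeterminismCheck: true}, history, replayed))
}
```

Defaulting to the strict behavior and making the flag an explicit opt-out matches the intent stated above: the flag is a temporary escape hatch while the workflow code is fixed, not a recommended steady state.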