uber-go / cadence-client

Framework for authoring workflows and activities running on top of the Cadence orchestration engine.
https://cadenceworkflow.io
MIT License
344 stars 130 forks source link

Maintain a stable order of children context, resolves a non-determinism around cancels #1183

Closed Groxx closed 1 year ago

Groxx commented 2 years ago

After some tough-to-identify determinism issues in what appeared to be correct user workflows, and some investigations by both us and them: This PR resolves a non-deterministic behavior involving child context cancellation propagation, in particular when unblocking selects based on those contexts (possibly transitively, e.g. via activity futures).

As this was previously non-deterministic behavior, both the previous and new code could cause determinism failures after upgrading... but the random execution order previously stood a good chance of failing a few times and then automatically resolving itself. Unfortunately that is not maintained here - failures are likely to be permanent.

Resolving this is... probably not feasible currently. We do not record client-library versions in workflow history, so we cannot maintain backwards compatibility accurately in scenarios like this. We almost certainly should record this on decisions, at least when it changes - we could randomly cancel entries in the list when replaying old decisions, and allow the random behavior to eventually choose a stable execution on a host somewhere.

In any case, for all future workflows this makes behavior deterministic, and should resolve the issue for good.


A full repro can be seen with:

  1. Create multiple cancellable child contexts off a single cancellable parent context, populating its child-context map.
  2. Base some behavior off each child context. Any one-shot logic works, but activities are pretty easy and occur a lot in practice (i.e. waiting on N activities, and being able to cancel many at once).
  3. Block on the selector.
  4. Cancel the parent context. This will:
    1. Cancel the parent context
    2. Propagate that to a random child context
    3. Which will synchronously resolve the future(s) attached to the child context
    4. Which will synchronously trigger any pending callbacks
    5. One of which is a "first call wins" closure which the selector uses to choose which branch to execute

Maintaining the children contexts in an order resolves this, as it ensures the same child is canceled first (then second, etc) each time. Any order should work. For clearer semantics, I chose to implement it as a compacting FIFO list (as children can remove themselves if they are cancelled independently). This is not noticeably costly (maintenance in a large list will be dwarfed by any side effects of canceling) and it makes it very easy to define and hopefully maintain, as it must not be changed.


This order decision will not be a defined semantic of workflows, however. Cancellation of multiple futures / selector branches should be treated as unordered, and implementing exactly the same behavior in other languages may not be efficient. In a future implementation it may be worth making selectors choose from any available branch pseudo-randomly, e.g. by run-ID, for the same reason Go explicitly randomizes these behaviors: it prevents accidentally depending on implementation details, by exposing logical flaws sooner.

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 01843045-5520-4306-94ac-81baa73093bf


Totals Coverage Status
Change from base Build 01838582-1814-4eb0-a654-48c8844fff22: 0.03%
Covered Lines: 12618
Relevant Lines: 19663

💛 - Coveralls