After some tough-to-identify determinism issues in what appeared to be correct user workflows, and some investigation by both us and the affected users, this PR resolves a non-deterministic behavior involving child context cancellation propagation, in particular when unblocking selects based on those contexts (possibly transitively, e.g. via activity futures).
As this was previously non-deterministic behavior, both the previous and new code could cause determinism failures after upgrading... but the random execution order previously stood a good chance of failing a few times and then automatically resolving itself. Unfortunately that is not maintained here: the order is now fixed, so a workflow whose history disagrees with it will fail the same way on every replay, and failures are likely to be permanent.
Resolving this is... probably not feasible currently. We do not record client-library versions in workflow history, so we cannot maintain backwards compatibility accurately in scenarios like this. We almost certainly should record this on decisions, at least when it changes - we could randomly cancel entries in the list when replaying old decisions, and allow the random behavior to eventually choose a stable execution on a host somewhere.
In any case, for all future workflows this makes behavior deterministic, and should resolve the issue for good.
A full repro can be seen with:
1. Create multiple cancellable child contexts off a single cancellable parent context, populating its child-context map.
2. Base some behavior off each child context. Any one-shot logic works, but activities are pretty easy and occur a lot in practice (i.e. waiting on N activities, and being able to cancel many at once).
3. Block on the selector.
4. Cancel the parent context. This will:
   - Cancel the parent context
   - Propagate that to a random child context
   - Which will synchronously resolve the future(s) attached to the child context
   - Which will synchronously trigger any pending callbacks
   - One of which is a "first call wins" closure which the selector uses to choose which branch to execute
Maintaining the child contexts in a fixed order resolves this, as it ensures the same child is canceled first (then second, etc.) each time. Any order should work.
For clearer semantics, I chose to implement it as a compacting FIFO list (as children can remove themselves if they are canceled independently). This is not noticeably costly (maintenance in a large list will be dwarfed by any side effects of canceling) and it makes the order very easy to define and hopefully maintain, as it must not be changed.
This order decision will not be a defined semantic of workflows, however. Cancellation of multiple futures / selector branches should be treated as unordered, and implementing exactly the same behavior in other languages may not be efficient.
In a future implementation it may be worth making selectors choose from any available branch pseudo-randomly, e.g. by run-ID, for the same reason Go explicitly randomizes these behaviors: it prevents accidentally depending on implementation details, by exposing logical flaws sooner.