sartography / SpiffWorkflow

A powerful workflow engine implemented in pure Python
GNU Lesser General Public License v3.0
1.69k stars 313 forks source link

Stalled gateway <+> with incoming condition (v2.0.0rc0) #341

Closed asauve2 closed 1 year ago

asauve2 commented 1 year ago

When adding a new simple unit test for threaded tasks, I stumbled upon a situation when a parallel Gateway section never get executed.

I'm not sure if it is something that is disallowed by the BPMN norm. Anyway neither the Camunda 7 editor, not the spiff-v2.0.0rc0 runner complains about the broken workflow case.

Broken case

This is the simplified case which does not get executed. It is in essence a for loop with the end_condition placed on the last exclusive gateway.

Parallel_gateway_BUGGED_case

A first call to BpmnWorkflow.do_engine_steps() leads to this task states list obtained by printing workflow.get_tasks().

- TASK: <Task object (Root) in state COMPLETED at 0x7fc3869ec700>
    - children=[<Task object (Start) in state COMPLETED at 0x7fc385e36520>]
  SPEC: <SpiffWorkflow.bpmn.specs.control.SimpleBpmnTask object at 0x7fc385e9b160>
- TASK: <Task object (Start) in state COMPLETED at 0x7fc385e36520>
    - children=[<Task object (StartEvent) in state COMPLETED at 0x7fc385e36e80>]
  SPEC: <SpiffWorkflow.bpmn.specs.control.BpmnStartTask object at 0x7fc385ea8d60>
- TASK: <Task object (StartEvent) in state COMPLETED at 0x7fc385e36e80>
    - children=[<Task object (Gateway_1) in state WAITING at 0x7fc385e3cbb0>]
  SPEC: <SpiffWorkflow.bpmn.specs.defaults.StartEvent object at 0x7fc385e36610>
- TASK: <Task object (Gateway_1) in state WAITING at 0x7fc385e3cbb0>
    - children=[<Task object (Activity_A) in state FUTURE at 0x7fc385e3c1c0>]
  SPEC: <SpiffWorkflow.bpmn.specs.defaults.ParallelGateway object at 0x7fc385e36940>
- TASK: <Task object (Activity_A) in state FUTURE at 0x7fc385e3c1c0>
    - children=[<Task object (Gateway_2) in state FUTURE at 0x7fc385e3c730>]
  SPEC: PIDOScriptTask(bpmn_id="Activity_A", script="""work.a = work .a + 1
print("IN A")""")
- TASK: <Task object (Gateway_2) in state FUTURE at 0x7fc385e3c730>
    - children=[<Task object (Activity_increment) in state FUTURE at 0x7fc385e3cb50>]
  SPEC: <SpiffWorkflow.bpmn.specs.defaults.ParallelGateway object at 0x7fc385e36f40>
- TASK: <Task object (Activity_increment) in state FUTURE at 0x7fc385e3cb50>
    - children=[<Task object (Gateway_if) in state FUTURE at 0x7fc385e3cc70>]
  SPEC: PIDOScriptTask(bpmn_id="Activity_increment", script="""work.i_loop = work.i_loop + 1
print("work.i_loop=",work.i_loop)""")
- TASK: <Task object (Gateway_if) in state FUTURE at 0x7fc385e3cc70>
    - children=[<Task object (EndEvent) in state LIKELY at 0x7fc385e3cd00>, <Task object (Gateway_1) in state MAYBE at 0x7fc385e3ccd0>]
  SPEC: <SpiffWorkflow.bpmn.specs.defaults.ExclusiveGateway object at 0x7fc385e3c4f0>
- TASK: <Task object (EndEvent) in state LIKELY at 0x7fc385e3cd00>
    - children=[<Task object (test_KO.EndJoin) in state LIKELY at 0x7fc385e3cdf0>]
  SPEC: <SpiffWorkflow.bpmn.specs.defaults.EndEvent object at 0x7fc385e3c820>
- TASK: <Task object (test_KO.EndJoin) in state LIKELY at 0x7fc385e3cdf0>
    - children=[]
  SPEC: <SpiffWorkflow.bpmn.specs.control._EndJoin object at 0x7fc385ea8fd0>
- TASK: <Task object (Gateway_1) in state MAYBE at 0x7fc385e3ccd0>
    - children=[<Task object (Activity_A) in state MAYBE at 0x7fc385e3cd60>]
  SPEC: <SpiffWorkflow.bpmn.specs.defaults.ParallelGateway object at 0x7fc385e36940>
- TASK: <Task object (Activity_A) in state MAYBE at 0x7fc385e3cd60>
    - children=[]
  SPEC: PIDOScriptTask(bpmn_id="Activity_A", script="""work.a = work .a + 1
print("IN A")""")

As one can see, the first parallel gateway remains in state WAITING and Activity_A is never called, even with repeated calls to do_engine_steps().

I don't know if this is an implementation feature but Activity_A and Gateway_1 are duplicated at the end of this list.

Working case

In this second case, the situation is changed, so Gateway_1 has only non-conditional incoming transitions.

Parallel_gateway_WORKING_case

Here the workflow works as intended.

Discussion

I suspect that the locking has something to do with the parallel gateway having some incoming transition from a task (Gateway_if) in FUTURE state, but I have not checked the gateway code yet.

Anyhow, the broken workflow above looks like something that on first intuition should be working. However on second tought and thinking about the gateway as an element that joins everything incoming before proceeding it should not.

If it is not allowed then a parallel gateway with more than one incoming transition should probably raise an exception?

essweine commented 1 year ago

This is replicable with the the following even simpler diagram:

Screenshot_2023-06-19_21-20-18

Instantiated tasks (as opposed to task specs) are always part of a tree (even in the spec contains a loop). When branches of a gateway are merged, spiff looks for any tasks associated with that spec and waits to proceed until all the branches that contain the gateway reach the task.

This causes a problem here, because the the parallel gateway has two direct ancestors: the 'Set x" task and the exclusive gateway, and the latter can't be reached until the gateway is executed once.

I am not sure what the best way to deal with this case is, and resolving the issue will likely take some thought (I can think of a couple of potential solutions, but they'd require some testing to make sure they don't break other cases).

The working case is a decent workaround, but it is reasonable to expect that the diagrams that don't work would work, though this might not turn out to be a very easy problem to fix. We should either figure out how to make it work (the ideal resolution) without breaking anything else, or raise an exception to warn that it won't, as you note.

asauve2 commented 1 year ago

The problem above raised a long and intense discussion yesterday in our own team on the very same topics which started from the assumption that it seems intuitively that the broken case should work.

After doing more researh on documentation material the BPMN norm 2.0.2 indicates in sec. 10.6.4 at page 293:

A Parallel Gateway creates parallel paths without checking any conditions; each outgoing Sequence Flow receives a token upon execution of this Gateway. For incoming flows, the Parallel Gateway will wait for all incoming flows before triggering the flow through its outgoing Sequence Flows

A cross check on Camunda 7 documentation / parallel gateway tells:

An important difference with other gateway types is that the parallel gateway does not evaluate conditions. If conditions are defined on the sequence flow connected with the parallel gateway, they are simply ignored.

The BPMN standard is seemingly ambiguous about the fact that the gateway triggers after waiting incoming flows while at the same time indicating that incoming conditions are ignored. The Camunda 7 documentation provides the same information but insist more on the fact that incoming conditions are simply ignored.

After cooking all that together it looks like a possible implementation in SpiffWorkflow would consist in checking whether an incoming flow has a condition set on it or not. If this is the case it could just be simply ignored for a <+>gateway in state WAITING ? This design looks in fact more consistent as it would behave the same as the working case above that triggers the activity even when an incoming condition is not fulfilled...