temporalio / sdk-java

Temporal Java SDK
https://temporal.io
Apache License 2.0

Test server sometimes fails to include signal in first WFT #2127

Open dandavison opened 4 days ago

dandavison commented 4 days ago

Using the Python SDK, I did the following:

  1. handle = await start_workflow()
  2. await handle.signal()
  3. run worker

Expected Behavior

I expect Python to process a signal_workflow job and then a start_workflow job in the activation for the first WFT.

Actual Behavior

Nearly always, we see the expected behavior. Occasionally (on macos-intel builds) Python processes a start_workflow activation job first. Almost certainly this is because the first WFT has no signal in it, although I have not yet investigated further to confirm that (the test in question exits immediately if it sees start_workflow before signal_workflow).
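The ordering check described above can be sketched roughly as follows. This is only an illustration, not the actual sdk-python test code: the job names are plain strings standing in for activation job variants, and `first_relevant_job` and `classify_first_activation` are hypothetical helpers.

```python
from typing import Iterable, Optional

# Plain strings stand in for activation job variants (an assumption for
# illustration; the real SDK represents jobs differently).
SIGNAL = "signal_workflow"
START = "start_workflow"


def first_relevant_job(jobs: Iterable[str]) -> Optional[str]:
    """Return whichever of SIGNAL/START appears first, or None if neither does."""
    for job in jobs:
        if job in (SIGNAL, START):
            return job
    return None


def classify_first_activation(jobs: Iterable[str]) -> str:
    """Classify the first activation by which job is processed first."""
    first = first_relevant_job(jobs)
    if first == SIGNAL:
        return "expected"      # signal processed before start
    if first == START:
        return "unexpected"    # start processed first: signal missing or late
    return "inconclusive"      # neither job present in this activation
```

Under this model, an activation containing only start_workflow is classified as "unexpected", which is consistent with the hypothesis that the signal was not in the first WFT.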

Steps to Reproduce the Problem

Run the sdk-python test tests/worker/test_workflow.py::test_unfinished_signal_handler_with_workflow_failure with --workflow-environment=time-skipping repeatedly on a GitHub macos-intel runner until you see this failure.
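Since the failure is intermittent, a small loop that reruns the test until it fails may help. The sketch below is one possible way to do that. `run_until_failure` is a hypothetical helper, and invoking pytest via `python -m pytest` is an assumption about how the sdk-python checkout is set up; only the test path and the `--workflow-environment=time-skipping` flag come from this report.

```python
import subprocess
import sys
from typing import Optional, Sequence

# Test id and flag are taken from the report; running pytest through
# `python -m pytest` is an assumption about the local environment.
PYTEST_CMD = [
    sys.executable,
    "-m",
    "pytest",
    "tests/worker/test_workflow.py::test_unfinished_signal_handler_with_workflow_failure",
    "--workflow-environment=time-skipping",
]


def run_until_failure(cmd: Sequence[str], max_runs: int = 50) -> Optional[int]:
    """Run cmd repeatedly.

    Returns the 1-based attempt number of the first non-zero exit,
    or None if every one of the max_runs attempts succeeded.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(list(cmd))
        if result.returncode != 0:
            return attempt
    return None
```

Calling `run_until_failure(PYTEST_CMD)` from the root of an sdk-python checkout would stop at the first failing run. Note that any failure stops the loop, not only the one described here, so the output of the failing run still needs to be inspected.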

Note: There are two variants of the Python test: one involves the workflow throwing ApplicationError, and the other involves the client sending a cancel request, again before starting the worker. Interestingly, I've only seen the error described in this ticket for the ApplicationError variant of the test. This suggests that handling the cancel request somehow causes the test server to include all of the events in the first WFT, whereas without the cancel request the signal event is sometimes omitted.

See failures in build history of https://github.com/temporalio/sdk-python/pull/556