temporalio / sdk-core

Core Temporal SDK that can be used as a base for language specific Temporal SDKs
MIT License
266 stars 70 forks source link

[Bug] Python 1.6.0 query can reset task failure inadvertently via query #786

Closed cretz closed 1 month ago

cretz commented 1 month ago

Describe the bug

Based on a user report, take the following Python replication:

import asyncio
import logging
from uuid import uuid4

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker

@workflow.defn
class GreetingWorkflow:
    def __init__(self) -> None:
        self._greeting = "<no greeting>"

    @workflow.run
    async def run(self, name: str) -> None:
        # Set the greeting, wait a couple of seconds, then change it
        self._greeting = f"Hello, {name}!"
        await asyncio.sleep(2)
        raise RuntimeError("This is an error")
        self._greeting = f"Goodbye, {name}!"

        # It's ok to end the workflow here. Queries work even after workflow
        # completion.

    @workflow.query
    def greeting(self) -> str:
        return self._greeting

async def main():
    logging.basicConfig(level=logging.DEBUG)

    task_queue = f"tq-{uuid4()}"

    # Start client
    client = await Client.connect("localhost:7233")

    # Run a worker for the workflow
    async with Worker(
        client,
        task_queue=task_queue,
        workflows=[GreetingWorkflow],
    ):

        # While the worker is running, use the client to start the workflow.
        # Note, in many production setups, the client would be in a completely
        # separate process from the worker.
        handle = await client.start_workflow(
            GreetingWorkflow.run,
            "World",
            id=f"hello-query-workflow-id-{uuid4()}",
            task_queue=task_queue,
        )

        # Immediately query
        result = await handle.query(GreetingWorkflow.greeting)
        print(f"First greeting result: {result}")

        # Wait a few of seconds then query again. This works even if the
        # workflow has already completed.
        await asyncio.sleep(3)
        result = await handle.query(GreetingWorkflow.greeting)
        print(f"Second greeting result: {result}")

if __name__ == "__main__":
    asyncio.run(main())

This will reliably replicate the issue. The second query will unset a task failure and eventually succeed. A history is attached as 3ac6a217-8421-44ad-90bc-6ea0edd516d3_events.json.

When debugging, we found that activation failure and eviction was working properly at first. But then when query came in, activation failure was sent to core, but core sent the query anyways, lang responded, only then was eviction sent. Presumably the query (coming via legacy path surely) is given the response server side without the task failure being taken into account.