[Bug/Question] non_retryable_error_types seems not working

yitian-reevo commented 3 weeks ago

What are you really trying to do?

I'm testing the behavior of RetryPolicy for an activity, esp the non retryable case

Describe the bug

The code I used for testing as following:

Workflow:

from datetime import timedelta

from temporalio import workflow

with workflow.unsafe.imports_passed_through():
    from salestech_be.temporal.activities import (
        say_hello,
    )
    from temporalio.common import RetryPolicy

@workflow.defn
class SayHello:
    @workflow.run
    async def run(self, name: str) -> str:

        return await workflow.execute_activity(
            say_hello, name, start_to_close_timeout=timedelta(seconds=5),
            retry_policy=RetryPolicy(
                maximum_interval=timedelta(seconds=1),
                non_retryable_error_types=["MY_CUSTOM_ERROR_TYPE"],
                maximum_attempts=2,
            )
        )

Activity

from temporalio import activity
from temporalio.exceptions import ApplicationError

@activity.defn
async def say_hello(name: str) -> str:
    raise ApplicationError(
            message="test",
            details={"reason": "test"},
            type="MY_CUSTOM_ERROR_TYPE",
            non_retryable=True,
        )
    return f"Hello, {name}!"

I also tried different combinations of params when raise the error, none of them works.

Test Script:

await client.start_workflow(
    SayHello.run,
    "test_say_hello",
    id=f"test_say_hello_{uuid4()}",
    task_queue=DEFAULT_TASK_QUEUE,
    id_reuse_policy=WorkflowIDReusePolicy.ALLOW_DUPLICATE_FAILED_ONLY,
)

Based on the docs I read, what I expected is this activity should not be retried, and there will be a Activity Task Failed record in the Event history with something like non-retryable error content.

but in fact, the activity is still being retried and fail on a start_to_close timeout eventually, can I ask some help for understanding this? is it a bug, or I misunderstood something? much appreciate!

Minimal Reproduction

see above

Environment/Versions

OS and processor: m2 macbook pro 15.1
Temporal Version: 1.25.0 and SDK version 1.8.0
Are you using Docker or Kubernetes or building Temporal from source? using Docker

Additional context

cretz commented 3 weeks ago

When I run your code I get:

TypeError: ApplicationError.__init__() got an unexpected keyword argument 'details'

This is because details is a vararg. Changing your code to remove message= and details=, the workflow properly fails bubbling out the activity failure.

the activity is still being retried and fail on a start_to_close timeout eventually

Hrmm, this is a bit confusing. A start to close timeout only fails if the activity body exceeds the timeout, but your code raises immediately. Schedule to close or schedule to start timeouts may make sense if you forgot to register your activity or if you don't have a worker running.

yitian-reevo commented 3 weeks ago

thanks, @cretz, yep I find the 'details' errors as well, but things don't change if i removed the details field. And I'm pretty sure I did register the activity and had a working running.

Are you able to reproduce this your side? it should not be two complicated to reproduce, I've confirmed with one of my team and he claims he faces the same issue.

We both use MacOs with M2 chips, not sure if this matters. just fyi.

yitian-reevo commented 3 weeks ago

oh sorry, i missed this line:

Changing your code to remove message= and details=, the workflow properly fails bubbling out the activity failure.

let me try again.

yitian-reevo commented 3 weeks ago

@cretz After hours of debugging, I can confirm this retry not working issue is due to somehow a conflict of our custom logger built on loguru. But haven't have a clue on how to solve it. At least find the cause now. thanks!

temporalio / sdk-python