temporalio / features

Behavior and history compatibility testing for Temporal SDKs
14 stars 17 forks source link

Tag workflow_failed counter metric with exception type #563

Open tsurdilo opened 2 days ago

tsurdilo commented 2 days ago

tag workflow_failed counter metric with the application failure type that caused workflow to fail

this would allow users to filter this metric and omit failure types for which their business logic is expecting to fail the exec, allowing them to still be able to alert on workflow failed counts for unexpected failures

cretz commented 2 days ago

I am not sure we can blindly do this by default. Many people send metrics off to third parties that charge per label/cardinality IIUC and to potentially increase this significantly for a user on SDK upgrade may not be the best thing. We may be able to make it an opt-in option. Also not every failure is an application failure, so would need to know what to set the label to on those.

What we usually tell users that want more than what our metrics offer by default is to make their own metrics. Granted on some SDKs this may not properly catch all types (such as non-determinism in core-based SDKs when configured to make it a failure instead of task failure), but in general an interceptor should work for this case. The benefit of this approach is when a user wants the next label for workflow failure based on something, and then the next, and so on, this can easily be done without altering SDKs.

tsurdilo commented 2 days ago

not sure i fully understand as temporal_activity_execution_failed does include "exception" field that includes the exception type thrown on activity failure (in java sdk at least). not sure why we could not do the same thing for workflow failures?

cretz commented 1 day ago

I think that is only in Java and it's not documented at https://docs.temporal.io/references/sdk-metrics#activity_execution_failed. It's not that we can't add this, it's that we need to be cautious about adding by default and increasing label counts and cardinality on users that simple do an SDK upgrade. In the meantime, users can, for the most part, emit their own metrics with whatever error information they want (including the type). Also there's a difference between "error/exception data type" and "application failure error type" that would have to be discussed too.

tsurdilo commented 1 day ago

got it. ok its just that this use case comes up a lot. basically users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures. but then that skews their completion metrics too. and workarounds of custom metrics and/or custom search attributes are not really good workarounds at all (at least havent been in past when we suggested to users)

Sushisource commented 1 day ago

users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

longquanzheng commented 1 day ago

I actually really wanted to add some tags to the metrics emitted by SDK (and optional). It saved a lot of time if we had this

Thanks, Quanzheng

On Thu, Nov 21, 2024 at 4:58 PM Spencer Judge @.***> wrote:

users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

— Reply to this email directly, view it on GitHub https://github.com/temporalio/features/issues/563#issuecomment-2492655642, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABCQPM32KICCKFPK4H4G4TL2BZ6SPAVCNFSM6AAAAABSFUEOBGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOJSGY2TKNRUGI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

cretz commented 1 day ago

add some tags

Note this is about a single, predefined tag on a single metric that we probably cannot even enable by default, not general purpose addition of tags to SDK metrics. Many metrics are emitted throughout the SDK and even with the metrics abstraction, context is not provided at metrics record time. If possible, any custom metrics needs may need to be done with custom metrics. Granted there are parts of the internal code we don't have user callbacks/interceptors within.

tsurdilo commented 23 hours ago

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

as much as id like to agree with this..well i do, but often its more difficult than that.

its currently not possible to filter executions (visibility api nor metrics) for completed workflows by their result value. this then forces users to use custom search attributes and upsert to some business "failure" value, which on cloud costs an action too. then they no longer can alert on this via metrics either, and have to resort to visibility api, so all in all, to me this is more of a philosophical question, but was told many times by users that they actually want to fail execution in case of business failures. idk whats best but as things are currently, having exception type in workflow failed metric would be useful imho.

the custom metrics approach seems ok but we have sdks that iirc dont allow custom tags as of yet, so still a bit hard to do overall

gandhisamay commented 9 hours ago

users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

@Sushisource @tsurdilo if there are scenarios where retrying doesn't make any sense, simply because the activity responded with a bad request, in that case what is the ideal way to stop the workflow execution? Currently what we do is, we set the property non_retryable=true and throw an application failure. Is there any better way of tackling these, so that we can generate correct metrics for that as well?

drewhoskins-temporal commented 6 hours ago

Would it help to be able to specify an exception's severity (info, warn, error, critical or the like), tag it in metrics, and then filter that tag out? That would address the metrics cardinality issue.