temporalio / sdk-core

Core Temporal SDK that can be used as a base for language specific Temporal SDKs
MIT License

[Feature Request] poll_workflow_task_queue retried X times needs to be an ERROR log, not WARN #704

Closed: danthegoodman1 closed 2 months ago

danthegoodman1 commented 5 months ago
2024-03-13T14:27:20.049568Z  WARN temporal_client::retry: gRPC call poll_workflow_task_queue retried 16 times error=Status { code: Cancelled, message: "Timeout expired", source: Some(tonic::transport::Error(Transport, TimeoutExpired(()))) }

This log absolutely needs to be an ERROR. Maybe only after some threshold, but I've seen the retry count get to >100 when the client was having connectivity issues to the server.

I understand metrics might solve this, but client metrics should not be required to run temporal clients reliably.

If a client is unable to get work after a very long time of trying and failing, that should be considered an error.
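
To make the threshold idea concrete, here is a minimal sketch of what I mean. It is not how temporal_client::retry is implemented; `Severity`, `severity_for_retry`, and the `error_after` parameter are hypothetical names invented for illustration, and the right threshold value is an open question.

```rust
/// Severity a retry log line would be emitted at.
/// Hypothetical type for illustration; not part of sdk-core.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Severity {
    Warn,
    Error,
}

/// Pick the severity for a "retried N times" log line.
/// Below `error_after` retries the line stays a WARN; at or above it,
/// the line is escalated to ERROR. `error_after` is an assumed,
/// user-configurable threshold, not an existing sdk-core option.
fn severity_for_retry(retries: u32, error_after: u32) -> Severity {
    if retries >= error_after {
        Severity::Error
    } else {
        Severity::Warn
    }
}
```

With an assumed threshold of 10, the "retried 16 times" line above would be logged as ERROR, while the first few retries would still be WARN.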

jhecking commented 3 months ago

We have recently observed a case where a client had retried nearly 100,000 times and still no error was reported:

temporal_client::retry: gRPC call poll_activity_task_queue retried 94372 times

Even querying the worker's state didn't indicate any problem, so the service's heartbeat (which checks worker state) also didn't fail:

{"runState":"RUNNING","numHeartbeatingActivities":0,"workflowPollerState":"SHUTDOWN","activityPollerState":"POLLING","hasOutstandingWorkflowPoll":false,"hasOutstandingActivityPoll":true,"numCachedWorkflows":0,"numInFlightWorkflowActivations":0,"numInFlightActivities":0}

(Note: This service only handles activities, so workflowPollerState SHUTDOWN was expected.)
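
As a rough sketch of what would have let our heartbeat catch this: if the worker state also exposed a count of consecutive failed polls, the check could fail once that count passes a limit. The state shown above has no such field, so `consecutive_failed_polls`, `PollerStatus`, and `poller_is_healthy` below are all hypothetical.

```rust
/// Poller states as they appear in the worker state above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum PollerState {
    Polling,
    Shutdown,
}

/// Hypothetical per-poller status. `consecutive_failed_polls` does not
/// exist in the real worker state; it is assumed here for illustration.
struct PollerStatus {
    state: PollerState,
    consecutive_failed_polls: u64,
}

/// A poller that is shut down is treated as healthy (for an
/// activity-only service the workflow poller is expected to be SHUTDOWN).
/// A polling poller is healthy only while its consecutive failures
/// stay at or below `max_failed_polls`.
fn poller_is_healthy(status: &PollerStatus, max_failed_polls: u64) -> bool {
    match status.state {
        PollerState::Shutdown => true,
        PollerState::Polling => status.consecutive_failed_polls <= max_failed_polls,
    }
}
```

With any reasonable limit, an activity poller at 94372 consecutive failures would have been reported as unhealthy even though its state was still POLLING.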