Closed danthegoodman1 closed 2 months ago
We recently observed a case where a client had retried nearly 100,000 times without ever reporting an error:
temporal_client::retry: gRPC call poll_activity_task_queue retried 94372 times
Even querying the worker's state didn't indicate any problem, so the service's heartbeat (which checks worker state) also didn't fail:
{"runState":"RUNNING","numHeartbeatingActivities":0,"workflowPollerState":"SHUTDOWN","activityPollerState":"POLLING","hasOutstandingWorkflowPoll":false,"hasOutstandingActivityPoll":true,"numCachedWorkflows":0,"numInFlightWorkflowActivations":0,"numInFlightActivities":0}
(Note: This service only handles activities, so workflowPollerState SHUTDOWN was expected.)
This log absolutely needs to be an ERROR. Maybe after some threshold, but I've seen it get to >100 when the client was having connectivity issues to the server. I understand metrics might solve this, but client metrics should not be required to run Temporal clients reliably.
If a client is unable to get work, after a very long time of trying and failing, that should be considered an error.
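To make the "after some threshold" idea concrete, here is a minimal sketch of what I mean. Everything in it is hypothetical: `RetryEscalationPolicy`, `levelForRetry`, `formatRetryLog`, and the threshold of 100 are illustrative names and values, not anything that exists in the Temporal SDK or its core library today.

```typescript
// Hypothetical sketch only: none of these names come from the Temporal SDK.
type LogLevel = "WARN" | "ERROR";

interface RetryEscalationPolicy {
  // Retry count at which the log level switches from WARN to ERROR (assumed value).
  errorThreshold: number;
}

// Pick the log level for a given retry count.
function levelForRetry(attempt: number, policy: RetryEscalationPolicy): LogLevel {
  return attempt >= policy.errorThreshold ? "ERROR" : "WARN";
}

// Build a log line shaped like the one quoted above, prefixed with the chosen level.
function formatRetryLog(
  callName: string,
  attempt: number,
  policy: RetryEscalationPolicy,
): string {
  return `${levelForRetry(attempt, policy)} gRPC call ${callName} retried ${attempt} times`;
}

const policy: RetryEscalationPolicy = { errorThreshold: 100 };
console.log(formatRetryLog("poll_activity_task_queue", 5, policy));
console.log(formatRetryLog("poll_activity_task_queue", 94372, policy));
```

With something like this, the first few retries during a brief blip would stay at WARN, while a poller that has been failing for a long time would surface as ERROR without anyone having to wire up client metrics.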