"Number of retries if task failed" feature is not working as expected

sandykb commented 1 year ago

We have the "Number of retries if task failed" option on all Azure DevOps tasks, which re-runs a task if it fails.

Actual: With the Trigger Build task, when a retry count is set and the task fails, it triggers one build per retry at nearly the same time (e.g. a retry count of 3 results in three builds) and proceeds without waiting for the results, even when "Wait for completion" is selected.

Expected: When the task fails, it should trigger another run and wait for its result, repeating until it reaches the maximum number of retries.
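To make the expected behavior concrete, here is a minimal Python sketch of a sequential "trigger, wait, retry" loop. It is only an illustration: `trigger_with_retries`, `trigger_build`, and `wait_for_result` are hypothetical names, not part of the Trigger Build task or of Azure Pipelines.

```python
from typing import Callable


def trigger_with_retries(
    trigger_build: Callable[[], int],
    wait_for_result: Callable[[int], bool],
    max_retries: int,
) -> list[int]:
    """Trigger a build, wait for its result, and retry only after a failure."""
    triggered: list[int] = []
    # One initial attempt plus up to max_retries further attempts.
    for _attempt in range(1 + max_retries):
        build_id = trigger_build()
        triggered.append(build_id)
        # Block until this build finishes before deciding whether to retry.
        if wait_for_result(build_id):
            break
    return triggered


# Fake build system for illustration: build 1 fails, build 2 succeeds.
_ids = iter(range(1, 100))
_results = {1: False, 2: True}
builds = trigger_with_retries(
    lambda: next(_ids),
    lambda build_id: _results.get(build_id, False),
    max_retries=3,
)
print(builds)  # [1, 2]
```

In this sketch a second build is only queued after the first one has finished and failed, so builds never run side by side.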

Error Logs:
2023-03-28T10:09:19.6503322Z ##[warning]RetryHelper encountered task failure, will retry (attempt #: 1 out of 2) after 1000 ms
2023-03-28T10:09:23.2508135Z ##[warning]RetryHelper encountered task failure, will retry (attempt #: 2 out of 2) after 4000 ms

Note: The build was still triggered twice at nearly the same time (about 4 seconds apart), and the task did not wait for either build to complete.

huserben commented 1 year ago

Hi @sandykb

Sorry for the long wait; I was on vacation for the last two weeks.

The "retrying of tasks" is built-in Azure Pipelines functionality available on all tasks, not something specific to this task, so I don't believe I can control how it behaves. Note that there are some potential issues when using retries for a task (copied from the docs):

Here are a few things to note when using retries:
- The failing task is retried immediately.
- There is no assumption about the idempotency of the task. If the task has side-effects (for instance, if it created an external resource partially), then it may fail the second time it is run.
- There is no information about the retry count made available to the task.
- A warning is added to the task logs indicating that it has failed before it is retried.
- All of the attempts to retry a task are shown in the UI as part of the same task node.

So it could be that you are experiencing some unwanted side-effects.
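As a rough illustration of such a side-effect, the Python sketch below simulates a task that queues a build and then fails, run under a retry loop that knows nothing about what the task already did. This is an assumption about one possible failure mode, not a confirmed diagnosis; `trigger_build_task` and `run_with_agent_retries` are hypothetical names and do not model the real RetryHelper.

```python
queued_builds: list[str] = []  # stands in for the external build queue


def trigger_build_task() -> None:
    """Hypothetical task body: queue a build (side effect), then fail."""
    queued_builds.append(f"build-{len(queued_builds) + 1}")
    raise RuntimeError("task failed after queuing the build")


def run_with_agent_retries(task, retry_count: int) -> bool:
    """Re-run the whole task up to retry_count extra times.

    The loop has no knowledge of side effects from earlier attempts.
    """
    for _ in range(1 + retry_count):
        try:
            task()
            return True
        except RuntimeError:
            continue  # retry the entire task from the start
    return False


succeeded = run_with_agent_retries(trigger_build_task, retry_count=2)
print(succeeded, queued_builds)
```

Because every attempt re-runs the task from the start, each retry queues another build even though the earlier ones were never cancelled.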

Anyway I'm with Microsoft when they state:

When you have a flaky task that fails intermittently in a pipeline, you may have to re-run the pipeline to have it succeed. In most cases, the best way to address a flaky task or script is by fixing the task or script itself

So I'm curious: do you know what is causing the task to fail in the first place? Perhaps your problem can be "properly" fixed by removing the root cause of the failure, so that you don't have to rely on the retry mechanism.

huserben commented 1 year ago

@sandykb any update on this issue from your side?

huserben / TfsExtensions

"Number of retries if task failed" feature is not working as expected #244