Trigger task timing out before triggered build is finished

PaulieC1972 commented 2 years ago

Hi,

I am seeing an issue with a couple of pipelines where the triggered builds run longer than 6 hours.

Trigger Pipeline job:

```
- job: Build_Job
  dependsOn: GetCredentials_Job
  variables:
    PAT: $[ dependencies.GetCredentials_Job.outputs['PAT'] ]
  pool:
    vmimage: windows-latest
  timeoutInMinutes: 1440
  steps:
  - checkout: none
  - task: benjhuser.tfs-extensions-build-tasks.trigger-build-task.TriggerBuild@3
    displayName: Trigger Synthetic Build
    inputs:
      definitionIsInCurrentTeamProject: false
      tfsServer: $(System.CollectionUri)
      teamProject: 'a4aa557b-7dca-453d-9d83-2fe9271852b9'
      buildDefinition: 1234
      buildParameters: triggeredParentBuildId:$(Build.BuildId),triggeredParentBuildName:$(Build.DefinitionName),triggeredParentProjectId:$(System.TeamProjectId)
      useSameBranch: false
      branchToUse: 'main'
      waitForQueuedBuildsToFinish: true
      waitForQueuedBuildsToFinishRefreshTime: 120
      cancelBuildsIfAnyFails: true
      storeInEnvironmentVariable: true
      authenticationMethod: 'Personal Access Token'
      password: '$(PAT)'
      treatPartiallySucceededBuildAsSuccessful: false
```

The triggered builds are large and can run for up to 10 hours, but ADO is pretty uniformly canceling them after 6 hours. I upped the timeoutInMinutes value to 1440, which should be plenty, but they are still getting cut off. Not sure if this is an ADO issue or something I am not specifying here. Is there anything I can do from the trigger build side? Thanks!

PaulieC1972 commented 2 years ago

ugh, code snippet is badly formatted

huserben commented 2 years ago

Hi @PaulieC1972

So let me try to recap your problem to see if I understood it correctly: you trigger a build from one pipeline and want to await it. The triggered build runs for a very long time. Eventually the triggering build is cancelled by ADO?

Please correct me if I've misunderstood.

If that's the case, I'm not sure why it's being cancelled, maybe there is a max timeout, but I'm not sure about this. Just to make sure, there is no "error" as such in the wait task?

If it's a general ADO problem, you might be able to circumvent the issue with a different pipeline design. Instead of Pipeline A triggering Pipeline B and awaiting it, you could have Pipeline A trigger Pipeline B, and Pipeline B trigger a new Pipeline C that does whatever Pipeline A would have done after the build finished. But it depends a bit on your specific use case whether this makes sense or not...
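
A minimal sketch of that chained design, assuming it is placed as the last step of Pipeline B. The input names are the ones used in the snippet at the top of this thread; the definition name `Pipeline C` and the choice of `definitionIsInCurrentTeamProject: true` are assumptions for illustration only.

```yaml
# Last step of Pipeline B (hypothetical placement): queue Pipeline C and
# return immediately, so no job sits idle waiting against a timeout.
steps:
- task: benjhuser.tfs-extensions-build-tasks.trigger-build-task.TriggerBuild@3
  displayName: Trigger Pipeline C
  inputs:
    definitionIsInCurrentTeamProject: true   # assumption: C is in the same project
    buildDefinition: 'Pipeline C'            # hypothetical definition name
    useSameBranch: true
    waitForQueuedBuildsToFinish: false       # do not wait for C to finish
    authenticationMethod: 'Personal Access Token'
    password: '$(PAT)'
```

Because nothing waits, the follow-up work that Pipeline A used to do after the awaited build would have to move into Pipeline C.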

PaulieC1972 commented 2 years ago

Thanks for the follow up!

You are correct in your first sentence.

To expand a bit, I have a release pipeline that does several rings of validation against a commit. The first ring triggers 4 test pipelines. If these are successful, the commit is tagged. That kicks off the second ring, which triggers another set of validation pipelines, tags the commit, and so on. However, some of the validation pipelines, especially in the later rings, take a long time to complete. So you might have 3 pipelines running: 2 that finish in 3 hours and a third that takes 7-8 hours. The first 2 finish fine, but the third gets cancelled after 6 hours.

Splitting them up or chaining them doesn't really solve the issue as the long pole will get cut off regardless. I don't think it's an ADO issue as the long-running pipelines are based off production pipelines which run fine, some for up to 16 hours.

huserben commented 2 years ago

Thanks for clarifying.

I'm gonna make a bold assumption, please correct me if I'm wrong :-) The pipeline that kicks things off is running on a hosted agent (as seen in your code snippet - vmImage: windows-latest) and the other pipelines run on self-hosted agents?

The reason I believe this is the following explanation in the docs:

> To avoid taking up resources when your job is unresponsive or waiting too long, it's a good idea to set a limit on how long your job is allowed to run. Use the job timeout setting to specify the limit in minutes for running the job. Setting the value to zero means that the job can run:
>
> - Forever on self-hosted agents
> - For 360 minutes (6 hours) on Microsoft-hosted agents with a public project and public repository
> - For 60 minutes on Microsoft-hosted agents with a private project or private repository (unless additional capacity is paid for)

So 360 minutes seems to be the limit for hosted agents, which would explain why raising timeoutInMinutes to 1440 had no effect. To solve your problem, I think you should use a self-hosted agent, as jobs there can run forever (set the timeout to 0 for the maximum time allowed).

My proposal would be to perhaps create a dedicated agent pool and add a new agent in there just to run this build - or at least a job that is just doing the waiting. This can easily be done on any machine as it's basically not needing any resources, so it should be fine to host this agent on a machine that already has another agent registered that does some "heavy lifting".
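
As a rough sketch of that proposal, the waiting job from the original snippet could be pointed at a self-hosted pool with the timeout set to 0. The pool name `WaitOnlyPool` is hypothetical; everything else is taken from the snippet posted above.

```yaml
- job: Build_Job
  dependsOn: GetCredentials_Job
  variables:
    PAT: $[ dependencies.GetCredentials_Job.outputs['PAT'] ]
  pool:
    name: WaitOnlyPool      # hypothetical self-hosted agent pool
  timeoutInMinutes: 0       # 0 = no limit on self-hosted agents
  steps:
  - checkout: none
  - task: benjhuser.tfs-extensions-build-tasks.trigger-build-task.TriggerBuild@3
    displayName: Trigger Synthetic Build
    inputs:
      # Remaining inputs unchanged from the original snippet.
      buildDefinition: 1234
      waitForQueuedBuildsToFinish: true
      waitForQueuedBuildsToFinishRefreshTime: 120
      authenticationMethod: 'Personal Access Token'
      password: '$(PAT)'
```

Only the `pool` and `timeoutInMinutes` values differ from the original job; the triggered pipelines themselves stay where they are.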

PaulieC1972 commented 2 years ago

Thanks for the suggestion, I'll look into doing that. I changed things up somewhat today. I was using a release pipeline to trigger a build pipeline that in turn was triggering sub-pipelines. I've changed that to the release pipeline calling each sub-pipeline discretely in its own stage. Perhaps that will help, we'll see.

huserben commented 2 years ago

Hi @PaulieC1972

Shall we close this issue, or do you want to keep it open? I think the original question (why the timeout occurs) is answered: it's a restriction imposed by Microsoft :-)

PaulieC1972 commented 2 years ago

I just finished a test with another agent pool for a build that lasted 10 hours, so I guess we can close this. Thanks again for the help. I'm loving this tool!

huserben / TfsExtensions

Trigger task timing out before triggered build is finished #212
