[Bug]: Pipelines sometimes cancel runs

Trenly commented 2 years ago

Brief description of your issue

Occasionally, when submitting PRs, the pipelines will cancel the run immediately. This is seen more frequently when submitting multiple PRs in a row.

Steps to reproduce

Unknown exactly how to reproduce. Open 10-20 PRs in the span of 20 minutes, and I would expect at least one to cancel itself.

Expected behavior

PRs will not cancel themselves

Actual behavior

The PR displays as if the checks failed, and must be closed/reopened for the pipelines to be run against it

Environment

N/A

jedieaston commented 2 years ago

I've had this happen even without there being load. I don't know if it's a winget-pkgs specific issue or something with ADO, but the other curious thing is that the run never even really started; there aren't any logs even though it ran for a minute.

Maybe it gave up trying to find an available runner? But that should at least have an error to explain, according to the docs.

vedantmgoyal9 commented 2 years ago

I guess it is a bug in azure-pipelines-bot.

jedieaston commented 2 years ago

It's not the bot, because the job is being started in ADO; it's just being cancelled shortly after. It is curious.

Trenly commented 2 years ago

@denelon - Would it be possible to tag this Moderator-Approved, since it causes mods to have to tell people to close and reopen their PR for pipeline runs?

Also, since it seems to be a bit more prevalent than I initially thought, would it be possible to get some priority put on an initial investigation, even if it doesn't lead to an immediate fix? Knowing there is a workaround makes a solution less critical, but I believe it may take a few cycles to implement a fix once the root cause is identified, especially if this involves other teams at MSFT.

denelon commented 2 years ago

I've asked one of the engineers to take a look at this one.

Trenly commented 2 years ago

> I've asked one of the engineers to take a look at this one.

Thank you. I've linked as many occurrences as I've seen for their reference, in case it helps in finding the root cause.

Trenly commented 2 years ago

@Denelon - Just curious whether anything was found

denelon commented 2 years ago

It looks like we have a few areas with "flaky" behavior across the GitHub and Azure DevOps boundaries that make this a bit tricky to troubleshoot. We've added automated monitoring and alerting so we can capture more logs and see if we can find a deeper root cause. Currently, the best practice on our end is to configure multiple retries, and we built a job that runs once every two hours to check the build status of all open PRs so we can re-trigger them.
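
For readers following along, the periodic check described above can be pictured roughly as below. This is only a sketch under stated assumptions, not the actual winget-pkgs service code: the `PullRequestStatus` record, the state names, `MAX_RETRIGGERS`, and `GRACE_PERIOD` are all hypothetical. Only the `/azp run` comment is taken from this thread. The sketch decides which open PRs a sweep would re-trigger and leaves the actual GitHub and Azure DevOps calls out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

# Hypothetical record of what a sweep would know about each open PR.
@dataclass
class PullRequestStatus:
    number: int
    last_build_state: str      # assumed values: "succeeded", "failed", "canceled", "none"
    last_activity: datetime    # when the last build ended or the PR was last updated
    retrigger_count: int = 0   # how many times the sweep has already re-triggered it

RETRIGGER_COMMENT = "/azp run"         # the comment used in this thread
MAX_RETRIGGERS = 3                     # assumed cap, so a truly broken PR is not retried forever
GRACE_PERIOD = timedelta(minutes=10)   # assumed delay, so in-flight runs are left alone

def needs_retrigger(pr: PullRequestStatus, now: datetime) -> bool:
    """Return True when a PR looks stuck: its run was canceled or never recorded."""
    stuck = pr.last_build_state in {"canceled", "none"}
    settled = now - pr.last_activity >= GRACE_PERIOD
    return stuck and settled and pr.retrigger_count < MAX_RETRIGGERS

def plan_sweep(open_prs: Iterable[PullRequestStatus], now: datetime) -> List[Tuple[int, str]]:
    """Return (PR number, comment to post) pairs for every PR the sweep would re-trigger."""
    return [(pr.number, RETRIGGER_COMMENT) for pr in open_prs if needs_retrigger(pr, now)]

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        PullRequestStatus(1, "canceled", now - timedelta(hours=1)),
        PullRequestStatus(2, "succeeded", now - timedelta(hours=1)),
    ]
    for number, comment in plan_sweep(sample, now):
        print(f"PR #{number}: would post {comment!r}")
```

A cap such as `MAX_RETRIGGERS` is one possible reason a PR could stay stuck despite an automated retry, though nothing in this thread confirms that the real job works this way.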

OfficialEsco commented 2 years ago

@Trenly's PRs have been untouched for 15 days tho 🤔

denelon commented 2 years ago

Thanks @OfficialEsco! I found 5 older PRs that were stuck and informed the team which ones they were so they can figure out why those weren't being captured with the automated retry. I "/azp run" triggered the builds on them so they should complete now.

ndbeals commented 2 years ago

I was able to re-trigger the pipelines after pushing a single minor commit to #55240 after a little waiting period (though IDK if waiting helped).

If it helps: It seems that the pipelines in #55240 failed because I committed twice straight from the GitHub UI in rapid succession (wingetbot commented /AzurePipelines run twice within seconds). I was accepting changes suggested by a reviewer and there appears to be no "batch suggestion commits together".

Trenly commented 2 years ago

> I was able to re-trigger the pipelines after pushing a single minor commit to #55240 after a little waiting period (though IDK if waiting helped).

Any commit will re-trigger, as will closing and reopening the PR, and the bot has a cleaner function which automatically picks up PRs in a bad state and reruns them.

> If it helps: It seems that the pipelines in #55240 failed because I committed twice straight from the GitHub UI in rapid succession (wingetbot commented /AzurePipelines run twice within seconds).

Commit frequency doesn’t seem to affect it. I’ve had it happen on my very first commit of the day, and I’ve also had days where I make hundreds of commits in quick succession and it never happens. Part of this is because it is built around a microservices architecture, so if the message gets dropped at any point in the chain it can cause issues.

> I was accepting changes suggested by a reviewer and there appears to be no "batch suggestion commits together".

If you go to the files tab, you can batch them from there.

OfficialEsco commented 2 years ago

IMO we can narrow the issue down to it always being the WinGetSvc-Validation (Pull Request Validation) job, and probably the first thing the job does, since it stops at <1s: https://github.com/microsoft/winget-pkgs/blob/e8b15f89c5e02f143394149ab9c6f808155523cb/DevOpsPipelineDefinitions/validation-pipeline.yaml#L17-L96

Edit: Could this be the culprit, even tho it's a very low chance? 5 months ago https://github.com/microsoft/winget-pkgs/blame/master/DevOpsPipelineDefinitions/validation-pipeline.yaml#L7-L13 https://github.com/microsoft/winget-pkgs/blob/be181995d95576afd3683fcd27e248623ab62611/DevOpsPipelineDefinitions/validation-pipeline.yaml#L7-L13

Trenly commented 3 months ago

> I've had this happen even without there being load.

Just saw this issue again in #164770, so it is still something that happens. Definitely not a lot of load on the system at the time either, which is curious
