Open Trenly opened 2 years ago
I've had this happen even without there being load. I don't know if it's a winget-pkgs specific issue or something with ADO, but the other curious thing is that the run never even really started; there aren't any logs even though it ran for a minute.
Maybe it gave up trying to find an available runner? But according to the docs, that should at least produce an error explaining why.
I guess it is a bug in azure-pipelines-bot.
It's not the bot, because the job is being started in ADO; it's just being cancelled shortly after. It is curious.
@denelon - Would it be possible to tag this Moderator-Approved, since it causes mods to have to tell people to close and reopen their PR for pipeline runs?
Also, since it seems to be a bit more prevalent than I initially thought, would it be possible to get some priority put on an initial investigation, even if it doesn't lead to an immediate fix? Knowing there is a workaround for this makes a solution less critical, but I believe that it may take a few cycles to implement a fix once the root cause is identified, especially if this involves other teams at MSFT
I've asked one of the engineers to take a look at this one.
Thank you. I've left as many as I've seen linked for their reference, in case it helps in finding the root cause
@Denelon - Just curious whether anything was found.
It looks like we have a few areas with "flaky" behavior going across the GitHub and Azure DevOps boundaries that make this a bit tricky to troubleshoot. We've added automated monitoring and alerting so we can capture more logs to see if we can find a deeper root cause. Currently, the best practice on our end is to configure multiple retries and we built a job that runs once every two hours to check all open PRs for the status of builds so we can re-trigger them.
@Trenly's PRs have been untouched for 15 days tho 🤔
Thanks @OfficialEsco! I found 5 older PRs that were stuck and informed the team which ones they were so they can figure out why those weren't being captured with the automated retry. I "/azp run" triggered the builds on them so they should complete now.
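For anyone trying to picture the periodic re-trigger job mentioned above, here is a minimal sketch of how stuck PRs might be selected. It is not the bot's actual implementation: the `PullRequestStatus` record, its field names, the `needs_retrigger` and `select_stuck` helpers, and both thresholds are hypothetical. Only the "/azp run" comment comes from this thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# The comment used in this thread to re-run the pipelines on a PR.
RETRIGGER_COMMENT = "/azp run"


@dataclass
class PullRequestStatus:
    """Hypothetical snapshot of a PR's validation state (illustrative fields only)."""
    number: int
    check_state: str         # "success", "failure", "cancelled", "pending", or "" if no run exists
    last_run_seconds: float  # duration of the most recent validation run
    updated_at: datetime     # last activity on the PR


def needs_retrigger(pr, now,
                    min_age=timedelta(minutes=10),   # assumed grace period
                    instant_cancel_seconds=60.0):    # assumed "cancelled right away" cutoff
    """Return True if the PR looks stuck and should get a re-trigger comment."""
    if now - pr.updated_at < min_age:
        return False  # too recent: a run may still be queued
    if pr.check_state == "":
        return True   # no run was ever recorded
    # A run that was cancelled almost immediately matches the symptom in this issue.
    return pr.check_state == "cancelled" and pr.last_run_seconds <= instant_cancel_seconds


def select_stuck(prs, now):
    """Return the numbers of all PRs that should receive RETRIGGER_COMMENT."""
    return [pr.number for pr in prs if needs_retrigger(pr, now)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        PullRequestStatus(1, "cancelled", 0.5, now - timedelta(hours=1)),
        PullRequestStatus(2, "success", 420.0, now - timedelta(hours=1)),
    ]
    for number in select_stuck(sample, now):
        print(f"PR #{number}: would comment {RETRIGGER_COMMENT!r}")
```

A real job would fetch the open PRs and their check status from the GitHub and Azure DevOps APIs and post the comment. That part is left out here because nothing in this thread describes how the bot does it.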
I was able to re-trigger the pipelines after pushing a single minor commit to #55240 after a little waiting period (though IDK if waiting helped).
If it helps: It seems that the pipelines in #55240 failed because I committed twice straight from the GitHub UI in rapid succession (wingetbot commented "/AzurePipelines run" twice within seconds). I was accepting changes suggested by a reviewer and there appears to be no "batch suggestion commits together".
Any commit will re-trigger, so will closing and reopening the PR, and the bot has a cleaner function which automatically picks up PRs in a bad state and reruns them
Commit frequency doesn't seem to affect it. I've had it happen on my very first commit of the day, and I've also had days where I make hundreds of commits in quick succession and it never happens. Part of this is because the system is built around a microservices architecture, so if a message gets dropped at any point in the chain it can cause issues.
If you go to the Files changed tab, you can batch the suggested changes from there.
IMO we can narrow the issue down to it always being the WinGetSvc-Validation (Pull Request Validation) job, and probably the first thing the job does, since it stops at <1s
https://github.com/microsoft/winget-pkgs/blob/e8b15f89c5e02f143394149ab9c6f808155523cb/DevOpsPipelineDefinitions/validation-pipeline.yaml#L17-L96
Edit: Could this be the culprit, even tho it's a very low chance? 5 months ago https://github.com/microsoft/winget-pkgs/blame/master/DevOpsPipelineDefinitions/validation-pipeline.yaml#L7-L13 https://github.com/microsoft/winget-pkgs/blob/be181995d95576afd3683fcd27e248623ab62611/DevOpsPipelineDefinitions/validation-pipeline.yaml#L7-L13
Just saw this issue again in #164770, so it is still something that happens. Definitely not a lot of load on the system at the time either, which is curious
Brief description of your issue
Occasionally, when submitting PRs, the pipelines will cancel the run immediately. This is seen more frequently when submitting multiple PRs in a row.
Steps to reproduce
Unknown exactly how to reproduce. Open 10-20 PRs in the span of 20 minutes, and I would expect at least 1 to cancel itself.
Expected behavior
PRs will not cancel themselves
Actual behavior
The PR displays as if the checks failed, and must be closed/reopened for the pipelines to be run against it
Environment