microsoft / azure-pipelines-agent

Azure Pipelines Agent 🚀
MIT License

[BUG]: High rate of "We stopped hearing from agent" errors for web-platform-tests. #4313

Open jgraham opened 1 year ago

jgraham commented 1 year ago

What happened?

Since approximately May 16th, we've been experiencing a high failure rate for web-platform-tests jobs running on macOS 13. This appears to be an infrastructure issue, as we get a message indicating that the agent stopped responding. It affects some, but not all, jobs, and it appears to be random within a set of jobs running similar workloads (chunks of the test suite) on macOS. It doesn't appear to be tied to a specific part of the workload (e.g. a specific test case).

One of the first affected builds is: https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=100660. A recent one is https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=101901

Manually rerunning the failed jobs does work, but some jobs require multiple reruns, since the problem can also happen during the rerun.

We've tried to resolve the problem in the following ways:

(cc @gsnedders who did most of the diagnosis work to date)

https://github.com/web-platform-tests/wpt/issues/40085 is the corresponding wpt repository issue

Versions

macOS-13

Environment type (Please select at least one environment where you face this issue)

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

No response

Operating system

No response

Version control system

No response

Relevant log output

##[error]We stopped hearing from agent Azure Pipelines 11. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Pool: Azure Pipelines
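
To tally which jobs hit this error across many saved job logs, a minimal Python sketch along these lines may help. It assumes the log text has already been downloaded locally; the function name `find_lost_agents` is hypothetical, and the pattern is based only on the error line quoted above, so it may need adjusting if the wording differs.

```python
import re
from typing import List

# Pattern derived from the single error line quoted above (an assumption:
# the exact wording may vary between Azure Pipelines versions).
LOST_AGENT_RE = re.compile(
    r"##\[error\]We stopped hearing from agent (.+?)\. Verify"
)


def find_lost_agents(log_text: str) -> List[str]:
    """Return the agent names named in 'We stopped hearing from agent' errors."""
    return LOST_AGENT_RE.findall(log_text)
```

Calling `find_lost_agents` on each job's log and counting the non-empty results gives a rough failure rate per image or per chunk.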

DmitriiBobreshev commented 1 year ago

Hi @jgraham, thank you for the feedback. Based on the error message, the issue is not related to the agent itself but to the Microsoft-hosted pool. Could you please create the issue in the runner-images repository?

To speed up the process, you could also create a ticket on Developer Community.

Blue101black commented 12 months ago

Hi @jgraham did you manage to get any resolution for this?

We are having the same issue on Windows-2022. It's very annoying because it's inconsistent and a re-run doesn't always fix it.

ryanps1 commented 7 months ago

Also experiencing this issue with the Microsoft-hosted Ubuntu pools (I've tried them all).

Az8th commented 2 months ago

We had this problem occurring for several months, and it was fixed by simply turning off auto-updates for agents.

I caught the agent trying to download and install a previous version (the one packaged with its corresponding Azure DevOps version). There seems to be an undocumented behaviour around failing tasks that triggers a backup if the agent was downloaded from a source other than Azure (like GitHub).

Hope it fixes your issue too ;)

patrick-13x commented 1 month ago

How do you manage to turn off auto-updates on Azure DevOps Server 2022?