microsoft / azure-pipelines-agent

Azure Pipelines Agent 🚀
MIT License
1.72k stars 866 forks source link

[BUG]: VMSS agents go offline unexpectedly due to 401 auth error #5023

Open angaaruriakhil opened 5 days ago

angaaruriakhil commented 5 days ago

What happened?

Context: We're hosting VMSS in Azure with images built with runner-images code as Azure DevOps agents. This is the source code for Azure DevOps MS hosted agents. We use the operating systems: Windows 2022, Windows 2019, Ubuntu 22.04, Ubuntu 20.04 and they are all affected by this issue. The issue happens intermittently across scale sets hosted in multiple regions across all the operating systems mentioned.

We intermittently are seeing VMs failing with the error for at least the last month in the Diagnostics tab for each pool in Azure DevOps:

Pipeline agent went offline unexpectedly

Which then will cause the VM to go offline and skip the jobs it may be running. This is causing big problems for us as VMs unexpectedly go offline and if unlucky, this could be while they are running a business critical job. This has been happening for at least a month.

We have checked our proxy/all firewalls for networking blocks and there are no blocks reported from our VMSS subnets to any destination or port. All the outbound traffic being executed is allowed.

Saving an unhealthy Ubuntu 22.04 agent for investigation and investigating the logs under /agent/_diag shows that there is a 401 error (scrubbed excepts attached in log box, I don't want to share sensitive information). See the logs box for relevant logs.

We have similarly looked at the log files under:

All of which report nothing out of the ordinary.

There are similar issues reporting this problem here. https://github.com/microsoft/azure-pipelines-agent/issues/4826

As reported in https://github.com/microsoft/azure-pipelines-agent/issues/4826 , running the ./run.sh --diagnostics command, also reports an error writing to the log.

Versions

Azure DevOps Services Images built with runner-images on October 15th 2024 Azure Pipelines Agent v3.246.0 WA Linux Agent v2.11.1.12

Environment type (Please select at least one enviroment where you face this issue)

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

No response

Operation system

Ubuntu 22.04. Ubuntu 20.04, Windows 2022, Windows 2019

Version controll system

No response

Relevant log output

[2024-10-25 10:35:40Z WARN VisualStudioServices] GET request to https://dev.azure.com/{scrubbed}/_apis/distributedtask/pools/1166/messages timed out after 60 seconds.
[2024-10-25 10:36:15Z WARN VisualStudioServices] GET request to https://dev.azure.com/{scrubbed}/_apis/distributedtask/pools/1166/messages timed out after 60 seconds.
[2024-10-25 10:36:15Z ERR  MessageListener] System.TimeoutException: The HTTP request timed out after 00:01:00.
[2024-10-25 10:36:15Z INFO MessageListener] Retriable exception: The HTTP request timed out after 00:01:00.
[2024-10-25 10:36:15Z ERR  Terminal] WRITE ERROR: 2024-10-25 10:36:15Z: Agent connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
[2024-10-25 11:20:38Z INFO MessageListener] Sleeping for 13.802 seconds before retrying.
[2024-10-25 11:21:42Z INFO MessageListener] Sleeping for 7.404 seconds before retrying.
[2024-10-25 11:21:43Z INFO MessageListener] Sent GetAgentMessage to keep alive agent 926980, session '{scrubbed}'.
[2024-10-25 11:22:13Z WARN VisualStudioServices] Authentication failed with status code 401.

[2024-10-25 11:31:36Z INFO MessageListener] No message retrieved from session '{scrubbed}' within last 30 minutes. Results from ./run.sh --diagnostics:

System.UnauthorizedAccessException: Access to the path '/agent/_diag/Agent_20241025-190446-utc.log' is denied. ---> System.IO.IOException: Permission denied --- End of inner exception stack trace --- at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func2 errorRewriter) at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode) at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize) at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize) at Microsoft.VisualStudio.Services.Agent.HostTraceListener.CreatePageLogWriter() in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 178 at Microsoft.VisualStudio.Services.Agent.HostTraceListener..ctor(String logFileDirectory, String logFilePrefix, Int32 pageSizeLimit, Int32 retentionDays) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 50 at Microsoft.VisualStudio.Services.Agent.HostContext..ctor(HostType hostType, String logFile) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostContext.cs:line 135 at Microsoft.VisualStudio.Services.Agent.Listener.Program.Main(String[] args) in /mnt/vss/_work/1/s/src/Agent.Listener/Program.cs:line 28 ./run.sh: line 68: 24614 Aborted (core dumped) "$DIR"/bin/Agent.Listener run $* `

DenisNikulin5 commented 1 day ago

Hi @angaaruriakhil The agent should acquire a new token if the old one is expired

Do you see any attempts in the logs before the VM is terminated? It should look like this:

[2024-10-28 12:39:06Z WARN VisualStudioServices] Authentication failed with status code 401.
[2024-10-28 12:39:06Z VERB VisualStudioServices] Started authentication
[2024-10-28 12:39:06Z VERB VisualStudioServices] OAuth issued token provider instance 4920083 is acquiring a token
...
[2024-10-28 12:39:07Z VERB VisualStudioServices] OAuth issued token provider instance 4920083 acquired new token instance 38778824
[2024-10-28 12:39:07Z VERB VisualStudioServices] Finished authentication
[2024-10-28 12:39:07Z VERB VisualStudioServices] OAuth issued token provider instance 4920083 validated token instance 38778824
angaaruriakhil commented 1 day ago

Hi @DenisNikulin5 , No I don't see that in the logs.

It basically loops through this block of code and goes back to 401.

Let me know if more details are required, I scrubbed a lot to protect the details.

[2024-10-24 19:42:05Z WARN VisualStudioServices] Authentication failed with status code 401.
Cache-Control: no-cache
Pragma: no-cache
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
WWW-Authenticate: Bearer authorization_uri=https://login.microsoftonline.com/{scrubbed}, Basic realm="https://tfsprodcus6.visualstudio.com/", TFS-Federated
X-TFS-ProcessId: {scrubbed}
Strict-Transport-Security: max-age=31536000; includeSubDomains
ActivityId:  {scrubbed}
X-TFS-Session:  {scrubbed}
X-VSS-E2EID:  {scrubbed}
X-VSS-SenderDeploymentId:  {scrubbed}
X-Frame-Options: SAMEORIGIN
X-TFS-FedAuthRealm: https://tfsprodcus6.visualstudio.com/
X-TFS-FedAuthIssuer: https://dev.azure.com/{scrubbed}
X-VSS-AuthorizationEndpoint: https://vssps.dev.azure.com/{scrubbed}
X-VSS-ResourceTenant:  {scrubbed}
X-VSS-S2STargetService:  {scrubbed}/visualstudio.com
X-TFS-FedAuthRedirect: https://spsprodcus5.vssps.visualstudio.com/_signin?realm={scrubbed}
Request-Context: appId=cid-v1:{scrubbed}
Access-Control-Expose-Headers: Request-Context
X-Content-Type-Options: nosniff
X-Cache: CONFIG_NOCACHE
X-MSEdge-Ref: Ref A: {scrubbed} Ref B: CHGEDGE1009 Ref C: 2024-10-24T19:42:03Z
Date: Thu, 24 Oct 2024 19:42:03 GMT

[2024-10-24 19:42:03Z INFO RSAFileKeyManager] Loading RSA key parameters from file /agent/.credentials_rsaparams
[2024-10-24 19:42:03Z INFO RSAFileKeyManager] Loading RSA key parameters from file /agent/.credentials_rsaparams
[2024-10-24 19:42:04Z INFO VisualStudioServices] AAD Correlation ID for this token request: Unknown
[2024-10-24 19:42:04Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2024-10-24 19:42:04Z INFO AgentServer] Establish connection with 60 seconds timeout.
[2024-10-24 19:42:04Z INFO VisualStudioServices] Starting operation Location.GetConnectionData
[2024-10-24 19:42:04Z INFO AgentServer] Establish connection with 60 seconds timeout.
[2024-10-24 19:42:04Z INFO VisualStudioServices] Starting operation Location.GetConnectionData
[2024-10-24 19:42:04Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2024-10-24 19:42:05Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2024-10-24 19:42:05Z INFO CommandSettings] Arg 'work': ''
[2024-10-24 19:42:05Z INFO CommandSettings] Flag 'unattended': 'True'
[2024-10-24 19:42:05Z INFO PromptManager] ReadValue
[2024-10-24 19:42:05Z INFO CommandSettings] Flag 'disableloguploads': 'False'
[2024-10-24 19:42:05Z INFO CommandSettings] Flag 'restreamlogstofiles': 'False'
[2024-10-24 19:42:05Z INFO CommandSettings] Flag 'alwaysextracttask': 'False'
[2024-10-24 19:42:05Z INFO ConfigurationStore] Saving agent settings.
[2024-10-24 19:42:05Z INFO ConfigurationStore] Settings Saved.
[2024-10-24 19:42:05Z INFO Terminal] WRITE LINE: 2024-10-24 19:42:05Z: Settings Saved.
[2024-10-24 19:42:05Z INFO SystemDControlManager] Service name 'vsts.agent. {scrubbed}.Scale Set Pool - Linux {scrubbed}.vmsslnap{scrubbed}.service' display name 'Azure Pipelines Agent ({scrubbed}.Scale Set Pool - Linux {scrubbed}.vmsslnap{scrubbed})' will be used for service configuration.
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Bin': '/agent/bin'
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Root': '/agent'
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Bin': '/agent/bin'
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Bin': '/agent/bin'
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Root': '/agent'
[2024-10-24 19:42:05Z INFO UnixUtil] Which: 'chmod'
[2024-10-24 19:42:05Z INFO UnixUtil] Location: '{matches.First()}'
[2024-10-24 19:42:05Z INFO UnixUtil] Running /usr/bin/chmod 755 "/agent/svc.sh"
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Starting process:
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   File name: '/usr/bin/chmod'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Arguments: '755 "/agent/svc.sh"'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Working directory: '/agent'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Require exit code zero: 'True'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Encoding web name:  ; code page: ''
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Force kill process on cancellation: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Redirected STDIN: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Persist current code page: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Keep redirected STDIN open: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   High priority process: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   ContinueAfterCancelProcessTreeKillAttempt: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Sigint timeout: '00:00:07.5000000'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Sigterm timeout: '00:00:02.5000000'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Try to use graceful shutdown: False
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Updated oom_score_adj to 500 for PID: 4583.
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Process started with process id 4583, waiting for process exit.
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Exited process 4583 with exit code 0
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Finished process 4583 with exit code 0, and elapsed time 00:00:00.0023222.
[2024-10-24 19:42:05Z INFO ConfigurationStore] HasCredentials()
[2024-10-24 19:42:05Z INFO ConfigurationStore] stored True
[2024-10-24 19:42:05Z INFO CredentialManager] GetCredentialProvider
[2024-10-24 19:42:05Z INFO CredentialManager] Creating type OAuth
[2024-10-24 19:42:05Z INFO CredentialManager] Creating credential type: OAuth
[2024-10-24 19:42:05Z INFO RSAFileKeyManager] Loading RSA key parameters from file /agent/.credentials_rsaparams
[2024-10-24 19:42:05Z INFO ConfigurationManager] LoadSettings
[2024-10-24 19:42:05Z INFO ConfigurationStore] IsConfigured()
[2024-10-24 19:42:05Z INFO ConfigurationStore] IsConfigured: True
[2024-10-24 19:42:05Z INFO ConfigurationManager] Is configured: True
[2024-10-24 19:42:05Z INFO ConfigurationStore] Read setting file: 225 chars
[2024-10-24 19:42:05Z INFO ConfigurationManager] Settings Loaded
[2024-10-24 19:42:05Z WARN VisualStudioServices] Authentication failed with status code 401.