microsoft / azure-pipelines-agent

Azure Pipelines Agent
MIT License

[Question]: "The HTTP request timed out after 00:01:00" errors on all our self-hosted agents #4330

Closed kquinn-hai closed 3 months ago

kquinn-hai commented 1 year ago

Describe your question

We've noticed that each machine running a self-hosted agent in our environment logs Event 100 for the VstsAgentService source in the Windows event log many times each weekday, and rarely on the weekend. I'm wondering if anyone has an idea what might cause this. This is happening on both Server and Workstation operating systems and on multiple versions of the agent.
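For reference, here's a rough way to count these events per day (a sketch, assuming the VstsAgentService source writes to the Application log):

```powershell
# Count occurrences of Event ID 100 from the VstsAgentService source per day,
# to confirm the weekday-heavy pattern described above.
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; ProviderName = 'VstsAgentService'; Id = 100 } |
    Group-Object { $_.TimeCreated.Date } |
    Sort-Object Name |
    Format-Table Name, Count
```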

The error is:

"Agent connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected."

Each time this happens, the following is logged in the agent log file:

[2023-06-23 13:45:29Z WARN VisualStudioServices] Authentication failed with status code 401.
Cache-Control: no-cache
Pragma: no-cache
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
WWW-Authenticate: Bearer authorization_uri=https://login.microsoftonline.com/82815b8c-3775-4599-97c7-2b87528e6f83, Basic realm="https://tfsprodcus2.visualstudio.com/", TFS-Federated
X-TFS-ProcessId: 45964906-4ee2-4b94-afb8-fe37aaa9bae9
Strict-Transport-Security: max-age=31536000; includeSubDomains
ActivityId: 23fc1cfd-d6ea-4f0c-8a0f-dba279207054
X-TFS-Session: 79fd10f0-32d8-4136-bbdb-d6910bdfa8ce
X-VSS-E2EID: be22c208-2135-4cf5-b308-3606ed6b62e5
X-VSS-SenderDeploymentId: e3b46be9-fd8c-bada-74e4-ce06f10dcc5d
X-Frame-Options: SAMEORIGIN
X-TFS-FedAuthRealm: https://tfsprodcus2.visualstudio.com/
X-TFS-FedAuthIssuer: https://dev.azure.com/haigroupdev/
X-VSS-AuthorizationEndpoint: https://vssps.dev.azure.com/haigroupdev/
X-VSS-ResourceTenant: 82815b8c-3775-4599-97c7-2b87528e6f83
X-VSS-S2STargetService: 00000002-0000-8888-8000-000000000000/visualstudio.com
X-TFS-FedAuthRedirect: https://spsprodcus2.vssps.visualstudio.com/_signin?realm=dev.azure.com&reply_to=https%3A%2F%2Fdev.azure.com%2Fhaigroupdev%2F_apis%2Fdistributedtask%2Fpools%2F38%2Fmessages%3FsessionId%3Dd48f3710-c3ba-4390-9e04-7e737c1a32d8&redirect=1&hid=0d1f1ab2-bc98-469f-ad46-94d8d815f49b&context=eyJodCI6MiwiaGlkIjoiOTk0MzBjNTktMDVhOC00YWZjLWE0MjgtNWVlY2FmMTc4ZGQ1IiwicXMiOnt9LCJyciI6IiIsInZoIjoiIiwiY3YiOiIiLCJjcyI6IiJ90#ctx=eyJTaWduSW5Db29raWVEb21haW5zIjpbImh0dHBzOi8vbG9naW4ubWljcm9zb2Z0b25saW5lLmNvbSIsImh0dHBzOi8vbG9naW4ubWljcm9zb2Z0b25saW5lLmNvbSJdfQ2
Request-Context: appId=cid-v1:f0601cad-6271-4f14-ae54-9c8a0b5e0b9c
Access-Control-Expose-Headers: Request-Context
X-Content-Type-Options: nosniff
X-Cache: CONFIG_NOCACHE
X-MSEdge-Ref: Ref A: E3D607CE1CE34971AFB5F867A73194D8 Ref B: EWR30EDGE1014 Ref C: 2023-06-23T13:45:30Z
Date: Fri, 23 Jun 2023 13:45:30 GMT

[2023-06-23 13:45:29Z INFO RSAEncryptedFileKeyManager] Loading RSA key parameters from file C:\AzDevOpsAgent.HAI01P-Build01-NonProd-Test.credentials_rsaparams
[2023-06-23 13:45:30Z INFO VisualStudioServices] AAD Correlation ID for this token request: Unknown
[2023-06-23 14:01:21Z INFO MessageListener] No message retrieved from session 'd48f3710-c3ba-4390-9e04-7e737c1a32d8' within last 30 minutes.
[2023-06-23 14:27:23Z WARN VisualStudioServices] GET request to https://dev.azure.com/haigroupdev/_apis/distributedtask/pools/38/messages?sessionId=d48f3710-c3ba-4390-9e04-7e737c1a32d8 timed out after 60 seconds.
[2023-06-23 14:27:23Z ERR MessageListener] Catch exception during get next message.
[2023-06-23 14:27:23Z ERR MessageListener] System.TimeoutException: The HTTP request timed out after 00:01:00.
 ---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request..
 ---> System.Net.Sockets.SocketException (995): The I/O operation has been aborted because of either a thread exit or an application request.
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter)
   at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.AuthenticationHelper.SendWithNtAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean isProxyAuth, HttpConnection connection, HttpConnectionPool connectionPool, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at Microsoft.VisualStudio.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Microsoft.VisualStudio.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.VisualStudio.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Microsoft.VisualStudio.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
   at Microsoft.VisualStudio.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken)
   at Microsoft.VisualStudio.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
   at Microsoft.VisualStudio.Services.Agent.Listener.MessageListener.GetNextMessageAsync(CancellationToken token) in D:\a\1\s\src\Agent.Listener\MessageListener.cs:line 177
[2023-06-23 14:27:23Z INFO MessageListener] Retriable exception: The HTTP request timed out after 00:01:00.
[2023-06-23 14:27:23Z ERR Terminal] WRITE ERROR: 2023-06-23 14:27:23Z: Agent connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
[2023-06-23 14:27:23Z INFO AgentServer] Refresh MessageQueue VssConnection to get on a different AFD node.
[2023-06-23 14:27:23Z INFO AgentServer] Establish connection with 60 seconds timeout.
[2023-06-23 14:27:23Z INFO VisualStudioServices] Starting operation Location.GetConnectionData
[2023-06-23 14:27:24Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2023-06-23 14:27:24Z INFO MessageListener] Sleeping for 28.756 seconds before retrying.
[2023-06-23 14:28:43Z INFO Terminal] WRITE LINE: 2023-06-23 14:28:43Z: Agent reconnected.
[2023-06-23 14:32:03Z INFO MessageListener] No message retrieved from session 'd48f3710-c3ba-4390-9e04-7e737c1a32d8' within last 30 minutes.

Versions

2.214.1 / WinSvr2019 (Azure VM)
2.214.1 / WinSvr2012 (On-Prem VM)
2.214.1 / WinSvr2016 (On-Prem VM)
3.220.5 / Win10 (On-Prem VM)

Environment type (Please select at least one environment where you face this issue)

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Operating system

WinSvr2019, WinSvr2012, WinSvr2016, Win10

Version control system

Git

Azure DevOps Server Version (if applicable)

No response

vmapetr commented 1 year ago

Hi @kquinn-hai! Since the issue appears on several machines and on different agent versions (and I presume the agent is configured to run as a service), these errors most likely indicate general connectivity problems inside your network, or connection issues caused by proxy or firewall settings. The agent tries to connect to ADO every minute, and one or two "Authentication failed with status code 401" warnings are expected, but in this case it looks like the connection is being interrupted by some network throttling. Could you please check whether the network is stable?

Besides that, there is a possibility that a running job is consuming all available CPU/network capacity, which interrupts the agent connection and leads to timeouts. Could you please check whether these errors happen during pipeline execution or while the agents are idle?

Additionally, you can try applying the HttpTimeoutException workaround: https://github.com/microsoft/azure-pipelines-agent/blob/master/docs/troubleshooting.md#workaround-httptimeoutexception
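A minimal sketch of applying that workaround on Windows, assuming the VSTS_HTTP_RETRY / VSTS_HTTP_TIMEOUT variables described in that troubleshooting section (the values and the service-name pattern below are illustrative, not prescriptive):

```powershell
# Set the knobs machine-wide so the agent service can see them.
# VSTS_HTTP_TIMEOUT is in seconds; both values here are examples only.
[Environment]::SetEnvironmentVariable('VSTS_HTTP_RETRY', '10', 'Machine')
[Environment]::SetEnvironmentVariable('VSTS_HTTP_TIMEOUT', '300', 'Machine')

# Restart the agent service so it picks up the new machine-level variables.
# Agent service names typically follow the pattern vstsagent.<org>.<pool>.<agent>.
Get-Service 'vstsagent*' | Restart-Service
```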

kquinn-hai commented 1 year ago

Thank you for your reply @vmapetr.

Unfortunately, the firewall doesn't appear to be interrupting or blocking any of this traffic, and the network appears to be stable. We have plenty of on-prem resources reaching out to Azure without issue, although as far as I know nothing other than these agents reaches out to Azure DevOps. Our agents are not over-burdened either: most of their time is spent idle, and I've seen the reported error occur plenty of times while the machines were consuming only a small amount of resources.

I will try the HttpTimeoutException workaround on a couple of our machines to see if there is any noticeable effect.

KhazAkar commented 1 year ago

@kquinn-hai Hello. If you still have this issue, you can try placing a .proxy file containing your proxy server's URL in the main folder where the agent is installed, for example C:\Azure. I had the same problem as you, and creating the .proxy file and restarting the service fixed it for me. I hope it helps you as well :)
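A minimal sketch of that setup (the agent root path and proxy URL are hypothetical; substitute your own):

```powershell
# Write the proxy URL into a .proxy file in the agent root folder,
# then restart the agent service so it picks the proxy up.
Set-Content -Path 'C:\Azure\.proxy' -Value 'http://proxy.contoso.com:8080'
Get-Service 'vstsagent*' | Restart-Service
```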

kquinn-hai commented 1 year ago

@KhazAkar, is this file only present under certain conditions? I don't seem to have one.

KhazAkar commented 1 year ago

@kquinn-hai You need to create it yourself. It contains the URL of your company's proxy server :)

kquinn-hai commented 1 year ago

@KhazAkar, ah I suspected so. We don't use a proxy server, so I guess this option won't apply. Thanks for the suggestion!

KhazAkar commented 1 year ago

@kquinn-hai then I have another suggestion: does the user account used for the Azure Pipelines agent have the Log On As A Service right enabled? You can check by logging in to this PC over RDP and seeing whether the connection stays up. I'm assuming here that your agent runs on Windows and that you use a service to keep the connection from the PC to Azure going; see the sketch below for a quicker check.
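As a sketch, you can export the local security policy from an elevated prompt and look for the service account's SID under SeServiceLogonRight:

```powershell
# Export the local security policy (requires an elevated prompt), then search
# for SeServiceLogonRight, the entry behind the "Log on as a service" assignment.
secedit /export /cfg "$env:TEMP\secpol.inf" | Out-Null
Select-String -Path "$env:TEMP\secpol.inf" -Pattern 'SeServiceLogonRight'
```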

kquinn-hai commented 1 year ago

@vmapetr, I just wanted to report back and say that I did try the HttpTimeoutException workaround on several of our machines, but it unfortunately doesn't appear to have any effect. I want to make sure I did this correctly, though. Am I correct to understand that these variable values should be stored as OS-level environment variables?

kquinn-hai commented 1 year ago

@KhazAkar, thanks for the additional suggestion. Our pipeline agents do run as Active Directory accounts, and Log On as a Service is already enabled for those accounts.

KhazAkar commented 1 year ago

@kquinn-hai then I'm out of ideas. Enabling Log On As A Service plus adding the proxy configuration helped in my case, at least until the IT team reset the Log On As A Service right on the account I had plugged into the Windows service :)

Matthew0x commented 11 months ago

I noticed a similar problem with a totally different setup.

In my case, the agent's Linux environment does not fully utilize its memory or CPU cores; according to the logs there is still headroom. Bandwidth usage is negligible, yet running pipeline jobs will sometimes hit API timeouts and errors because responses (from various Microsoft endpoints) don't arrive in time. It happens completely at random, on different request types.

In my case the agents are mostly used to run deployment APIs. The connection-lost status does not seem to correlate with the failed API responses (perhaps that's coincidental), but regardless, it is there.

"The agent is trying to connect with ADO every minute, and it is expected to get one or two Authentication failed with status code 401." Is this mechanism by design? Is the agents software supposed to show this log during normal operations? Why would it drop the connectivity, if the connectivity is normally working, is this disruption/delay on the ADO's side?

rlandrum commented 10 months ago

@Matthew0x @kquinn-hai this is a long shot, but are your agents running within a VNET? I've run into this issue when running self-hosted ADO agents within an Azure VNET.

We resolved it by deploying a NAT gateway and associating it with the respective subnet where the agents reside. This gives the agents a dedicated outbound IP address when connecting to ADO. We rarely see timeouts now; prior to this, we'd see multiple timeouts per hour on multiple agents.
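For anyone who wants to replicate this, a rough Azure CLI sketch (all resource names here are hypothetical; adjust to your environment):

```powershell
# Create a static Standard-SKU public IP for the NAT gateway's outbound traffic.
az network public-ip create -g rg-agents -n pip-agents-nat --sku Standard

# Create the NAT gateway and attach the public IP.
az network nat gateway create -g rg-agents -n natgw-agents --public-ip-addresses pip-agents-nat

# Associate the NAT gateway with the agents' subnet so their outbound
# connections to ADO use the dedicated IP.
az network vnet subnet update -g rg-agents --vnet-name vnet-agents -n snet-agents --nat-gateway natgw-agents
```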

Matthew0x commented 10 months ago

Yes, it's on a private network. I think you could be right that assigning a static public IP behind a firewall/NAT could help, and that such traffic might somehow be prioritized...

I found a piece of documentation that says: "The public IPv4 address used for the access is called the default outbound access IP. This IP is implicit and belongs to Microsoft. This IP address is subject to change and it's not recommended to depend on it for production workloads." Not much explanation provided.

fuyong2022 commented 10 months ago

We also hit the same issue with Azure Container Instances. It takes a long time to recover connectivity, causing the Azure DevOps pipeline to fail. Meanwhile, when the issue happens, the container can't reach the Azure storage account either.

This confirms it is an outbound connectivity issue on the Azure Container Instances side.

peterjonknl commented 9 months ago

I can confirm the timeout behaviour in container instances. Sometimes it took up to 10 minutes before a pipeline job was started. Today I configured a NAT gateway with a pool of 8 static outbound IPs, so I will monitor whether the issue persists. This solution, however, involves extra cost and added complexity.

github-actions[bot] commented 3 months ago

This issue has had no activity in 180 days. Please comment if it is not actually stale.

AkechiShiro commented 2 months ago

Issue should be reopened as it is still relevant today. @vmapetr

KhazAkar commented 2 months ago

It's permanently relevant.

sudheeshkj commented 2 weeks ago

We are still facing this issue with the Azure Container Instances image.