Open jwikman opened 5 months ago
Same here, except in our case the agents will just time out jobs before they generate the script. Reboots do help, but only for a short amount of time before things start hanging up again. Ironic that the WMIC call seems to be meant to measure CPU but running 50 of them in parallel is capping CPU.
Hi @jwikman, thanks for reporting. It looks like the agent tries to get memory usage for each context, and since there are a lot of contexts in your case, it may produce more WMIC processes than expected. We will continue to investigate.
Hi @ismayilov-ismayil I've uploaded some logs to this problem reported on DevCommunity, but they are not public. Can you access them as a MS representative somehow, or can I send you the logs in a non-public way?
I was running a very simple "Hello World" pipeline at the same time that we were experiencing this issue, with System Diagnostics enabled.
I ran v3.232.4 for a few hours. 50+ pipelines with 1000+ tasks being run - All looked very good.
Then I upgraded back to v3.236.1, and now all the WMIC processes are back; hundreds of processes are being executed almost all the time. I also see a lot of CPU spikes that weren't there on v3.232.4.
After looking into the code, I thought I could set AZP_ENABLE_RESOURCE_UTILIZATION_WARNINGS = false as a user-defined capability for the agents, to disable the ResourceUtilizationWarnings that I thought were the reason behind my problems. (I got that from #4317)
But it seems as if that won't work, I still see this in the Initialize job task:
Knob: EnableResourceUtilizationWarnings = true Source: $(AZP_ENABLE_RESOURCE_UTILIZATION_WARNINGS)
How do I disable such a feature?
Meanwhile, I'll revert back to v3.232.4.
Last night I updated to v3.238.0 to see if anything had changed. But after three agents were updated, the CPU started to hit 100% about half the time. And it seems that this was due to all the WMIC processes...
@ismayilov-ismayil any progress on this issue? Actually, it seems to be two issues, should I split this one into two?
Right now, I cannot test the first issue, due to the second.
Hi @jwikman, we are working on this; you can follow the progress on the Developer Community portal - https://developercommunity.visualstudio.com/t/PowerShell-task-idles-before-generating-/10636846
Ok, thanks @ismayilov-ismayil. Then I know this has your attention, and we'll continue on the Developer Community portal.
@jwikman, we rolled back recent changes that may be the root cause of the issue. Please let us know if the problem still bothers you.
@KonstantinTyukalov, in which version of the agent can I expect this to be included? And is there a PR where I can see the change? :)
@jwikman it should work with your current agent, as we disabled the server-side feature flag. We'll prepare a long-term fix soon.
@KonstantinTyukalov aah, ok. Was that the AZP_ENABLE_RESOURCE_UTILIZATION_WARNINGS knob that I was trying to set? If so, isn't it possible to set those knobs at the agent level? I thought it was possible, after reading #4317...
I'll test to upgrade our agents later tonight.
This is the current status:
All agents are now running v3.238.0. The WMIC processes are gone, so I guess that your feature flag change did this. Thanks for that! (Still curious about how I can set those feature flags myself...? 🤔)
So now we can focus on the original problem: The delays when the PowerShell tasks are creating scripts.
After I updated all agents, I could still see a lot of these delays, from 40 s to several minutes.
Hence, I had to restart the build server, and after the reboot the delays are negligible again.
Let's see how this progresses over the next few days.
@jwikman
> Still curious about how I can set those feature flags myself...?
Unfortunately, you can't disable this feature yourself right now. We are working on improving this perf monitor, and if it becomes possible to toggle it manually, we'll let you know.
Regarding the original issue: this looks like a delay in the PowerShell handler's work to prepare the task for execution.
In my repro, there is only an ~8 second delay between these 2 log lines.
Could you add the VSTSAGENT_TRACE environment variable to the agent process for verbose tracing and re-send your _diag logs in private messages of the Developer Community ticket?
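For anyone following along, here is a sketch of one way to set this variable for a self-hosted agent. As far as I know, the agent reads a `.env` file (one `VAR=value` per line) from its root directory at startup; the `agent-demo` directory below is illustrative, not an actual agent path.

```shell
# Sketch: enable verbose agent tracing via a .env file in the agent root.
# Assumption: the agent loads VAR=value lines from .env at startup;
# "agent-demo" stands in for the real agent install directory.
AGENT_DIR=agent-demo
mkdir -p "$AGENT_DIR"
printf 'VSTSAGENT_TRACE=true\n' >> "$AGENT_DIR/.env"
cat "$AGENT_DIR/.env"
# Restart the agent (or its Windows service) so the variable takes effect.
```

Alternatively, setting the variable machine-wide before the agent service starts should have the same effect.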
> Unfortunately, you can't disable this feature yourself right now. We are working on improving this perf monitor, and if it becomes possible to toggle it manually, we'll let you know.
I see, thanks for letting me know.
> Regarding the original issue: this looks like a delay of work from the PowerShell handler to prepare a task for execution. In my repro, I have only ~8 seconds delay between these 2 log lines. Could you add the VSTSAGENT_TRACE environment variable to the agent process for verbose tracing and re-send your _diag logs in private messages of the Developer Community ticket?
Ok, I've uploaded new logs on the DC ticket.
I created 6 agents in a new pool for this test, on the same server as the other agent pools. Besides these 6 agents, there were 8 other agents running other pipelines on the server. All are running v3.238.0. The pipeline I executed ran the same job in parallel on all 6 agents. The job consists of 3 very simple PowerShell tasks:
I added the variable VSTSAGENT_TRACE set to true when I started the pipeline. I also ticked the Enable system diagnostics option.
A few observations:
Agent environment resources was printed every 6 seconds or so. Could the Agent environment resources checks be causing these delays?
Just let me know if you need anything else.
Hmmm. I ran my pipeline once again, but without setting the Enable system diagnostics option (VSTSAGENT_TRACE still set to true).
Now the delays are down to about 40s, and the finishing task is running fine. The total execution time went from over 11 minutes down to 2:40.
Uploaded those logs as well, in case the system diagnostics option added too much overhead or something.
Just as an update: almost a week since the last reboot, and the delays before starting a pipeline task are between 2 and 5 minutes. I'm running my repro pipeline again in this state and will upload the logs to the DevCommunity case later. After that I'll restart the server to get it up to speed again.
Oh, and this may be another clue for this issue:
Today, when everything was really slow, an "Azure file copy" (AzureFileCopy@6) task started to fail with authentication failures.
It turns out that the Azure file copy task retrieves an authentication token that has a 10-minute lifetime. But the task runs for more than 10 minutes between token retrieval and authentication, causing the authentication to fail.
The logs from that step:
2024-05-13T08:39:22.6335864Z ==============================================================================
2024-05-13T08:39:22.6336111Z Task : Azure file copy
2024-05-13T08:39:22.6336218Z Description : Copy files to Azure Blob Storage or virtual machines
2024-05-13T08:39:22.6336394Z Version : 6.239.5
2024-05-13T08:39:22.6336492Z Author : Microsoft Corporation
2024-05-13T08:39:22.6336629Z Help : https://docs.microsoft.com/azure/devops/pipelines/tasks/deploy/azure-file-copy
2024-05-13T08:39:22.6336842Z ==============================================================================
2024-05-13T08:43:33.4475525Z Added TLS 1.2 in session.
2024-05-13T08:45:30.9273237Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Accounts\2.17.0\Az.Accounts.psd1 -Global
2024-05-13T08:46:49.7738587Z ##[warning]You're using AzureRM which will be retired soon, please schedule an update.
2024-05-13T08:46:50.9308516Z ##[command]Clear-AzContext -Scope CurrentUser -Force -ErrorAction SilentlyContinue
2024-05-13T08:46:51.3628724Z ##[command]Clear-AzContext -Scope Process
2024-05-13T08:46:51.4119670Z ##[command]Connect-AzAccount
2024-05-13T08:46:51.4120435Z Name Value
2024-05-13T08:46:51.4121412Z ---- -----
2024-05-13T08:46:51.4121894Z Tenant {redacted}
2024-05-13T08:46:51.4122198Z Scope Process
2024-05-13T08:46:51.4122582Z FederatedToken {redacted}
2024-05-13T08:46:51.4122980Z Environment AzureCloud
2024-05-13T08:46:51.4123553Z ApplicationId ***
2024-05-13T08:46:51.4126751Z ServicePrincipal True
2024-05-13T08:46:51.4126899Z
2024-05-13T08:46:51.4127011Z
2024-05-13T08:46:51.4127077Z
2024-05-13T08:46:53.3001201Z
2024-05-13T08:46:53.3672456Z ##[command]Set-AzContext
2024-05-13T08:46:53.3672742Z Name Value
2024-05-13T08:46:53.3673156Z ---- -----
2024-05-13T08:46:53.3673635Z SubscriptionId {redacted}
2024-05-13T08:46:53.3673808Z
2024-05-13T08:46:53.3673890Z
2024-05-13T08:46:53.3674847Z
2024-05-13T08:46:54.0284574Z Account SubscriptionName TenantId Environment
2024-05-13T08:46:54.0294558Z ------- ---------------- -------- -----------
2024-05-13T08:46:54.0300991Z *** {redacted} {redacted} AzureCloud
2024-05-13T08:46:54.0301260Z
2024-05-13T08:46:54.0377676Z Name : {redacted} ({redacted}) - {redacted} - {redacted}
2024-05-13T08:46:54.0378386Z {redacted}
2024-05-13T08:46:54.0380214Z Subscription : {redacted}
2024-05-13T08:46:54.0381424Z Account : ***
2024-05-13T08:46:54.0383260Z Environment : AzureCloud
2024-05-13T08:46:54.0385204Z Tenant : {redacted}
2024-05-13T08:46:54.0385602Z TokenCache :
2024-05-13T08:46:54.0387443Z VersionProfile :
2024-05-13T08:46:54.0388721Z ExtendedProperties : {}
2024-05-13T08:46:54.0389129Z
2024-05-13T08:48:08.5776971Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Resources\6.16.1\Az.Resources.psd1 -Global
2024-05-13T09:08:17.7928883Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Storage\6.1.3\Az.Storage.psd1 -Global
2024-05-13T09:10:05.2987485Z ##[error]ClientAssertionCredential authentication failed: A configuration issue is preventing authentication - check the error message from the server for details. You can modify the configuration in the application registration portal. See https://aka.ms/msal-net-invalid-client for details. Original exception: AADSTS700024: Client assertion is not within its valid time range. Current time: 2024-05-13T09:10:05.1641790Z, assertion valid from 2024-05-13T08:46:51.0000000Z, expiry time of assertion 2024-05-13T08:56:51.0000000Z. Review the documentation at https://docs.microsoft.com/azure/active-directory/develop/active-directory-certificate-credentials . Trace ID: 156bae26-5b86-4a0f-9ddf-6e8d4c480500 Correlation ID: 4e1e6d17-5a52-43c9-835e-c9fcc7880e4a Timestamp: 2024-05-13 09:10:05Z
2024-05-13T09:10:05.3425662Z ##[section]Finishing: Update azure storage with lowercase
As seen above, there is first a 4-minute delay before the step actually starts. Then it takes over a minute to import Az.Resources, and then 20(!) minutes to import Az.Storage. (Maybe I'm misinterpreting the timestamps; maybe it is the import of Az.Resources that takes 20 minutes...?)
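The token-lifetime explanation matches the timestamps in the log. A quick sanity check with GNU `date` (timestamps copied from the AADSTS700024 error above; `date -d` is GNU-specific):

```shell
# Timestamps from the failing run (UTC), per the AADSTS700024 message.
expiry="2024-05-13T08:56:51Z"   # assertion expiry (issued 08:46:51 + 10 min)
used="2024-05-13T09:10:05Z"     # time the authentication was attempted
late=$(( ( $(date -u -d "$used" +%s) - $(date -u -d "$expiry" +%s) ) / 60 ))
echo "assertion used $late minutes after it expired"
```

So the module imports alone pushed the task well past the 10-minute assertion window.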
After rebooting the server I get these times for the imports (20 + 7 seconds):
2024-05-13T09:31:36.8458808Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Resources\6.16.1\Az.Resources.psd1 -Global
2024-05-13T09:31:55.5007557Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Storage\6.1.3\Az.Storage.psd1 -Global
2024-05-13T09:32:02.8702791Z mime: C:\Agent_AS015_12\_work\_tasks\AzureFileCopy_eb72cb01-a7e5-427b-a8a1-1b31ccac8a43\6.239.5\MimeMapping.json
What happened?
This is reported at https://developercommunity.visualstudio.com/t/PowerShell-task-idles-before-generating-/10636846, but got the suggestion to report here as well.
We’ve got several self-hosted Azure DevOps agents running on one of our servers.
We’ve seen an issue that has escalated in the last month or so. I believe it correlates with updating all agents to the latest release (v3.236.1), and at the same time we started to run on PowerShell 7 (with pwsh: true).
After the server is started, our pipelines run fast for several days, or maybe a week or so. Then we start to see performance problems, where a lot of tasks in the pipeline take much longer to complete.
It seems that it is the PowerShell tasks (PowerShell@2) that take longer and longer to start. When checking the logs we can see that the PowerShell task header is written to the log, then it idles before the text “Generating script” is logged. Looking at old logs I can see that it used to take about 3 or 4 s before “Generating script” was printed.
But lately, as the server slows down, this time keeps increasing until a colleague yells about it. Yesterday there were about 10 minutes between those two steps.
If I restart the server, all is fine for some days, then this starts again.
When monitoring the server, I can see that the pipeline agents create a LOT (50 to 100 or so) of WMIC processes, which seems to cause high CPU load in WmiPrvSE.exe… Is this expected behavior? They all just seem to check the available memory on the host.
Yesterday, just before restarting the server, I saw the above list of WMIC processes and they just seemed to live on. Today, I still see a lot of WMIC processes being started and we get a CPU spike in WmiPrvSE, but after a few seconds (3 to 4) they are terminated and CPU goes down.
We’ve got 12 agents running on the same machine, with 32 cores and 128GB of RAM, but since they are divided into two different AgentPools there are most often no more than 6 agents running at the same time.
I believe I found the code that is starting the WMIC processes:
https://github.com/microsoft/azure-pipelines-agent/blob/b34a9c376bf689d70092981eaa40d3e19327b11a/src/Agent.Worker/ResourceMetricsManager.cs#L255
But why it is creating that many processes puzzles me… There is a loop in RunMemoryUtilizationMonitor() that should wait 5 s between each check, but it seems this cannot be working as intended, or am I mistaken?
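The intended pattern in that loop, as I read it, is one short-lived query per 5-second tick rather than dozens of concurrent processes. A rough shell sketch of that pattern (the `wmic` query is Windows-only, so there is a placeholder fallback; this is my reading of the intent, not the agent's actual C# code):

```shell
# Sketch of a single-process memory poll with a 5 s interval.
# Assumption: this mirrors the *intent* of RunMemoryUtilizationMonitor(),
# not its actual implementation.
for i in 1 2 3; do
  # Windows-only WMIC query; fall back to a placeholder on other hosts.
  sample=$(wmic OS get FreePhysicalMemory /Value 2>/dev/null | tr -d '\r' | grep '=' \
           || echo 'FreePhysicalMemory=unknown')
  echo "sample $i: $sample"
  sleep 5  # the loop is supposed to wait 5 s between checks
done
```

If instead one such query is spawned per running job context on every tick, 12 agents with many contexts would easily explain 50+ concurrent WMIC processes.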
Versions
Agent version: 3.236.1
Windows version: Windows Server 2022, 21H2 (OS Build 20348.2402)
Environment type (Please select at least one environment where you face this issue)
Azure DevOps Server type
dev.azure.com (formerly visualstudio.com)
Azure DevOps Server Version (if applicable)
No response
Operation system
No response
Version control system
No response
Relevant log output