A set of thread dumps that appear to reproduce the same issue.
I was unable to load the Configure Clouds page, the system log contained a number of timeout messages from the leaked-resource cleanup, and no agents were spawning.
performanceData.7.output.tar.gz
Sep 06, 2021 9:22:57 AM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute
SEVERE: AzureVMAgentCleanUpTask: execute: Hit timeout while cleaning
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.execute(AzureVMAgentCleanUpTask.java:637)
at hudson.model.AsyncPeriodicWork.lambda$doRun$0(AsyncPeriodicWork.java:101)
at java.base/java.lang.Thread.run(Thread.java:829)
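For reference, the failure mode in this trace follows a common pattern: execute() runs the cleanup on a worker thread and waits on a Future with a fixed timeout. A minimal sketch of that pattern (class name, executor setup, and method bodies are made up here; this is not the plugin's actual code):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CleanUpTimeoutSketch {
    public void execute() throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        // The submitted lambda corresponds to lambda$execute$1 in the traces below.
        Future<?> future = executor.submit(this::clean);
        try {
            // Corresponds to FutureTask.get in the trace: wait at most 15 minutes.
            future.get(15, TimeUnit.MINUTES);
        } catch (TimeoutException e) {
            // Logged as "Hit timeout while cleaning". The worker thread is not
            // necessarily stopped by this: a thread blocked on a socket read can
            // ignore interruption and keep hanging.
        } catch (ExecutionException e) {
            // The cleanup itself threw an exception.
        } finally {
            executor.shutdownNow();
        }
    }

    private void clean() {
        // cleanVMs(); cleanDeployments(); cleanLeakedResources();
    }
}

If the worker never unblocks, each periodic run can strand one more thread, which would be consistent with the thread counts reported in this issue.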
Debug config:
-Djava.util.logging.config.file=<path>/logging.properties
.level=INFO
com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.level=FINE
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINE
java.util.logging.SimpleFormatter.format=[%1$tF %1$tT.%1$tL][%4$s][%2$s] %5$s %6$s%n
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
@CtxIstvans
I've created a PR at https://github.com/jenkinsci/azure-vm-agents-plugin/pull/300
This should at least stop the UI hanging: I can see in your thread dump that a thread tried to mark a VM as offline and was blocked.
It may also help with some of the other issues.
From what I can tell, the reason there are so many threads running is that the Azure VM Agents Clean Task is sometimes getting stuck.
You can check this in the task logs in $JENKINS_HOME/logs/tasks.
If you run:
grep 90000 *
You should see:
root@jenkins-0:/var/jenkins_home/logs/tasks# grep 90000 *
Azure VM Agents Clean Task.log.1:Finished at Mon Sep 06 07:22:57 UTC 2021. 900004ms
Azure VM Agents Clean Task.log.1:Finished at Mon Sep 06 07:42:57 UTC 2021. 900002ms
Azure VM Agents Clean Task.log.1:Finished at Mon Sep 06 08:02:57 UTC 2021. 900001ms
Azure VM Agents Clean Task.log.1:Finished at Mon Sep 06 08:22:57 UTC 2021. 900001ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 06:41:32 UTC 2021. 900005ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 07:01:32 UTC 2021. 900002ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 07:41:32 UTC 2021. 900001ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 09:41:32 UTC 2021. 900009ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 15:01:32 UTC 2021. 900008ms
If you're experiencing the same issue as me, you'll see durations just over 900000: 900,000 ms is 900 seconds, i.e. 15 minutes, which is how long the task runs before timing out.
From looking at my logs, the task normally completes in roughly 300 ms to 15,000 ms, but something occasionally causes it to get stuck for the full 15 minutes.
The temporary workaround appears to be restarting Jenkins.
I've added additional logging in https://github.com/jenkinsci/azure-vm-agents-plugin/pull/300, which should at least show where it's getting stuck.
You can add a LogRecorder for com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask and set it to FINE.
That won't persist, though; if you want it to survive restarts, you can add a config file with the contents:
.level=INFO
com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.level=FINE
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINE
java.util.logging.SimpleFormatter.format=[%1$tF %1$tT.%1$tL][%4$s][%2$s] %5$s %6$s%n
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
and add the JVM option -Djava.util.logging.config.file=<path>/logging.properties
to your Jenkins startup script.
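For example, if you run the official Jenkins Docker image (an assumption about your setup), the option can go into the JAVA_OPTS environment variable; the path here is illustrative:

JAVA_OPTS="-Djava.util.logging.config.file=/var/jenkins_home/logging.properties"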
I deployed it to a couple of Jenkins instances, and one of them hit the issue in cleanLeakedResources:
[2021-09-07 07:41:34.465][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute] AzureVMAgentCleanUpTask: execute: start
[2021-09-07 07:41:34.466][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute] AzureVMAgentCleanUpTask: execute: Running clean with 15 minute timeout
[2021-09-07 07:41:34.467][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanVMs] AzureVMAgentCleanUpTask: cleanVMs: beginning
[2021-09-07 07:41:34.467][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanVMs] AzureVMAgentCleanUpTask: cleanVMs: completed
[2021-09-07 07:41:34.467][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments] AzureVMAgentCleanUpTask: cleanDeployments: Cleaning deployments
[2021-09-07 07:41:34.474][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments] AzureVMAgentCleanUpTask: cleanDeployments: Done cleaning deployments
[2021-09-07 07:41:34.474][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanLeakedResources] AzureVMAgentCleanUpTask: cleanLeakedResources: beginning
[2021-09-07 07:56:34.467][SEVERE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute] AzureVMAgentCleanUpTask: execute: Hit timeout while cleaning
java.util.concurrent.TimeoutException
at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.execute(AzureVMAgentCleanUpTask.java:628)
[2021-09-07 07:56:34.468][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute] AzureVMAgentCleanUpTask: execute: end
The worker thread was stuck at:
at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanLeakedResources(AzureVMAgentCleanUpTask.java:353)
at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanLeakedResources(AzureVMAgentCleanUpTask.java:328)
at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.clean(AzureVMAgentCleanUpTask.java:608)
at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.lambda$execute$1(AzureVMAgentCleanUpTask.java:618)
So my issue might be different from yours.
I figured out that the metadata service is hanging inside my Jenkins pod on Kubernetes. We're using user-assigned managed identities for authentication, and Jenkins was not able to get an access token.
(Root cause not tracked down yet; Microsoft have been looking into it for the last couple of days.)
This was the root cause for us: https://github.com/Azure/aad-pod-identity/issues/977
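If you want to check whether the metadata service is responding from inside the pod, a quick probe along these lines can help (the timeout values are arbitrary; this is a diagnostic sketch, not part of the plugin):

import java.net.HttpURLConnection;
import java.net.URL;

public class ImdsProbe {
    public static void main(String[] args) throws Exception {
        // The standard IMDS token endpoint used by managed identities; with
        // aad-pod-identity this traffic is intercepted by the NMI pod.
        URL url = new URL("http://169.254.169.254/metadata/identity/oauth2/token"
                + "?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.azure.com%2F");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Metadata", "true");
        conn.setConnectTimeout(5_000); // fail fast instead of hanging forever
        conn.setReadTimeout(5_000);
        System.out.println("IMDS responded: HTTP " + conn.getResponseCode());
    }
}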
Is it still happening for you?
I'm going to close this, given the lack of response and that there have been a couple of performance fixes in the plugin recently.
Version report
Jenkins and plugins versions report:
Reproduction steps
Not sure if it can be reproduced; it just happened. Our Jenkins instance gradually reached a very high CPU load (>100). Based on the thread dump, I suspect it is the Azure VM plugin trying to clean up resources using ~360 threads, even though there are no Azure VM agents running. Full thread dump attached.
(This appears in the thread dump ~360 times)
I think this causes the issue, as my feeling is that only one instance of such a cleanup task should run at a time. But it could be something completely different.
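A minimal sketch of that expectation (illustrative only, not the plugin's actual code): a guard that skips a scheduled run while the previous one is still in progress, rather than letting runs stack up into hundreds of threads.

import java.util.concurrent.atomic.AtomicBoolean;

public class NonOverlappingTask {
    private final AtomicBoolean running = new AtomicBoolean(false);

    public void runIfIdle(Runnable work) {
        // compareAndSet fails if a previous run is still marked as running.
        if (!running.compareAndSet(false, true)) {
            return; // skip this cycle instead of piling up another thread
        }
        try {
            work.run();
        } finally {
            running.set(false);
        }
    }
}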
Results
Expected result:
No high load.
Actual result:
High load.