jenkinsci / azure-vm-agents-plugin

This repo is for azure vm agents plugin for jenkins. Azure devops CICD is the team which owns it for now
https://plugins.jenkins.io/azure-vm-agents/
46 stars 102 forks source link

plugin causing high load? #295

Closed CtxIstvans closed 3 years ago

CtxIstvans commented 3 years ago

Version report

Jenkins and plugins versions report:

Jenkins: 2.303
OS: Linux - 5.4.0-1055-azure
---
cloudbees-folder:6.15
azure-credentials:182.v3ccd4a755864
blueocean-i18n:1.24.8
external-monitor-job:1.7
configuration-as-code:1.51
ssh-slaves:1.32.0
workflow-aggregator:2.6
pipeline-stage-step:2.5
mapdb-api:1.0.9.0
docker-workflow:1.26
jjwt-api:0.11.2-9.c8b45b8bb173
change-assembly-version-plugin:1.10
jdk-tool:1.5
port-allocator:1.8
plugin-util-api:2.4.0
sse-gateway:1.24
pipeline-model-api:1.8.5
blueocean-pipeline-scm-api:1.24.8
checks-api:1.7.2
cobertura:1.16
oauth-credentials:0.4
popper2-api:2.5.4-3
jquery:1.12.4-1
matrix-project:1.19
basic-branch-build-strategies:1.3.2
blueocean-bitbucket-pipeline:1.24.8
github:1.33.1
test-results-analyzer:0.3.5
bouncycastle-api:2.21
audit-trail:3.10
kubernetes-client-api:5.4.1
credentials:2.5
blueocean-config:1.24.8
kubernetes-credentials:0.9.0
workflow-multibranch:2.26
lockable-resources:2.11
authentication-tokens:1.4
okhttp-api:3.14.9
antisamy-markup-formatter:2.1
google-oauth-plugin:1.0.6
command-launcher:1.6
cloudbees-bitbucket-branch-source:2.9.9
snakeyaml-api:1.29.1
handy-uri-templates-2-api:2.1.8-1.0
blueocean-pipeline-api-impl:1.24.8
mailer:1.34
variant:1.4
handlebars:3.0.8
windows-slaves:1.8
cloud-stats:0.27
ace-editor:1.1
sonar:2.13.1
file-operations:1.11
bootstrap4-api:4.6.0-3
pipeline-utility-steps:2.8.0
git-server:1.10
pipeline-model-definition:1.8.5
blueocean-jwt:1.24.8
jacoco:3.3.0
custom-tools-plugin:0.8
slack:2.48
timestamper:1.13
blueocean-personalization:1.24.8
blueocean-jira:1.24.8
blueocean-git-pipeline:1.24.8
pipeline-stage-view:2.19
extended-read-permission:3.2
trilead-api:1.0.13
workflow-job:2.41
copyartifact:1.46.1
workflow-step-api:2.24
ivy:2.1
pipeline-rest-api:2.19
token-macro:2.15
pubsub-light:1.16
pipeline-graph-analysis:1.11
authorize-project:1.4.0
git-client:3.8.0
momentjs:1.1.1
font-awesome-api:5.15.3-4
bootstrap5-api:5.0.2-1
extended-choice-parameter:0.82
pipeline-input-step:2.12
maven-plugin:3.12
github-api:1.123
favorite:2.3.3
credentials-binding:1.27
apache-httpcomponents-client-4-api:4.5.13-1.0
scm-api:2.6.4
blueocean-rest-impl:1.24.8
echarts-api:5.1.2-3
azure-sdk:25.vcde945b30f00
structs:1.23
blueocean-github-pipeline:1.24.8
workflow-support:3.8
azure-vm-agents:783.v58077630847d
caffeine-api:2.9.1-23.v51c4e2c879c8
junit:1.51
run-condition:1.5
docker-commons:1.17
ldap:2.7
plain-credentials:1.7
jobConfigHistory:2.28.1
pipeline-build-step:2.14
blueocean-autofavorite:1.2.4
matrix-auth:2.6.8
pam-auth:1.6
powershell:1.5
forensics-api:1.2.1
workflow-basic-steps:2.23
metrics:4.0.2.8
mercurial:2.15
blueocean-executor-info:1.24.8
blueocean-commons:1.24.8
config-file-provider:3.8.1
blueocean-pipeline-editor:1.24.8
build-timeout:1.20
javadoc:1.6
gradle:1.37.1
android-emulator:3.1.2
blueocean-events:1.24.8
bitbucket:1.1.29
sshd:3.1.0
jsch:0.1.55.2
workflow-cps-global-lib:2.21
blueocean-web:1.24.8
blueocean:1.24.8
workflow-cps:2.93
workflow-scm-step:2.13
pipeline-model-extensions:1.8.5
jquery-detached:1.2.1
data-tables-api:1.10.25-2
branch-api:2.6.5
artifactory:3.12.5
script-security:1.77
display-url-api:2.3.5
azure-commons:1.1.3
zephyr-for-jira-test-management:1.5
code-coverage-api:1.4.0
http_request:1.9.0
blueocean-rest:1.24.8
ant:1.11
jenkins-design-language:1.24.8
ssh-agent:1.23
subversion:2.14.4
blueocean-core-js:1.24.8
jquery3-api:3.6.0-2
workflow-durable-task-step:2.39
pipeline-milestone-step:1.3.2
popper-api:1.16.1-2
job-dsl:1.77
durable-task:1.38
github-branch-source:2.11.1
kubernetes:1.30.1
blueocean-dashboard:1.24.8
jackson2-api:2.12.4
pipeline-stage-tags-metadata:1.8.5
ssh-credentials:1.19
workflow-api:2.46
git:4.8.0
htmlpublisher:1.25
jira:3.5
blueocean-display-url:2.4.1
Linux Ubuntu 18 LTS

Reproduction steps

Not sure if it can be reproduced, it just happened. Our Jenkins instance incrementally reached a very high CPU load (>100). Based on the thread dump I suspect it is the Azure VM plugin trying to clean up resources using ~360 threads, even though there are no Azure VM agents running. Full thread dump attached.

(This appears in the thread dump ~360 times)

"pool-21-thread-207" #1366201 prio=5 os_prio=0 tid=0x000055deccbe6000 nid=0x31ee runnable [0x00007f3da7d41000]
   java.lang.Thread.State: RUNNABLE
    at java.util.TimeZone.clone(TimeZone.java:738)
    at sun.util.calendar.ZoneInfo.clone(ZoneInfo.java:647)
    at java.util.TimeZone.getDefault(TimeZone.java:625)
    at java.time.ZoneId.systemDefault(ZoneId.java:274)
    at java.time.Clock.systemDefaultZone(Clock.java:178)
    at java.time.OffsetDateTime.now(OffsetDateTime.java:212)
    at com.azure.core.credential.AccessToken.isExpired(AccessToken.java:43)
    at com.azure.core.implementation.AccessTokenCache.lambda$retrieveToken$7(AccessTokenCache.java:124)
    at com.azure.core.implementation.AccessTokenCache$$Lambda$588/1338686956.get(Unknown Source)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:44)
    at reactor.core.publisher.FluxRepeatWhen$RepeatWhenMainSubscriber.resubscribe(FluxRepeatWhen.java:184)
    at reactor.core.publisher.FluxRepeatWhen$RepeatWhenOtherSubscriber.onNext(FluxRepeatWhen.java:239)
    at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerNext(FluxConcatMap.java:281)
    at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onNext(FluxConcatMap.java:860)
    at reactor.core.publisher.FluxConcatMap$WeakScalarSubscription.request(FluxConcatMap.java:476)
    at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.set(Operators.java:2193)
    at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:442)
    at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerComplete(FluxConcatMap.java:295)
    at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onComplete(FluxConcatMap.java:884)
    at reactor.core.publisher.FluxConcatMap$WeakScalarSubscription.request(FluxConcatMap.java:477)
    at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.request(Operators.java:2157)
    at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.request(FluxConcatMap.java:329)
    at reactor.core.publisher.Operators$DeferredSubscription.request(Operators.java:1679)
    at reactor.core.publisher.FluxRepeatWhen$RepeatWhenMainSubscriber.onComplete(FluxRepeatWhen.java:164)
    at reactor.core.publisher.Operators.complete(Operators.java:136)
    at reactor.core.publisher.MonoEmpty.subscribe(MonoEmpty.java:45)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
    at reactor.core.publisher.MonoFromFluxOperator.subscribe(MonoFromFluxOperator.java:81)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
    at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4150)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:255)
    at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
    at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
    at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
    at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
    at reactor.core.publisher.MonoDeferContextual.subscribe(MonoDeferContextual.java:55)
    at reactor.core.publisher.Flux.subscribe(Flux.java:8235)
    at reactor.core.publisher.FluxExpand$ExpandBreathSubscriber.drainQueue(FluxExpand.java:179)
    at reactor.core.publisher.FluxExpand.subscribeOrReturn(FluxExpand.java:71)
    at reactor.core.publisher.InternalFluxOperator.subscribe(InternalFluxOperator.java:55)
    at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:54)
    at reactor.core.publisher.Flux.subscribe(Flux.java:8235)
    at reactor.core.publisher.FluxExpand$ExpandBreathSubscriber.drainQueue(FluxExpand.java:179)
    at reactor.core.publisher.FluxExpand.subscribeOrReturn(FluxExpand.java:71)
    at reactor.core.publisher.InternalFluxOperator.subscribe(InternalFluxOperator.java:55)
    at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:54)
    at reactor.core.publisher.Flux.subscribe(Flux.java:8235)
    at reactor.core.publisher.Flux.blockLast(Flux.java:2518)
    at com.azure.core.util.paging.ContinuablePagedByIteratorBase.requestPage(ContinuablePagedByIteratorBase.java:94)
    - locked <0x0000000088082408> (a com.azure.core.util.paging.ContinuablePagedByItemIterable$ContinuablePagedByItemIterator)
    at com.azure.core.util.paging.ContinuablePagedByItemIterable$ContinuablePagedByItemIterator.<init>(ContinuablePagedByItemIterable.java:50)
    at com.azure.core.util.paging.ContinuablePagedByItemIterable.iterator(ContinuablePagedByItemIterable.java:37)
    at com.azure.core.util.paging.ContinuablePagedIterable.iterator(ContinuablePagedIterable.java:106)
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanLeakedResources(AzureVMAgentCleanUpTask.java:367)
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanLeakedResources(AzureVMAgentCleanUpTask.java:336)
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.clean(AzureVMAgentCleanUpTask.java:614)
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.access$200(AzureVMAgentCleanUpTask.java:73)
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask$3.call(AzureVMAgentCleanUpTask.java:626)
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask$3.call(AzureVMAgentCleanUpTask.java:623)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I think this causes the issue as my feeling is that such a cleanup task should run only once at the same time. But it can be something completely different.

image

Results

Expected result:

No high load.

Actual result:

High load.

CtxIstvans commented 3 years ago

jenkins-threaddump-20210729-154022-1380.txt

timja commented 3 years ago

A set of thread dumps that appear to reproduce the same issue.

I was unable to load the configure clouds page, a number of logs about timeout cleaning leaked resources in the system log and no agents were spawning

performanceData.7.output.tar.gz

Sep 06, 2021 9:22:57 AM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute
SEVERE: AzureVMAgentCleanUpTask: execute: Hit timeout while cleaning
java.util.concurrent.TimeoutException
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.execute(AzureVMAgentCleanUpTask.java:637)
    at hudson.model.AsyncPeriodicWork.lambda$doRun$0(AsyncPeriodicWork.java:101)
    at java.base/java.lang.Thread.run(Thread.java:829)

Debug config:

-Djava.util.logging.config.file=<path>/logging.properties
.level=INFO
com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.level=FINE

handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINE
java.util.logging.SimpleFormatter.format=[%1$tF %1$tT.%1$tL][%4$s][%2$s] %5$s %6$s%n
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
timja commented 3 years ago

@CtxIstvans

I've created a PR at https://github.com/jenkinsci/azure-vm-agents-plugin/pull/300

This should help at least. The UI should no longer experience any hanging, I can see in your thread dump that someone tried to mark a VM as offline and that was blocked.

This may also help with some of the other issues

From what I can tell the reason there's so many threads running is that the Azure VM Agents Clean Task task is getting stuck sometimes.

You can check this in the task log in $JENKINS_HOME/logs/tasks

If you run:

grep 90000 *

You should see:

root@jenkins-0:/var/jenkins_home/logs/tasks# grep 90000 *
Azure VM Agents Clean Task.log.1:Finished at Mon Sep 06 07:22:57 UTC 2021. 900004ms
Azure VM Agents Clean Task.log.1:Finished at Mon Sep 06 07:42:57 UTC 2021. 900002ms
Azure VM Agents Clean Task.log.1:Finished at Mon Sep 06 08:02:57 UTC 2021. 900001ms
Azure VM Agents Clean Task.log.1:Finished at Mon Sep 06 08:22:57 UTC 2021. 900001ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 06:41:32 UTC 2021. 900005ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 07:01:32 UTC 2021. 900002ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 07:41:32 UTC 2021. 900001ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 09:41:32 UTC 2021. 900009ms
Azure VM Agents Clean Task.log.3:Finished at Thu Sep 02 15:01:32 UTC 2021. 900008ms

If you're experiencing the same issue as me, 900000 is 15 minutes in ms which is how long the task will run for before timing out.

From looking at my logs in general the task normally runs for 300ms => 15,000ms. But something is causing it to get stuck occasionally.

The temporary work around appears to be to restart Jenkins.

I've added additional logging in https://github.com/jenkinsci/azure-vm-agents-plugin/pull/300 which should at least show where it's getting stuck

You can add a LogRecorder for com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask and set it to FINE.

That won't persist though, if you want it to persist across reboots you can add a config file with the contents:

.level=INFO
com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.level=FINE

handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINE
java.util.logging.SimpleFormatter.format=[%1$tF %1$tT.%1$tL][%4$s][%2$s] %5$s %6$s%n
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter

and add the VM property -Djava.util.logging.config.file=<path>/logging.properties to your Jenkins startup script

timja commented 3 years ago

I deployed it to a couple of Jenkins instances and one of them hit the issue in cleanLeakedResources:

[2021-09-07 07:41:34.465][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute] AzureVMAgentCleanUpTask: execute: start
[2021-09-07 07:41:34.466][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute] AzureVMAgentCleanUpTask: execute: Running clean with 15 minute timeout
[2021-09-07 07:41:34.467][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanVMs] AzureVMAgentCleanUpTask: cleanVMs: beginning
[2021-09-07 07:41:34.467][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanVMs] AzureVMAgentCleanUpTask: cleanVMs: completed
[2021-09-07 07:41:34.467][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments] AzureVMAgentCleanUpTask: cleanDeployments: Cleaning deployments
[2021-09-07 07:41:34.474][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments] AzureVMAgentCleanUpTask: cleanDeployments: Done cleaning deployments
[2021-09-07 07:41:34.474][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanLeakedResources] AzureVMAgentCleanUpTask: cleanLeakedResources: beginning
[2021-09-07 07:56:34.467][SEVERE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute] AzureVMAgentCleanUpTask: execute: Hit timeout while cleaning
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.execute(AzureVMAgentCleanUpTask.java:628)
[2021-09-07 07:56:34.468][FINE][com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute] AzureVMAgentCleanUpTask: execute: end
        at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanLeakedResources(AzureVMAgentCleanUpTask.java:353)
        at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanLeakedResources(AzureVMAgentCleanUpTask.java:328)
        at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.clean(AzureVMAgentCleanUpTask.java:608)
        at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.lambda$execute$1(AzureVMAgentCleanUpTask.java:618)
timja commented 3 years ago

So my issue might be different to yours.

I figured out that the metadata service is hanging inside my Jenkins pod on Kubernetes. We're using user assigned managed identities for authentication and Jenkins was not able to get an access token.

(root cause not tracked down yet, MSFT have been looking the last couple of days)

timja commented 3 years ago

This was the root cause for us: https://github.com/Azure/aad-pod-identity/issues/977

Is it still happening for you?

timja commented 3 years ago

I'm going to close this given lack of response and that there's been a couple of performance fixes in the plugin recently.