jenkinsci / docker-plugin

Jenkins cloud plugin that uses Docker
https://plugins.jenkins.io/docker-plugin/
MIT License

UI shows active (running) agents as suspended #881

Closed fraz3alpha closed 2 years ago

fraz3alpha commented 2 years ago

Jenkins and plugins versions report

Environment ```text Jenkins: 2.319.3 OS: Linux - 4.15.0-173-generic --- Parameterized-Remote-Trigger:3.1.5.1 ace-editor:1.1 allure-jenkins-plugin:2.30.2 ansicolor:1.0.1 ant:1.13 antisamy-markup-formatter:2.7 apache-httpcomponents-client-4-api:4.5.13-1.0 appscan:1.0.9 artifactory:3.14.1 authentication-tokens:1.4 authorize-project:1.4.0 aws-credentials:191.vcb_f183ce58b_9 aws-java-sdk:1.12.163-315.v2b_716ec8e4df aws-java-sdk-cloudformation:1.12.163-315.v2b_716ec8e4df aws-java-sdk-codebuild:1.12.163-315.v2b_716ec8e4df aws-java-sdk-ec2:1.12.163-315.v2b_716ec8e4df aws-java-sdk-ecr:1.12.163-315.v2b_716ec8e4df aws-java-sdk-ecs:1.12.163-315.v2b_716ec8e4df aws-java-sdk-elasticbeanstalk:1.12.163-315.v2b_716ec8e4df aws-java-sdk-iam:1.12.163-315.v2b_716ec8e4df aws-java-sdk-logs:1.12.163-315.v2b_716ec8e4df aws-java-sdk-minimal:1.12.163-315.v2b_716ec8e4df aws-java-sdk-ssm:1.12.163-315.v2b_716ec8e4df badge:1.9.1 basic-branch-build-strategies:1.3.2 blueocean:1.25.3 blueocean-autofavorite:1.2.5 blueocean-bitbucket-pipeline:1.25.3 blueocean-commons:1.25.3 blueocean-config:1.25.3 blueocean-core-js:1.25.3 blueocean-dashboard:1.25.3 blueocean-display-url:2.4.1 blueocean-events:1.25.3 blueocean-git-pipeline:1.25.3 blueocean-github-pipeline:1.25.3 blueocean-i18n:1.25.3 blueocean-jira:1.25.3 blueocean-jwt:1.25.3 blueocean-personalization:1.25.3 blueocean-pipeline-api-impl:1.25.3 blueocean-pipeline-editor:1.25.3 blueocean-pipeline-scm-api:1.25.3 blueocean-rest:1.25.3 blueocean-rest-impl:1.25.3 blueocean-web:1.25.3 bootstrap4-api:4.6.0-3 bootstrap5-api:5.1.3-6 bouncycastle-api:2.25 branch-api:2.1044.v2c007e51b_87f build-keeper-plugin:1.3 caffeine-api:2.9.2-29.v717aac953ff3 checks-api:1.7.2 cloud-stats:0.27 cloudbees-bitbucket-branch-source:762.v969cfe087fc0 cloudbees-disk-usage-simple:0.10 cloudbees-folder:6.714.v79e858ef76a_2 command-launcher:1.6 conditional-buildstep:1.4.2 config-file-provider:3.9.0 configuration-as-code:1414.v878271fc496f copyartifact:1.46.3 
credentials:1087.v16065d268466 credentials-binding:1.27.1 cvs:2.19 dark-theme:156.v6cf16af6f9ef display-url-api:2.3.5 docker-commons:1.19 docker-java-api:3.1.5.2 docker-plugin:1.2.7 docker-workflow:1.28 downstream-build-cache:1.7 durable-task:495.v29cd95ec10f2 dynamicparameter:0.2.0 echarts-api:5.3.0-2 email-ext:2.87 embeddable-build-status:2.0.3 extended-choice-parameter:346.vd87693c5a_86c extended-read-permission:3.2 external-monitor-job:191.v363d0d1efdf8 favorite:2.4.1 font-awesome-api:6.0.0-1 ghprb:1.42.2 git:4.11.0 git-client:3.11.0 git-server:1.10 git-userContent:1.4 github:1.34.3 github-api:1.301-378.v9807bd746da5 github-branch-source:2.11.4 github-scm-trait-notification-context:1.1 global-slack-notifier:1.5 google-oauth-plugin:1.0.6 gradle:1.38 groovy-postbuild:2.5 handlebars:3.0.8 handy-uri-templates-2-api:2.1.8-1.0 hidden-parameter:0.0.4 htmlpublisher:1.29 http_request:1.14 ibm-application-security:1.3.2 ibm-cloud-devops:2.0.16 ibm-g11n-pipeline:2.0.0 ibm-ucdeploy-publisher:1.2.7 icon-shim:3.0.0 ivy:2.1 jackson2-api:2.13.2-260.v43d711474c77 javadoc:217.v905b_86277a_2a_ javax-activation-api:1.2.0-2 javax-mail-api:1.6.2-5 jdk-tool:1.5 jenkins-design-language:1.25.3 jira:3.7 jjwt-api:0.11.2-9.c8b45b8bb173 jnr-posix-api:3.1.7-3 job-dsl:1.78.3 jobConfigHistory:1133.v0f5420f85053 jobcacher:1.0 jquery:1.12.4-1 jquery-detached:1.2.1 jquery3-api:3.6.0-2 jsch:0.1.55.2 junit:1.54 junit-testrail:1.0.7-SNAPSHOT-Gagan kubernetes:3580.v78271e5631dc kubernetes-client-api:5.12.1-187.v577c3e368fb_6 kubernetes-credentials:0.9.0 label-linked-jobs:6.0.1 ldap:2.8 lockable-resources:2.14 login-theme:1.1 logstash:2.5.0205.vd05825ed46bd mailer:408.vd726a_1130320 mapdb-api:1.0.9.0 mask-passwords:3.0 material-theme:0.4.1 matrix-auth:3.1 matrix-project:758.v7a_ea_491852f3 maven-plugin:3.16 mercurial:2.16 metrics:4.1.6.1 momentjs:1.1.1 next-build-number:1.8 oauth-credentials:0.5 okhttp-api:4.9.3-105.vb96869f8ac3a pam-auth:1.7 parameterized-trigger:2.44 pipeline-build-step:2.16 
pipeline-github:2.8-138.d766e30bb08b pipeline-graph-analysis:188.v3a01e7973f2c pipeline-input-step:446.vf27b_0b_83500e pipeline-milestone-step:100.v60a_03cd446e1 pipeline-model-api:1.9.3 pipeline-model-declarative-agent:1.1.1 pipeline-model-definition:1.9.3 pipeline-model-extensions:1.9.3 pipeline-rest-api:2.23 pipeline-stage-step:291.vf0a8a7aeeb50 pipeline-stage-tags-metadata:1.9.3 pipeline-stage-view:2.23 pipeline-utility-steps:2.12.0 plain-credentials:1.8 plugin-util-api:2.16.0 popper-api:1.16.1-2 popper2-api:2.11.4-1 postbuild-task:1.9 pubsub-light:1.16 purge-build-queue-plugin:33.v59111a_551b_38 rebuild:1.33 run-condition:1.5 saferestart:0.3 saml:2.296.v0016349946db_ scm-api:595.vd5a_df5eb_0e39 script-security:1138.v8e727069a_025 scriptler:3.4 sidebar-link:2.1.0 simple-theme-plugin:103.va_161d09c38c7 slack:608.v19e3b_44b_b_9ff slack-uploader:1.7 snakeyaml-api:1.29.1 sonar:2.14 sse-gateway:1.25 ssh:2.6.1 ssh-agent:1.24.1 ssh-credentials:1.19 ssh-slaves:1.806.v2253cedd3295 sshd:3.226.vb_1769a_7fb_b_a_6 structs:308.v852b473a2b8c subversion:2.15.3 theme-manager:0.6 timestamper:1.17 token-macro:285.vff7645a_56ff0 translation:1.16 trilead-api:1.57.v6e90e07157e1 uno-choice:2.6.1 variant:1.4 windows-slaves:1.8 workflow-aggregator:2.7 workflow-api:1143.v2d42f1e9dea_5 workflow-basic-steps:2.24 workflow-cps:2660.vb_c0412dc4e6d workflow-cps-global-lib:564.ve62a_4eb_b_e039 workflow-durable-task-step:1128.v8c259d125340 workflow-job:1145.v7f2433caa07f workflow-multibranch:711.vdfef37cda_816 workflow-remote-loader:1.5 workflow-scm-step:2.13 workflow-step-api:622.vb_8e7c15b_c95a_ workflow-support:813.vb_d7c3d2984a_0 yet-another-build-visualizer:1.15 ```

What Operating System are you using (both controller, and any agents involved in the problem)?

Jenkins hosted in Kubernetes, agents running in a Docker swarm hosted on Ubuntu 18.04. Connection is via SSH

Reproduction steps

We recently upgraded to Jenkins 2.332.1 (LTS) across our Jenkins install base. Instances that also upgraded the docker plugin from v1.2.6 to v1.2.7 appear to have a UI bug: Jenkins shows an agent as "suspended" even though it is running a job.

I have included a couple of screenshots below of two such Jenkins instances:

[Two screenshots: "Screenshot 2022-04-14 at 09 40 42" and "Screenshot 2022-04-14 at 09 40 32"]

The jobs are running quite happily, and don't appear bothered that the agent is marked in the UI as suspended.

In the node's log it says it has connected and is online:

...
Evacuated stdout
Agent successfully connected and online

Expected Results

Online agents running work should not be showing as suspended.

Actual Results

Online agents running work are showing as suspended.

Anything else?

Docker plugin v1.2.7 came out on 2022-04-07, I think, and we upgraded our Jenkins instances on 2022-04-12. Upgraded Jenkins instances with the older version of the docker plugin seem fine; the problem only appears once the plugin is upgraded to v1.2.7. I haven't yet confirmed whether it shows up on every single instance with an upgraded docker plugin, but it has on every one I have looked at so far.

I don't know whether an un-upgraded version of Jenkins (we were running 2.319.3) with the new plugin has the same issue, as this was only noticed after we upgraded all the instances we have - and I haven't yet had chance to find an older version of Jenkins to check it out.

| Jenkins version | Docker plugin version | State |
| --- | --- | --- |
| 2.319.3 | 1.2.6 | OK (normal) |
| 2.319.3 | 1.2.7 | ? |
| 2.332.1 | 1.2.6 | OK (normal) |
| 2.332.1 | 1.2.7 | Shows agents as suspended |

My colleague was looking at this last night, and the last message I have from him is:

> It just seems to be a visual bug, though, if I inspect individual DockerTransientNode instances in the script console they have all the right booleans in the right places

I will work on getting some more information out of the agents state and try and work out which field is responsible - but has anyone else seen this?

fraz3alpha commented 2 years ago

Trying to track down the calling path:

Here is the Jelly for the UI component:

<j:if test="${!c.acceptingTasks}"> <st:nbsp/> (${%suspended})</j:if>

So if I run the following in the script console:

Jenkins.instance.computers.each{
  if (it instanceof io.jenkins.docker.DockerComputer) {
      println "'${it.displayName}' acceptingTasks=${it.acceptingTasks}"
  }
}

return

I am indeed getting false for everything:

'dal-buildx-00027v89s27st' acceptingTasks=false
'dal-buildx-00027vzdkje18' acceptingTasks=false
'dal-buildx-0002855uw7zxk' acceptingTasks=false
'dal-buildx-000285owyea05' acceptingTasks=false
'dal-buildx-000285tkijbcq' acceptingTasks=false
...

despite them all running work.

fraz3alpha commented 2 years ago

I think it's down to this new override code added in the refactor for the retention strategy:

    @Override
    public synchronized boolean isAcceptingTasks(DockerComputer c) {
        return !getTerminateOnceDone();
    }

This is presumably now always returning false for our run-once containers, and the UI is using this to display the state. As long as it is only using this to display the state, then it's going to continue to run jobs, but it definitely looks odd and like it isn't a healthy node.
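To make that interaction concrete, here is a minimal, self-contained sketch. It does not use the Jenkins or docker-plugin APIs: `RunOnceStrategySketch`, its `terminateOnceDone` field, and `statusSuffix` are hypothetical names that only mimic the override and the Jelly test quoted earlier. It also assumes that a run-once agent has its terminate-once-done flag set for its whole lifetime, which is a guess at the cause rather than confirmed behaviour.

```java
// Hypothetical stand-in for the retention strategy; NOT the plugin's real class.
class RunOnceStrategySketch {
    // Assumption: true for run-once containers from the moment they start.
    private final boolean terminateOnceDone;

    RunOnceStrategySketch(boolean terminateOnceDone) {
        this.terminateOnceDone = terminateOnceDone;
    }

    // Mirrors the quoted override: refuse new work if the agent is to be
    // terminated once its current work is done.
    synchronized boolean isAcceptingTasks() {
        return !terminateOnceDone;
    }

    // Mimics the Jelly test: "(suspended)" is appended whenever
    // acceptingTasks is false, regardless of whether the agent is busy.
    static String statusSuffix(boolean acceptingTasks) {
        return acceptingTasks ? "" : " (suspended)";
    }

    public static void main(String[] args) {
        RunOnceStrategySketch runOnce = new RunOnceStrategySketch(true);
        RunOnceStrategySketch reusable = new RunOnceStrategySketch(false);
        System.out.println("run-once agent" + statusSuffix(runOnce.isAcceptingTasks()));
        System.out.println("reusable agent" + statusSuffix(reusable.isAcceptingTasks()));
    }
}
```

Under these assumptions the run-once agent is labelled as suspended while the reusable one is not, which matches what the script console output above suggests.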

pjdarton commented 2 years ago

You're 99.99% likely correct. ...but I've looked through the code, and that method appears to be the only way of having it answer "No, I'm not accepting anything new now" (it's the only "can you take on more work?" method available). FYI, without this code change, folks were reporting bugs best explained by race conditions in the old code; I'm reasonably confident that this code was necessary.

i.e. It is "unfortunate" that the UI displays it in such an unflattering way - it's certainly telling the user things that are rather misleading.

So, unless the core code is written in a way such that it's possible for a plugin (like the docker plugin) to provide extra resources/string-definitions/descriptor-methods to provide "a better interpretation of the situation" (and hence making it possible to fix this by code changes in this plugin), this issue can only be fixed by changing the core UI code.

Personally, I'd suggest that if the agent is online and busy, the UI shouldn't be showing it as "suspended". That kind of UI change to the core code could be a fairly "lightweight" enhancement...
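The suggested rule could be sketched roughly as follows. This is only an illustration of the idea, not existing Jenkins core code: `AgentStatusLabelSketch` and its `online`, `busy`, and `acceptingTasks` parameters are hypothetical, and how core would actually obtain those values is left open.

```java
// Hypothetical display rule; NOT existing Jenkins core code.
class AgentStatusLabelSketch {
    // Returns the text to append after the agent name in the UI.
    static String statusSuffix(boolean online, boolean busy, boolean acceptingTasks) {
        if (acceptingTasks) {
            return "";               // normal case: nothing to flag
        }
        if (online && busy) {
            return "";               // still doing work: do not call it suspended
        }
        return " (suspended)";       // idle or offline and refusing work
    }

    public static void main(String[] args) {
        // A run-once agent that is online, busy, and refusing further work.
        System.out.println("busy run-once agent" + statusSuffix(true, true, false));
        // An idle agent that refuses work is still flagged.
        System.out.println("idle agent" + statusSuffix(true, false, false));
    }
}
```

The trade-off of such a rule is that an agent an administrator has deliberately suspended would also lose the label while it finishes its current job, so a real change might prefer a separate wording for that case.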

fraz3alpha commented 2 years ago

Thank you for taking a look at this @pjdarton, always appreciated!

I'll see if I can take a look at this again next week. I don't expect to find anything over and above what you've said, but it would be good for me to see how the different parts work and educate myself a little more.

If nothing else, this issue can serve as an explanation as to why the UI isn't showing what a user would expect it to - I suspect others will come this way after a while.

dnwe commented 2 years ago

👋 I was just about to come and raise this same issue — glad to find myself amongst friends who have already raised it 😎

pjdarton commented 2 years ago

I've had confirmation in other issues that the changes have fixed the race condition whereby containers were being reused when they shouldn't've been, so it wouldn't be right to revert those (necessary) functional changes for the sake of cosmetic appearances (even though I agree that it's very confusing).

However, I have made a note in the changelog about this, in case others come looking. ...and hope that someone (Andy?) might raise a PR against the core UI code to either resolve it there or to make it possible for plugins to control that presentation.