jenkinsci / jobcacher-plugin

Jenkins plugin that improves build performance for transient agents by caching files
https://plugins.jenkins.io/jobcacher/
MIT License

Creating the cache fails if there is any disconnection between agents and controller #206

Closed. stronk7 closed this issue 1 year ago.

stronk7 commented 1 year ago

Jenkins and plugins versions report

Environment

```text
Jenkins: 2.401.1
OS: Linux - 5.4.241-150.347.amzn2.x86_64
Java: 11.0.19 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
all-changes:1.5 ansicolor:1.0.2 ant:487.vd79d090d4ea_e antisamy-markup-formatter:159.v25b_c67cd35fb_ apache-httpcomponents-client-4-api:4.5.14-150.v7a_b_9d17134a_5 authentication-tokens:1.53.v1c90fd9191a_b_ aws-credentials:191.vcb_f183ce58b_9 aws-java-sdk:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-cloudformation:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-codebuild:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-ec2:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-ecr:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-ecs:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-efs:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-elasticbeanstalk:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-iam:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-kinesis:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-logs:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-minimal:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-sns:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-sqs:1.12.481-392.v8b_291cfcda_09 aws-java-sdk-ssm:1.12.481-392.v8b_291cfcda_09 bootstrap5-api:5.3.0-1 bouncycastle-api:2.28 branch-api:2.1109.vdf225489a_16d build-failure-analyzer:2.4.1 build-timeout:1.31 build-user-vars-plugin:1.9 build-with-parameters:76.v9382db_f78962 caffeine-api:3.1.6-115.vb_8b_b_328e59d8 categorized-view:1.12 checks-api:2.0.0 cloudbees-folder:6.815.v0dd5a_cb_40e0e cobertura:1.17 code-coverage-api:4.7.0 collapsing-console-sections:1.8.0 command-launcher:100.v2f6722292ee8 commons-lang3-api:3.12.0-36.vd97de6465d5b_ commons-text-api:1.10.0-36.vc008c8fcda_7b_ conditional-buildstep:1.4.2 copyartifact:705.v5295cffec284 credentials:1254.vb_96f366e7b_a_d credentials-binding:604.vb_64480b_c56ca_ dashboard-view:2.487.vcf0ff9008a_c0 data-tables-api:1.13.4-2 display-upstream-changes:0.3.2 display-url-api:2.3.7 docker-build-publish:1.4.0 docker-commons:419.v8e3cd84ef49c docker-workflow:563.vd5d2e5c4007f durable-task:507.v050055d0cb_dd echarts-api:5.4.0-5 embeddable-build-status:385.vc95f94e91fb_3 extended-read-permission:53.v6499940139e5 extensible-choice-parameter:1.8.0 external-monitor-job:203.v683c09d993b_9 extra-columns:1.25 font-awesome-api:6.4.0-1 forensics-api:2.3.0 git:5.1.0 git-client:4.4.0 git-parameter:0.9.18 git-server:99.va_0826a_b_cdfa_d google-login:1.7 instance-identity:173.va_37c494ec4e5 ionicons-api:56.v1b_1c8c49374e jackson2-api:2.15.2-350.v0c2f3f8fc595 jakarta-activation-api:2.0.1-3 jakarta-mail-api:2.0.1-3 javadoc:233.vdc1a_ec702cff javax-activation-api:1.2.0-6 javax-mail-api:1.6.2-9 jaxb:2.3.8-1 jdk-tool:66.vd8fa_64ee91b_d jersey2-api:2.39.1-2 jira:3.10 jnr-posix-api:3.1.17-1 job-import-plugin:3.6 jobConfigHistory:1212.vd4470d08ff12 jobcacher:387.v41e893017398 jquery:1.12.4-1 jquery3-api:3.7.0-1 jsch:0.2.8-65.v052c39de79b_2 junit:1207.va_09d5100410f junit-attachments:167.vf1d139e316b_3 leastload:3.0.0 locale:314.v22ce953dfe9e lockable-resources:1166.v7ef984f72fe2 mailer:457.v3f72cb_e015e5 matrix-auth:3.1.8 matrix-project:789.v57a_725b_63c79 maven-plugin:3.22 metrics:4.2.18-439.v86a_20b_a_8318b_ mina-sshd-api-common:2.10.0-69.v28e3e36d18eb_ mina-sshd-api-core:2.10.0-69.v28e3e36d18eb_ monitoring:1.94.1 next-executions:179.vb_e011d7e3b_64 nodelabelparameter:1.12.0 notification:1.15 pam-auth:1.10 parameterized-trigger:2.45 pipeline-build-step:496.v2449a_9a_221f2 pipeline-github-lib:42.v0739460cda_c4 pipeline-graph-analysis:202.va_d268e64deb_3 pipeline-groovy-lib:656.va_a_ceeb_6ffb_f7 pipeline-input-step:468.va_5db_051498a_4 pipeline-milestone-step:111.v449306f708b_7 pipeline-model-api:2.2141.v5402e818a_779 pipeline-model-definition:2.2141.v5402e818a_779 pipeline-model-extensions:2.2141.v5402e818a_779 pipeline-rest-api:2.32 pipeline-stage-step:305.ve96d0205c1c6 pipeline-stage-tags-metadata:2.2141.v5402e818a_779 pipeline-stage-view:2.32 plain-credentials:143.v1b_df8b_d3b_e48 plugin-util-api:3.3.0 prism-api:1.29.0-7 rebuild:320.v5a_0933a_e7d61 resource-disposer:0.22 role-strategy:633.v836e5b_3e80a_5 run-condition:1.5 s3:0.12.3445.vda_704535b_5a_d scm-api:672.v64378a_b_20c60 script-security:1251.vfe552ed55f8d sectioned-view:1.25 snakeyaml-api:1.33-95.va_b_a_e3e47b_fa_4 ssh-credentials:305.v8f4381501156 sshd:3.303.vefc7119b_ec23 strict-crumb-issuer:2.1.1 structs:324.va_f5d6774f3a_d swarm:3.40 template-project:1.5.2 thinBackup:1.17 throttle-concurrents:2.13 timestamper:1.25 token-macro:359.vb_cde11682e0c trilead-api:2.84.v72119de229b_7 validating-string-parameter:2.8 variant:59.vf075fe829ccb view-job-filters:364.v48a_33389553d workflow-aggregator:596.v8c21c963d92d workflow-api:1215.v2b_ee3e1b_dd39 workflow-basic-steps:1017.vb_45b_302f0cea_ workflow-cps:3691.v28b_14c465a_b_b_ workflow-durable-task-step:1247.v7f9dfea_b_4fd0 workflow-job:1308.v58d48a_763b_31 workflow-multibranch:756.v891d88f2cd46 workflow-scm-step:415.v434365564324 workflow-step-api:639.v6eca_cd8c04a_a_ workflow-support:839.v35e2736cfd5c ws-cleanup:0.45 ws-ws-replacement:1.0.1
```

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller: Kubernetes pod running the upstream Jenkins image with Debian 11. Agents: bare-metal servers running Ubuntu 22.04.

Reproduction steps

  1. Configure a job (in our case, some long Behat/PHPUnit runs) and add the caching in the pipeline. We use this to cache a simple (100 KB) JSON file between runs:
            stage("Run Task") {
                steps {
                    wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'xterm']) {
                        cache(
                            maxCacheSize: 250,
                            caches: [
                                [$class: 'ArbitraryFileCache', excludes: '', includes: '*.json', path: "timing"]
                            ]
                        ) {
                          // It is not possible to use the registry with DSL yet.
                          script {
                            docker.withRegistry('', 'dockerhub') {
                              sh task.getPathToRunner(env, steps)
                            }
                          }
                        }
                    }
                }
            }
  2. Normally, the plugin does its work, restoring the JSON file from the cache at the beginning and storing it back at the end. See, for example, this run:
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Searching cache in job specific caches...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Found cache in job specific caches
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Restoring cache...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Cache restored in 1880ms
    ...
    ...
    == Exit summary:
    == Exit code: 0
    ...
    ...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Creating cache...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Cache created in 966ms
    ...
    ...
    Finished: SUCCESS
  3. But since these are very long builds (4 hours or more), there is sometimes a disconnection between the controller and the agents in the middle. That is not a problem for the tests themselves, because nearly always... they continue running and the agent reconnects to the controller automatically.
  4. In those cases, when a disconnection has happened, even though the tests have ended ok... something in the build post-actions causes the build to fail. See, for example, this run of the same job above:
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Searching cache in job specific caches...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Found cache in job specific caches
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Restoring cache...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Cache restored in 208ms
    ...
    ...
    ...................................................................... 9030
    .........Cannot contact xxx.xxx.xxx.xxx.xxx: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@2a344988:JNLP4-connect connection from xxx.xxx.xxx.xxx.xxx": Remote call on JNLP4-connect connection from xxx.xxx.xxx.xxx.xxx failed. The channel is closing down or has closed down
    ............................................................. 9100
    ...................................................................... 9170
    ...................................................................... 9240
    ...
    ...
    == Exit summary:
    == Exit code: 0
    ...
    ...
    [Pipeline] End of Pipeline
    java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155)
    at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:143)
    at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
    at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
    Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 5447d9be-a08c-48f7-97fa-a2a8797719c0
    Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@2a344988:JNLP4-connect connection from xxx.xxx.xxx.xxx.xxx": Remote call on JNLP4-connect connection from xxx.xxx.xxx.xxx.xxx failed. The channel is closing down or has closed down
    at hudson.remoting.Channel.call(Channel.java:993)
    at hudson.FilePath.act(FilePath.java:1192)
    at hudson.FilePath.act(FilePath.java:1181)
    at jenkins.plugins.jobcacher.ArbitraryFileCache$SaverImpl.calculateSize(ArbitraryFileCache.java:354)
    at jenkins.plugins.jobcacher.CacheManager.exceedsMaxCacheSize(CacheManager.java:116)
    at jenkins.plugins.jobcacher.CacheManager.save(CacheManager.java:77)
    at jenkins.plugins.jobcacher.pipeline.CacheStepExecution$ExecutionCallback.complete(CacheStepExecution.java:93)
    at jenkins.plugins.jobcacher.pipeline.CacheStepExecution$ExecutionCallback.onSuccess(CacheStepExecution.java:74)
    at org.jenkinsci.plugins.workflow.cps.CpsBodyExecution$SuccessAdapter.receive(CpsBodyExecution.java:371)
    at com.cloudbees.groovy.cps.Outcome.resumeFrom(Outcome.java:70)
    at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:149)
    at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:146)
    at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
    at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
    at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:146)
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
    at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:187)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:420)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:330)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:294)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
    at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
    Finished: FAILURE
  5. As you can see, the tests end ok, but near the end there is that error, whose stack trace points (I think, I may be wrong!) to the JobCacher plugin.
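
For context: the frame `jenkins.plugins.jobcacher.ArbitraryFileCache$SaverImpl.calculateSize` in the trace above calls `FilePath.act(...)`, which executes a remote call over the agent channel after the build body has already finished. Here is a minimal Java sketch of that pattern (a hypothetical illustration, not the plugin's actual code) showing why a dropped channel surfaces exactly at this point:

```java
import hudson.FilePath;
import hudson.remoting.VirtualChannel;
import jenkins.MasterToSlaveFileCallable;

import java.io.File;
import java.io.IOException;

// Hypothetical sketch, not the plugin's actual code: computing the cache
// size requires executing a callable on the agent, so it needs a live
// remoting channel even though the build steps have already finished.
public class CacheSizeSketch {

    static long calculateSize(FilePath cachePath) throws IOException, InterruptedException {
        // FilePath.act() serializes the callable, sends it over the agent
        // channel, and runs it remotely. If the channel was closed by a
        // mid-build disconnection, this call throws ChannelClosedException,
        // which propagates up and fails the otherwise successful build.
        return cachePath.act(new MasterToSlaveFileCallable<Long>() {
            @Override
            public Long invoke(File f, VirtualChannel channel) {
                return f.length(); // simplified; the real plugin sizes a file tree
            }
        });
    }
}
```

This would also explain why the failure is independent of the configured storage backend (see "Anything else?" below): the exception is thrown while sizing the cache on the agent, before anything is uploaded.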

Expected Results

The job should end ok and the caches should be saved normally, despite the brief disconnection in the middle of the (long) test run.

Actual Results

Each time there is a brief disconnection (we have been monitoring this for quite some time now, and the correlation is 1:1), the job fails with the information above, even though the tests themselves have ended ok.

Anything else?

We are still trying some runs without the plugin, or keeping that JSON file saved some other way, to be 100% sure the problem happens only with the plugin (i.e., we aren't 100% sure yet). But we decided to report it already, because the stack trace really seems to point to it.

We have tried both S3 and on-controller storage (to rule out factors), and the same behaviour happens regardless of the configured storage.

repolevedavaj commented 1 year ago

Hi @stronk7, thanks for the bug report. I totally agree that the error seems to come from this plugin. I will create a branch with a change that could solve this issue. Are you able to install a "pre-release" of the plugin on your Jenkins to see if it solves the issue?
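
(For illustration only: one plausible shape for such a change, assuming the goal is to stop a dead channel from failing an otherwise green build. The actual change in PR #207 may well differ.)

```java
import hudson.model.TaskListener;
import hudson.remoting.ChannelClosedException;

import java.io.IOException;

// Hypothetical mitigation sketch, not the actual PR #207 change: downgrade a
// cache-save failure caused by a closed agent channel to a console warning
// instead of letting the exception fail the build.
public class NonFatalCacheSave {

    // Stand-in for the plugin's cache-saving routine (hypothetical interface).
    interface CacheSaver {
        void save() throws IOException, InterruptedException;
    }

    static void saveBestEffort(CacheSaver saver, TaskListener listener)
            throws IOException, InterruptedException {
        try {
            saver.save();
        } catch (ChannelClosedException e) {
            // ChannelClosedException is a subtype of IOException; catching only
            // this subtype keeps other I/O failures fatal.
            listener.getLogger().println(
                    "[Cache] Skipping cache save, agent channel is closed: " + e.getMessage());
        }
    }
}
```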

stronk7 commented 1 year ago

Hi @repolevedavaj,

I have never done that before, but I can try. Although it may be a good while until we hit the disconnection problems leading to the current bug: we don't have the problem daily, only whenever, for whatever reason, the disconnections happen (sort of like when the cable gets unplugged, heh).

Thanks for looking into this!

repolevedavaj commented 1 year ago

@stronk7 no worries, just let me know if it solves your issue :) You can download the plugin from here: https://ci.jenkins.io/job/Plugins/job/jobcacher-plugin/job/PR-207/1/artifact/org/jenkins-ci/plugins/jobcacher/388.v2c5fc2012a_89/jobcacher-388.v2c5fc2012a_89.hpi

stronk7 commented 1 year ago

Thanks @repolevedavaj ,

we are now using 388.v2c5fc2012a_89 here. I will keep an eye on all the jobs having disconnection problems and on what their (hopefully passing) new outcome is.

Will report back as soon as we have a case, ciao :-)

stronk7 commented 1 year ago

Uhm... the 3 jobs that have finished since I updated the plugin have ended with an ugly java.lang.NullPointerException at the end, without any stack trace. I am just guessing that this can happen because the plugin was updated in the middle of their execution.

So I've launched a new (quick, just a few minutes) build to see if that NullPointerException is happening to all builds, in which case I'll have to revert to the upstream version. For the record, the new build is this one, let's see how it ends:

https://ci.moodle.org/view/B%20-%20401/job/W401.02.01%20-%20Behat%20-%20Goutte%20+%20Postgres/190/console

Ciao :-)

stronk7 commented 1 year ago

Ok, so it seems that new builds (see the previous comment) are passing ok, and only those that were already running when I upgraded the plugin were caught in the middle.

So I'm going to keep the dev plugin installed, let's see...

Ciao :)

repolevedavaj commented 1 year ago

Thanks for the update!

stronk7 commented 1 year ago

Hi @repolevedavaj,

it has been a long wait... but I think I come bearing good news.

We have had at the very least a couple of builds where the "Cannot contact workerXXX: java.lang.InterruptedException" happened in the middle of long tests that ran ok despite the disconnection. Afterwards, the post-actions (including the JobCacher one) did not cause any problem, the console is free of the stack traces reported above, and the job ended, as expected, with a nice SUCCESS.

Here are a couple of links to examples using the 388.v2c5fc2012a_89 version:

So I'd say that your proposed changes really have fixed the reported problem, and JobCacher is now immune to those disconnections happening in the middle. Great work!

Ciao :-)

repolevedavaj commented 1 year ago

Hi @stronk7 , thanks for the feedback! I merged the change (which will trigger the automatic release) :)