Creating the cache fails if there is any disconnection between agents and controller #206

stronk7 commented 1 year ago

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller: Kubernetes pod running the upstream Jenkins image with Debian 11. Agents: bare-metal servers running Ubuntu 22.04.

Reproduction steps

  1. Configure a job (in our case, some long Behat/PHPUnit runs) and add caching to the pipeline. We use it to cache a small (100 KB) JSON file between runs:
            stage("Run Task") {
                steps {
                    wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'xterm']) {
                        // Job Cacher step: restores the cache before the body and saves it afterwards.
                        cache(
                            maxCacheSize: 250,
                            caches: [
                                [$class: 'ArbitraryFileCache', excludes: '', includes: '*.json', path: "timing"]
                            ]
                        ) {
                          // It is not possible to use the registry with DSL yet.
                          script {
                            docker.withRegistry('', 'dockerhub') {
                              sh task.getPathToRunner(env, steps)
                            }
                          }
                        }
                    }
                }
            }
  2. Normally, the plugin does its work, restoring the JSON file from the cache at the beginning and storing it back at the end. See, for example, this run:
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Searching cache in job specific caches...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Found cache in job specific caches
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Restoring cache...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Cache restored in 1880ms
    == Exit summary:
    == Exit code: 0
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Creating cache...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Cache created in 966ms
    Finished: SUCCESS
  3. But, since these are very long builds (4 hours or more), there is sometimes a disconnection between the controller and the agents in the middle. That's not a problem for the tests, because they nearly always continue running and the agent reconnects with the controller automatically.
  4. In those cases, when a disconnection has happened, even though the tests have ended OK, something in the build post-actions causes the build to fail. See, for example, this run of the same job:
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Searching cache in job specific caches...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Found cache in job specific caches
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Restoring cache...
    [Cache for timing with id 4ad8aa3a3571ea912a6ec5ea5fdcc93c] Cache restored in 208ms
    ...................................................................... 9030
    .........Cannot contact hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@2a344988:JNLP4-connect connection from": Remote call on JNLP4-connect connection from failed. The channel is closing down or has closed down
    ............................................................. 9100
    ...................................................................... 9170
    ...................................................................... 9240
    == Exit summary:
    == Exit code: 0
    [Pipeline] End of Pipeline
    at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(
    at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(
    at org.jenkinsci.remoting.protocol.IOHub$
    at jenkins.util.ContextResettingExecutorService$
    at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(
    Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 5447d9be-a08c-48f7-97fa-a2a8797719c0
    Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@2a344988:JNLP4-connect connection from": Remote call on JNLP4-connect connection from failed. The channel is closing down or has closed down
    at hudson.FilePath.act(
    at hudson.FilePath.act(
    at jenkins.plugins.jobcacher.ArbitraryFileCache$SaverImpl.calculateSize(
    at jenkins.plugins.jobcacher.CacheManager.exceedsMaxCacheSize(
    at jenkins.plugins.jobcacher.pipeline.CacheStepExecution$ExecutionCallback.complete(
    at jenkins.plugins.jobcacher.pipeline.CacheStepExecution$ExecutionCallback.onSuccess(
    at org.jenkinsci.plugins.workflow.cps.CpsBodyExecution$SuccessAdapter.receive(
    at com.cloudbees.groovy.cps.Outcome.resumeFrom(
    at com.cloudbees.groovy.cps.Continuable$
    at com.cloudbees.groovy.cps.Continuable$
    at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(
    at org.codehaus.groovy.runtime.GroovyCategorySupport.use(
    at com.cloudbees.groovy.cps.Continuable.run0(
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(
    at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$
    at java.base/
    at hudson.remoting.SingleLaneExecutorService$
    at jenkins.util.ContextResettingExecutorService$
    at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(
    at java.base/java.util.concurrent.Executors$
    at java.base/
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.base/java.util.concurrent.ThreadPoolExecutor$
    at java.base/
    Finished: FAILURE
  5. As you can see, the tests end OK, but later, near the end, there is an error whose stack trace points (I think, I may be wrong!) to the JobCacher plugin.
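
One plausible reading of the trace (an assumption, not something confirmed in this report) is that the cache step keeps using a channel that was captured before the disconnection, so the remote call made from `calculateSize` fails even though the agent has already reconnected. The sketch below only illustrates that failure mode in plain Java. `FakeChannel`, `FakeAgent` and `StaleChannelDemo` are hypothetical names invented for the illustration; this is neither the Jenkins remoting API nor the plugin's code.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a remoting channel; not the Jenkins API. */
final class FakeChannel {
    private final String name;
    private volatile boolean closed;

    FakeChannel(String name) { this.name = name; }

    void close() { closed = true; }

    /** Simulates a remote call: fails once the channel has been closed. */
    long remoteSize(long bytes) {
        if (closed) {
            throw new IllegalStateException("Channel " + name + " is closed");
        }
        return bytes;
    }
}

/** Hypothetical agent that swaps in a new channel when it reconnects. */
final class FakeAgent {
    private final AtomicReference<FakeChannel> current =
            new AtomicReference<>(new FakeChannel("conn-1"));
    private int connections = 1;

    FakeChannel channel() { return current.get(); }

    void dropAndReconnect() {
        current.get().close();
        connections++;
        current.set(new FakeChannel("conn-" + connections));
    }
}

final class StaleChannelDemo {
    /** Captures the channel up front and reuses it after the reconnect. */
    static String sizeWithCapturedChannel(FakeAgent agent) {
        FakeChannel captured = agent.channel();   // taken when the step starts
        agent.dropAndReconnect();                 // brief disconnection mid-build
        try {
            return "size=" + captured.remoteSize(100_000);
        } catch (IllegalStateException e) {
            return "failed: " + e.getMessage();
        }
    }

    /** Looks the channel up again at the moment the cache is saved. */
    static String sizeWithFreshChannel(FakeAgent agent) {
        agent.dropAndReconnect();                 // same disconnection as above
        return "size=" + agent.channel().remoteSize(100_000);
    }

    public static void main(String[] args) {
        System.out.println(sizeWithCapturedChannel(new FakeAgent()));
        System.out.println(sizeWithFreshChannel(new FakeAgent()));
    }
}
```

In this toy model the first call fails in the same way as the build above, while the second succeeds because it asks the agent for its current channel. Whether the plugin behaves like this internally is exactly what would need to be checked.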

Expected Results

The job should end OK and the caches should be saved normally, regardless of the brief disconnection in the middle of the (long) test run.

Actual Results

Every time there is a brief disconnection (we have been monitoring this for a good while and the correlation is 1:1), the job fails with the output above, even though the tests themselves have ended OK.

Anything else?

We are still trying some runs without the plugin, or keeping that JSON file saved in some other way, to be 100% sure that the problem happens only with the plugin (that is, we aren't 100% sure yet). But we decided to report it already, because the stack trace really seems to point to it.

We have tried both S3 and in-controller storage (to rule out other factors) and the same behaviour happens regardless of the storage configured.

repolevedavaj commented 1 year ago

Hi @stronk7, thanks for the bug report. I totally agree that the error seems to come from this plugin. I will create a branch with a change that could solve this issue. Are you able to install a "pre-release" of the plugin on Jenkins to see if it solves the issue?

stronk7 commented 1 year ago

Hi @repolevedavaj,

I have never done that before, but I can try. It may be a good while, though, until we hit the disconnection problems that lead to this bug. I mean, we don't have the problem daily, only whenever, for any reason, the disconnections happen (sort of like when the cable is unplugged, heh).

Thanks for looking into this!

repolevedavaj commented 1 year ago

@stronk7 no worries, just let me know if it solves your issue :) You can download the plugin from here:

stronk7 commented 1 year ago

Thanks @repolevedavaj ,

now we are using 388.v2c5fc2012a_89 here. I will keep an eye on all the jobs that have disconnection problems and on what their (hopefully passing) new outcome is.

Will report back as soon as we have a case, ciao :-)

stronk7 commented 1 year ago

Uhm... the 3 jobs that have finished since I updated the plugin have ended with an ugly java.lang.NullPointerException at the end, without any stack trace. I'm guessing that may have happened because the plugin was updated in the middle of their execution.

So I've launched a new (quick, just a few minutes) build to see whether that null pointer exception happens in all builds, in which case I'll have to revert to the upstream version. For the record, the new build is this, let's see how it ends:

Ciao :-)

stronk7 commented 1 year ago

Ok, so it seems that new builds (previous comment) are passing OK and only those that were already running when I upgraded the plugin were caught in the middle.

So I’m going to keep the dev plugin installed, let’s see…

Ciao :)

repolevedavaj commented 1 year ago

Thanks for the update!

stronk7 commented 1 year ago

Hi @repolevedavaj,

it has been a long wait... but I think I come with good news.

We have had at least a couple of builds where the "Cannot contact workerXXX: java.lang.InterruptedException" message appeared in the middle of some long tests, which ran OK despite the disconnection. Afterwards, the post-actions (including the JobCacher one) didn't cause any problem, and the console is free of the stack traces reported above, with the job ending, as expected, with a nice SUCCESS.

Here are a couple of links to examples using the 388.v2c5fc2012a_89 version:

So, I'd say that your proposed changes really have fixed the reported problem, and now the JobCacher is immune to those potential disconnections happening in the middle. Great work!

Ciao :-)

repolevedavaj commented 1 year ago

Hi @stronk7 , thanks for the feedback! I merged the change (which will trigger the automatic release) :)