timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-61687] Exceptions due to race condition during build deletion #4925

Closed timja closed 4 years ago

timja commented 4 years ago

Messages like the following are being logged on ci.jenkins.io after updating to 2.222.1.

2020-03-25 19:18:34.958+0000 [id=21774]    WARNING j.m.BackgroundGlobalBuildDiscarder#lambda$processJob$0: An exception occurred when executing Project Build Discarder
Also:   java.nio.file.NoSuchFileException: /var/jenkins_home/jobs/Infra/jobs/plugin-site-api/branches/generate-data/builds/93295 -> /var/jenkins_home/jobs/Infra/jobs/plugin-site-api/branches/generate-data/builds/.93295
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
        at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
        at java.base/java.nio.file.Files.move(Files.java:1421)
        at hudson.model.Run.delete(Run.java:1621)
        at hudson.tasks.LogRotator.perform(LogRotator.java:166)
jenkins.util.io.CompositeIOException: Failed to rotate logs for [Infra/plugin-site-api/generate-data #93295]
    at hudson.tasks.LogRotator.perform(LogRotator.java:223)
    at hudson.model.Job.logRotate(Job.java:469)
    at jenkins.model.JobGlobalBuildDiscarderStrategy.apply(JobGlobalBuildDiscarderStrategy.java:54)
    at jenkins.model.BackgroundGlobalBuildDiscarder.lambda$processJob$0(BackgroundGlobalBuildDiscarder.java:67)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at jenkins.model.BackgroundGlobalBuildDiscarder.processJob(BackgroundGlobalBuildDiscarder.java:61)
    at jenkins.model.GlobalBuildDiscarderListener.onFinalized(GlobalBuildDiscarderListener.java:49)
    at hudson.model.listeners.RunListener.fireFinalized(RunListener.java:255)
    at hudson.model.Run.onEndBuilding(Run.java:2018)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:617)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$800(WorkflowRun.java:137)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:1018)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1463)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:488)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834) 

or

2020-03-25 19:28:42.053+0000 [id=56]   WARNING o.j.p.workflow.job.WorkflowRun#lambda$finish$2: failed to perform log rotation after Infra/plugin-site-api/generate-data #93303
Also:   java.nio.file.NoSuchFileException: /var/jenkins_home/jobs/Infra/jobs/plugin-site-api/branches/generate-data/builds/93296 -> /var/jenkins_home/jobs/Infra/jobs/plugin-site-api/branches/generate-data/builds/.93296
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
        at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
        at java.base/java.nio.file.Files.move(Files.java:1421)
        at hudson.model.Run.delete(Run.java:1621)
        at hudson.tasks.LogRotator.perform(LogRotator.java:166)
jenkins.util.io.CompositeIOException: Failed to rotate logs for [Infra/plugin-site-api/generate-data #93296]
    at hudson.tasks.LogRotator.perform(LogRotator.java:223)
    at hudson.model.Job.logRotate(Job.java:469)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.lambda$finish$2(WorkflowRun.java:612)
    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834) 

None of them occurred while background deletion was ongoing.


Originally reported by danielbeck, imported from: Exceptions due to race condition during build deletion
  • status: Resolved
  • priority: Minor
  • resolution: Fixed
  • resolved: 2020-08-26T07:44:49+00:00
  • imported: 2022/01/10
timja commented 4 years ago

danielbeck:

Obvious problem is that we call the globally configured-by-default "Job Build Discarder" when a build finishes, even though it's redundant. Unclear why there seems to be a race condition here though.

Run#delete looks reasonable enough.

timja commented 4 years ago

danielbeck:

https://github.com/jenkinsci/jenkins/blob/jenkins-2.222.1/core/src/main/java/hudson/model/Run.java#L1614-L1638

timja commented 4 years ago

danielbeck:

The problem occurs when many builds finish around the same time and is unrelated to the global build discarder. That may make it more likely to occur, unsure.

timja commented 4 years ago

danielbeck:

Run#delete looks reasonable enough.

Correction: This method makes absolutely no sense.

timja commented 4 years ago

danielbeck:

FWIW and AFAICT, the exceptions are annoying, but don't indicate actual problems affecting Jenkins during runtime. The same build is "just" being deleted twice, and one of the attempts ends up failing because there's nothing left to delete.

timja commented 4 years ago

kon:

The problem occurs when many builds finish around the same time and is unrelated to the global build discarder.

No, we got "An exception occurred when executing Project Build Discarder" from three consecutive builds even though there were several minutes of idle time between them.

Build Start Get node Finish Build Discarder
JobA #89 08:38:24 08:38:24 08:44:29 08:44:29 NoSuchFileException deleting JobA #39
JobB #579 08:49:18 08:49:19 08:53:32 08:53:53 NoSuchFileException deleting JobB #529
JobA #90 10:18:40 10:18:42 10:24:18 10:24:18 NoSuchFileException deleting JobA #40
JobB #580 10:19:50 10:24:17 10:28:53 No error

Periodic background build discarder ran hourly at **:26:01 or **:26:02, taking less than a second each time.

(I hypothesize that each build runs Project Build Discarder twice in parallel and those race against each other, and that build JobB #580 avoided the error because Periodic background build discarder had already discarded the build and both instances of Project Build Discarder then noticed that.)

The call stack is a bit different from what is in the issue description, because this is on Windows:

toukokuuta 06, 2020 10:24:18 AP. WARNING jenkins.model.BackgroundGlobalBuildDiscarder lambda$processJob$0

An exception occurred when executing Project Build Discarder
Also:   java.nio.file.NoSuchFileException: REDACTED\branches\master\builds\40 -> REDACTED\branches\master\builds\.40
        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:79)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
        at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
        at java.nio.file.Files.move(Files.java:1395)
        at hudson.model.Run.delete(Run.java:1621)
        at hudson.tasks.LogRotator.perform(LogRotator.java:166)
jenkins.util.io.CompositeIOException: Failed to rotate logs for [REDACTED/master #40]
    at hudson.tasks.LogRotator.perform(LogRotator.java:223)
    at hudson.model.Job.logRotate(Job.java:469)
    at jenkins.model.JobGlobalBuildDiscarderStrategy.apply(JobGlobalBuildDiscarderStrategy.java:54)
    at jenkins.model.BackgroundGlobalBuildDiscarder.lambda$processJob$0(BackgroundGlobalBuildDiscarder.java:67)
    at jenkins.model.BackgroundGlobalBuildDiscarder$$Lambda$230/0000000000000000.accept(Unknown Source)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at jenkins.model.BackgroundGlobalBuildDiscarder.processJob(BackgroundGlobalBuildDiscarder.java:61)
    at jenkins.model.GlobalBuildDiscarderListener.onFinalized(GlobalBuildDiscarderListener.java:49)
    at hudson.model.listeners.RunListener.fireFinalized(RunListener.java:255)
    at hudson.model.Run.onEndBuilding(Run.java:2018)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:625)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$800(WorkflowRun.java:137)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:1026)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1463)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:488)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:821)
timja commented 4 years ago

danielbeck:

(I hypothesize that each build runs Project Build Discarder twice in parallel and those race against each other

I don't remember how far along I was in the investigation when I posted the comment, but IIRC this is correct.

Specifically, the "background" build discarders are also run after a build finishes, in addition to the regular build discarder. The idea being, why should it only run periodically, when it could as well run when a build is finished, and not accumulate a lot of builds for its periodic run?

https://github.com/jenkinsci/jenkins/blob/da90af311587f6c3d37ec4e9c4b4637763924743/core/src/main/java/jenkins/model/GlobalBuildDiscarderListener.java#L46-L50

Now, if you configure the "job specific build discarder" as part of global/background build discarders (the default), there'll be two build discarder instantiations running after a build is finished, both configured to delete the exact same builds. So this configuration basically creates the circumstances under which the (AFAICT) long-standing concurrency bug occurs. (Strictly speaking, any global build discarder could trigger this, but if the configurations are different, it's likely that only one build discarder will ever find builds to delete.)

A workaround for this specific case only could be to skip the "project specific build discarder" when we're triggered by a build finishing (i.e. only run it when triggered by the periodic run), but that seems like a hack, but doesn't consider fixed global build discarders that match the job configuration exactly.

A more reasonable workaround, don't run multiple build discarders on a job in parallel. There's currently no synchronization there.


Ideally however, we'd fix the concurrency issues in build deletion, then neither builds in quick succession, nor background build discarders will be a problem.

timja commented 4 years ago

kon:

Would it be reasonable to specifically catch and swallow java.nio.file.NoSuchFileException from the Files.move call? Or would that just surface more concurrency problems in other parts of hudson.model.Run.delete?

timja commented 4 years ago

gmshake:

danielbeck I managed to reproduce this issue by a simple pipeline job.

properties ([buildDiscarder(logRotator(numToKeepStr: '1'))])

node {
    echo "Running ..."
}

Environment : Jenkins 2.235.2, Pipeline 2.6, with `Project Build Discarder` configured.

Steps to reproduce:

  1. Create a pipeline job with above script
  2. Hit `Build Now` for the initial build.
  3. Repeat step 2 and you will get `java.nio.file.NoSuchFileException` in system log.
timja commented 4 years ago

gmshake:

For the issue in the above simple pipeline job, I can conclude it was caused by concurrency in these lines:
https://github.com/jenkinsci/workflow-job-plugin/blob/81fa9191779d54f0641b3ff0cec4aeeb3dad3bbb/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L618-L625

    Timer.get().submit(() -> {
try {
    getParent().logRotate();
} catch (Exception x) {
    LOGGER.log(Level.WARNING, "failed to perform log rotation after " + this, x);
}
    });
    onEndBuilding();

the `onEndBuilding()` calls `RunListener.fireFinalized()` and finally `GlobalBuildDiscarderListener#onFinalized()` was called, and then there're two concurrent thread doing `logRotate`.

timja commented 4 years ago

gmshake:

danielbeck Draft PR filed: https://github.com/jenkinsci/jenkins/pull/4850

Sorry I've not tested locally as the change was made online. I've slow internet speed when cloning the jenkins repository.

I'll report back later when the regression tests are ready.

timja commented 4 years ago

gmshake:

Local test looks good with the fix

timja commented 4 years ago

allan_burdajewicz:

danielbeck By the way, isn't the "JobGlobalBuildDiscarderStrategy" supposed to run periodically ? According to the documentation:

Build discarders configured for a job are only run after a build finishes. This option runs jobs' configured build discarders periodically, applying configuration changes even when no new builds are run. This option has no effect if there is no build discarder configured for a job.

timja commented 4 years ago

danielbeck:

allan_burdajewicz

It does; and additionally it runs global build discarders once on a project when a build finishes. I thought it makes no sense to wait up to an hour (IIRC) to delete builds. I think the result is much nicer this way (except for exposing this bug in many situations, of course). The documentation probably just didn't keep up with the evolution of the feature and could be improved.

timja commented 4 years ago

oleg_nenashev:

Created JENKINS-63275 as a follow-up discussed in the PR

timja commented 4 years ago

antonio_petricca:

Same problem for me:

2020-08-26 07:26:05.412+0000 [id=511] WARNING j.m.BackgroundGlobalBuildDiscarder#lambda$processJob$0: An exception occurred when executing Project Build Discarder

2020-08-26 07:26:05.412+0000 [id=511] WARNING j.m.BackgroundGlobalBuildDiscarder#lambda$processJob$0: An exception occurred when executing Project Build DiscarderAlso:   java.nio.file.NoSuchFileException: /var/jenkins_home/jobs/simp/branches/tasks-BET-31993.du2529/builds/8 -> /var/jenkins_home/jobs/simp/branches/tasks-BET-31993.du2529/builds/.8 at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396) at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) at java.nio.file.Files.move(Files.java:1395) at hudson.model.Run.delete(Run.java:1645) at hudson.tasks.LogRotator.perform(LogRotator.java:166)jenkins.util.io.CompositeIOException: Failed to rotate logs for [simp/tasks%2FBET-31993 #8] at hudson.tasks.LogRotator.perform(LogRotator.java:223) at hudson.model.Job.logRotate(Job.java:469) at jenkins.model.JobGlobalBuildDiscarderStrategy.apply(JobGlobalBuildDiscarderStrategy.java:54) at jenkins.model.BackgroundGlobalBuildDiscarder.lambda$processJob$0(BackgroundGlobalBuildDiscarder.java:67) at java.lang.Iterable.forEach(Iterable.java:75) at jenkins.model.BackgroundGlobalBuildDiscarder.processJob(BackgroundGlobalBuildDiscarder.java:61) at jenkins.model.GlobalBuildDiscarderListener.onFinalized(GlobalBuildDiscarderListener.java:49) at hudson.model.listeners.RunListener.fireFinalized(RunListener.java:255) at hudson.model.Run.onEndBuilding(Run.java:2042) at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:625) at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$800(WorkflowRun.java:137) at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:1026) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1463) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:489) at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

 

timja commented 2 years ago

[Originally duplicated by: JENKINS-61256]

timja commented 2 years ago

[Originally duplicated by: JENKINS-61763]

timja commented 2 years ago

[Originally duplicated by: JENKINS-61958]