timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-22395] Run.delete (from LogRotator) failing with "...looks to have already been deleted" #5020

Closed timja closed 7 years ago

timja commented 10 years ago

Sometimes "log rotation" fails with an exception like

SEVERE  hudson.model.Run#execute: Failed to rotate log
java.io.IOException: .../jobs/.../modules/...$.../builds/2013-... looks to have already been deleted
    at hudson.model.Run.delete(Run.java:1432)
    at hudson.maven.MavenModuleSetBuild.delete(MavenModuleSetBuild.java:420)
    at hudson.tasks.LogRotator.perform(LogRotator.java:136)
    at hudson.model.Job.logRotate(Job.java:437)
    at hudson.maven.MavenModuleSet.logRotate(MavenModuleSet.java:851)
    at hudson.model.Run.execute(Run.java:1728)
    at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:509)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:246)

Usually MavenModuleSetBuild is involved, perhaps suggesting a problem with deleted or skipped module builds (I cannot reproduce in either scenario), though I think I have also seen this happen for freestyle builds. Unclear whether the directory actually exists but File.isDirectory thinks it does not (the code originally tried to delete the dir without checking this and failed with "... is in use"); or whether the directory was actually deleted earlier but this Run was not cleaned up properly for some reason.

Hypothesis: AbstractLazyLoadRunMap.idOnDisk is not removed by removeValue, called from AbstractProject.removeRun (from Run.delete). Perhaps something is later resurrecting the AbstractBuild from idOnDisk, making it again available for another round of log rotation? But if the actual directory was deleted, it is hard to see how: load would just fail.

Or perhaps Run.delete is the race condition: the run is removed from the parent after its directory has been deleted. The method is synchronized, but that does not help if two copies of the Run exist, which might happen due to other lazy-loading bugs.

Diagnostics added to date:


Originally reported by jglick, imported from: Run.delete (from LogRotator) failing with "...looks to have already been deleted"
  • assignee: jglick
  • status: Resolved
  • priority: Major
  • resolution: Fixed
  • resolved: 2017-08-08T22:18:31+00:00
  • imported: 2022/01/10
timja commented 10 years ago

jglick:

Not to be confused with JENKINS-19377, which had a similar stack trace but was easily reproducible and limited to external monitor jobs.

timja commented 10 years ago

jglick:

Diagnostic build: http://repo.jenkins-ci.org/public/org/jenkins-ci/main/jenkins-war/1.532.2.JENKINS-22395-diag/jenkins-war-1.532.2.JENKINS-22395-diag.war

timja commented 10 years ago

jglick:

Enhanced logging suggests that when MavenModuleSetBuild.delete leads to this error, it is deleting a MavenModuleBuild with the same number (successfully), and then (via getModuleBuilds) one with a later number that had already been deleted during an earlier MavenModuleSetBuild.delete. In other words, AbstractBuild.getNextBuild is returning a build which was already (recently) deleted from its parent! That suggests that dropLinks failed to actually drop the link from the older build—perhaps because the referent of previousBuildR in the newer build was null, perhaps because the older build was collected and then recreated.

timja commented 10 years ago

jglick:

Attaching patch of new backport of purported fix (as well as diagnostics): http://repo.jenkins-ci.org/public/org/jenkins-ci/main/jenkins-war/1.532.3.JENKINS-22395/jenkins-war-1.532.3.JENKINS-22395.war

timja commented 10 years ago

jglick:

Build of https://github.com/jglick/jenkins/compare/jenkinsci:jenkins-1.532.3...JENKINS-22395-diag available: http://repo.jenkins-ci.org/public/org/jenkins-ci/main/jenkins-war/1.532.3.JENKINS-22395-2/jenkins-war-1.532.3.JENKINS-22395-2.war

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
core/src/main/java/hudson/model/Run.java
http://jenkins-ci.org/commit/jenkins/c1fcb08e08c3ccae1451602af8ea20788dc9950e
Log:
JENKINS-17236 Run.deleteArtifacts needs to go through ArtifactManager.
(Also relevant to JENKINS-22395 diagnostics.)

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
core/src/main/java/hudson/model/Run.java
core/src/main/java/hudson/tasks/LogRotator.java
core/src/main/java/jenkins/model/StandardArtifactManager.java
http://jenkins-ci.org/commit/jenkins/3ef8063e43640952442de90e034dadba2ada337c
Log:
JENKINS-22395 Yet more diagnostics.

timja commented 10 years ago

dogfood:

Integrated in jenkins_main_trunk #3347
JENKINS-22395 Yet more diagnostics. (Revision 3ef8063e43640952442de90e034dadba2ada337c)

Result = SUCCESS
Jesse Glick : 3ef8063e43640952442de90e034dadba2ada337c
Files :

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/f1430a25ec5481b3635d6cf4fb7ea57a890e50f0
Log:
[FIXED JENKINS-22395] Attempt to double-delete a Run probably caused by getNextBuild sometimes returning a build which was already deleted.

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
changelog.html
http://jenkins-ci.org/commit/jenkins/ce9d47774e0eb67129099268f047f2d64f4667a7
Log:
JENKINS-22395 Noting merge of #1190.

Compare: https://github.com/jenkinsci/jenkins/compare/0761c7795d16...ce9d47774e0e

timja commented 10 years ago

dogfood:

Integrated in jenkins_main_trunk #3360
[FIXED JENKINS-22395] Attempt to double-delete a Run probably caused by getNextBuild sometimes returning a build which was already deleted. (Revision f1430a25ec5481b3635d6cf4fb7ea57a890e50f0)
JENKINS-22395 Noting merge of #1190. (Revision ce9d47774e0eb67129099268f047f2d64f4667a7)

Result = SUCCESS
Jesse Glick : f1430a25ec5481b3635d6cf4fb7ea57a890e50f0
Files :

Jesse Glick : ce9d47774e0eb67129099268f047f2d64f4667a7
Files :

timja commented 10 years ago

jglick:

Attempted fix did not work.

timja commented 10 years ago

danielbeck:

A comment posted in (otherwise unrelated) JENKINS-16805 indicates this can happen after Jenkins fails to delete a renamed build directory once.

Hypothesis: Run.delete() renames the build dir from DATE to .DATE first, then deletes that dir. If deleting it fails twice (e.g. due to file locking) Jenkins does not remove the build from the project in memory. Any subsequent deletion attempts would then fail to find the DATE build dir and print that message.

In this case, reloading from disk resolved the issue (as the renamed dir no longer matches the required pattern).

timja commented 10 years ago

jglick:

Maybe such a problem also exists, but that is not the bug I am tracking. In the case reported to me, Run.delete evidently succeeds, but the Run is still returned from getNextBuild of its immediate predecessor a moment later, which should never happen. (getNextBuild should only return a nondeleted build, or null if this was the most recent nondeleted build.) The reported exception then follows because in the case of a Maven module set, there can be multiple builds of a given module associated with one build of the module set—the build of matching number, and then any succeeding builds up until some cutoff—and all are deleted when the module set build is deleted in the course of routine LogRotator cleanup.

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/53e95ee517259e1e10d7209bcbaa4314397c1f40
Log:
JENKINS-22395

I'm not sure if I understand how the original fix in PR #1190 (f1430a2) addresses the problem.

My understanding is of the problem is as follows:

Presumably f1430a2 addresses this somehow, but I think I can induce this situation in other ways,
which is what dropLinksAfterGC2() does.

In this test,

initial state:

b1 <> b2 <> b3

we load everyone and connect them all up

after deleting b2:

b1 <--------> b3
^ ^
---- b2 ----

b1 and b3 points each other, and b2 still refers to each side

after dropping b1:
b2 --> b3

now, here, when I do b2.getPreviousBuild(), it loads b1a and it sets b1a.nextBuildR to b2.

b1a <-> b2 --> b3

So I claim this is a proof that the fix is incomplete, even for the problem JENKINS-22395 has discovered.

I don't think that the problem is for the dropLinks call to fail to patch up references.
The problem is that b2.getPreviousBuild() forcing b1 to point to b2, because if b2 is deleted and assumed to be
invalid, then no matter what bX this method will find you never want bX.nextBuildR to point to b2.

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/7b1b50c85a6a4bdb75f1ae5b40885afa45f96b82
Log:
JENKINS-22395 correcting the control test

As I step-executed the code, I discovered b2.getPreviousBuild() was getting invoked
between BRHF.drop(b1) and b2.dropLinks() call, in PeepholePermalink.RunListenerImpl.onDeleted()
because Run.delete() calls RunListener.fireDeleted(this).

Thus in effect the sequence of the call order was as follows:
assertEquals(1, BRHF.drop(b1));
b2.getPreviousBuild(); // happens indirectly in PeepholePermalink
b2.delete();
FreeStyleBuild b1a = b2.getPreviousBuild();

This defeats the purpose of the fix, because in this call sequence, by b2.dropLinks()
as implemented before f1430a2 will correctly fix up b1a.nextBuildR to b3.

For the test to work as intended, it is important that b2.previousBuildR.get()==null
during dropLinks(). That is what causes b2.getPreviousBuild() in the next line to go
load b1a, and sets up b1a.nextBuildR to b2, and trigger the assertion error.

Added a code to remove all RunListeners. With this change, the test now fails without
the fix in f1430a2.

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
core/src/main/java/jenkins/model/lazy/BuildReference.java
core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
http://jenkins-ci.org/commit/jenkins/b6226ad2d1a332cb661ceb5c5f5b673771118e14
Log:
[FIXED JENKINS-22395] redoing the fix in f1430a2

Based on the last few commits, I proved that the original fix in f1430a2
doesn't really address the problem.

That is, once b2 is deleted, and after sufficient garbage collection,
we can make b2.previousBuild.get() be null, and then
b2.getPreviousBuild().getNextBuild() ends up incorrectly returning b2.

In this commit, I roll back that part of f1430a2, and then fix the
problem differently.

I started thinking that the main problem we are trying to fix here
is that the deleted build object should be unreferenceable. That is,
it should behave almost as if the object has already been GCed.
The easiest way to do this is to clear a BuildReference object,
since we always use the same BuildReference object for all inbound
references.

This change allows us to clear BuildReference. Code like
b2.getPreviousBuild() will continue to try to update
b1.nextBuildR to b2, but it will only end up wiping out the field,
only to have b1.getNextBuild() recompute the correct value.

This fix makes both test cases pass in LazyBuildMixInTest.

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
http://jenkins-ci.org/commit/jenkins/aa8e0b4f0e312ba4ef9c647ccf2c093793a9bece
Log:
JENKINS-22395 more diagnostics

Just in case the previous fix didn't address the root cause of ZD-11985 (and given the time it takes for changes like this to land in LTS), I'm adding additional diagnostics that let us track how previous/next builds are getting discovered

timja commented 10 years ago

dogfood:

Integrated in jenkins_main_trunk #3507
JENKINS-22395 (Revision 53e95ee517259e1e10d7209bcbaa4314397c1f40)
JENKINS-22395 correcting the control test (Revision 7b1b50c85a6a4bdb75f1ae5b40885afa45f96b82)
[FIXED JENKINS-22395] redoing the fix in f1430a2 (Revision b6226ad2d1a332cb661ceb5c5f5b673771118e14)

Result = SUCCESS
kohsuke : 53e95ee517259e1e10d7209bcbaa4314397c1f40
Files :

kohsuke : 7b1b50c85a6a4bdb75f1ae5b40885afa45f96b82
Files :

kohsuke : b6226ad2d1a332cb661ceb5c5f5b673771118e14
Files :

timja commented 10 years ago

danielbeck:

Is this a suitable lts-candidate, or too dangerous?

timja commented 10 years ago

dogfood:

Integrated in jenkins_main_trunk #3508
JENKINS-22395 more diagnostics (Revision aa8e0b4f0e312ba4ef9c647ccf2c093793a9bece)

Result = SUCCESS
kohsuke : aa8e0b4f0e312ba4ef9c647ccf2c093793a9bece
Files :

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/b7ec8578f938864af9c8dbfb73a54d67f972d694
Log:
JENKINS-22395 Taking advantage of BuildReference.clear (just introduced in b6226ad) to simplify test by not requiring a custom build reference holder just to simulate GC.
Confirmed that dropLinksAfterGC and dropLinksAfterGC2 both fail in the expected way (b1a.nextBuild == b2) after commenting out the call to createReference().clear() in dropLinks.
(Also that they fail as expected in assertNotSame if the reference is not cleared at all.)

timja commented 10 years ago

dogfood:

Integrated in jenkins_main_trunk #3509
JENKINS-22395 Taking advantage of BuildReference.clear (just introduced in b6226ad) to simplify test by not requiring a custom build reference holder just to simulate GC. (Revision b7ec8578f938864af9c8dbfb73a54d67f972d694)

Result = SUCCESS
Jesse Glick : b7ec8578f938864af9c8dbfb73a54d67f972d694
Files :

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
cli/pom.xml
core/pom.xml
plugins/pom.xml
pom.xml
test/pom.xml
war/pom.xml
http://jenkins-ci.org/commit/jenkins/87df95af49f290cd16f6179d70bc05df411c91ac
Log:
Setting version: 1.565.1.JENKINS-22395-dropLinks

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/4910863dbdb944f249fa86c4ca583e53c736ed69
Log:
JENKINS-22395

I'm not sure if I understand how the original fix in PR #1190 (f1430a2) addresses the problem.

My understanding is of the problem is as follows:

Presumably f1430a2 addresses this somehow, but I think I can induce this situation in other ways,
which is what dropLinksAfterGC2() does.

In this test,

initial state:

b1 <> b2 <> b3

we load everyone and connect them all up

after deleting b2:

b1 <--------> b3
^ ^
---- b2 ----

b1 and b3 points each other, and b2 still refers to each side

after dropping b1:
b2 --> b3

now, here, when I do b2.getPreviousBuild(), it loads b1a and it sets b1a.nextBuildR to b2.

b1a <-> b2 --> b3

So I claim this is a proof that the fix is incomplete, even for the problem JENKINS-22395 has discovered.

I don't think that the problem is for the dropLinks call to fail to patch up references.
The problem is that b2.getPreviousBuild() forcing b1 to point to b2, because if b2 is deleted and assumed to be
invalid, then no matter what bX this method will find you never want bX.nextBuildR to point to b2.

(cherry picked from commit 53e95ee517259e1e10d7209bcbaa4314397c1f40)

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/02e9f36dac05e533f7dc28325def85d4d937fb9e
Log:
JENKINS-22395 correcting the control test

As I step-executed the code, I discovered b2.getPreviousBuild() was getting invoked
between BRHF.drop(b1) and b2.dropLinks() call, in PeepholePermalink.RunListenerImpl.onDeleted()
because Run.delete() calls RunListener.fireDeleted(this).

Thus in effect the sequence of the call order was as follows:
assertEquals(1, BRHF.drop(b1));
b2.getPreviousBuild(); // happens indirectly in PeepholePermalink
b2.delete();
FreeStyleBuild b1a = b2.getPreviousBuild();

This defeats the purpose of the fix, because in this call sequence, by b2.dropLinks()
as implemented before f1430a2 will correctly fix up b1a.nextBuildR to b3.

For the test to work as intended, it is important that b2.previousBuildR.get()==null
during dropLinks(). That is what causes b2.getPreviousBuild() in the next line to go
load b1a, and sets up b1a.nextBuildR to b2, and trigger the assertion error.

Added a code to remove all RunListeners. With this change, the test now fails without
the fix in f1430a2.

(cherry picked from commit 7b1b50c85a6a4bdb75f1ae5b40885afa45f96b82)

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
core/src/main/java/jenkins/model/lazy/BuildReference.java
core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
http://jenkins-ci.org/commit/jenkins/ad65cecc7825693af7378d1d223966bac40ef37a
Log:
[FIXED JENKINS-22395] redoing the fix in f1430a2

Based on the last few commits, I proved that the original fix in f1430a2
doesn't really address the problem.

That is, once b2 is deleted, and after sufficient garbage collection,
we can make b2.previousBuild.get() be null, and then
b2.getPreviousBuild().getNextBuild() ends up incorrectly returning b2.

In this commit, I roll back that part of f1430a2, and then fix the
problem differently.

I started thinking that the main problem we are trying to fix here
is that the deleted build object should be unreferenceable. That is,
it should behave almost as if the object has already been GCed.
The easiest way to do this is to clear a BuildReference object,
since we always use the same BuildReference object for all inbound
references.

This change allows us to clear BuildReference. Code like
b2.getPreviousBuild() will continue to try to update
b1.nextBuildR to b2, but it will only end up wiping out the field,
only to have b1.getNextBuild() recompute the correct value.

This fix makes both test cases pass in LazyBuildMixInTest.

(cherry picked from commit b6226ad2d1a332cb661ceb5c5f5b673771118e14)

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
http://jenkins-ci.org/commit/jenkins/921001d5245c4f441023e8ca16351e7ca84dabb2
Log:
JENKINS-22395 more diagnostics

Just in case the previous fix didn't address the root cause of ZD-11985 (and given the time it takes for changes like this to land in LTS), I'm adding additional diagnostics that let us track how previous/next builds are getting discovered

(cherry picked from commit aa8e0b4f0e312ba4ef9c647ccf2c093793a9bece)

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/63e806efb6c6ba38c5043e1916ff4f19d425ea0e
Log:
JENKINS-22395 Taking advantage of BuildReference.clear (just introduced in b6226ad) to simplify test by not requiring a custom build reference holder just to simulate GC.
Confirmed that dropLinksAfterGC and dropLinksAfterGC2 both fail in the expected way (b1a.nextBuild == b2) after commenting out the call to createReference().clear() in dropLinks.
(Also that they fail as expected in assertNotSame if the reference is not cleared at all.)

(cherry picked from commit b7ec8578f938864af9c8dbfb73a54d67f972d694)

Compare: https://github.com/jenkinsci/jenkins/compare/87df95af49f2^...63e806efb6c6

timja commented 10 years ago

jglick:

Testing build of https://github.com/jenkinsci/jenkins/compare/1.565.1-rc1...JENKINS-22395-dropLinks-backport uploaded: http://repo.jenkins-ci.org/public/org/jenkins-ci/main/jenkins-war/1.565.1.JENKINS-22395-dropLinks/jenkins-war-1.565.1.JENKINS-22395-dropLinks.war

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Daniel Beck
Path:
changelog.html
http://jenkins-ci.org/commit/jenkins/fdc0b5c5650b3ee849f02e3c5d94d23c12886adc
Log:
Noting #1314, #1316, #1308, JENKINS-17667, JENKINS-22395, JENKINS-18065

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Daniel Beck
Path:
changelog.html
http://jenkins-ci.org/commit/jenkins/72f37e414533c02624f9deffc85a2b50f03905e6
Log:
Remove JENKINS-22395, JENKINS-18065 from 1.572 section

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/2baa7caade6923b9a2023e64b60bb537fcc65077
Log:
JENKINS-22395

I'm not sure if I understand how the original fix in PR #1190 (f1430a2) addresses the problem.

My understanding is of the problem is as follows:

Presumably f1430a2 addresses this somehow, but I think I can induce this situation in other ways,
which is what dropLinksAfterGC2() does.

In this test,

initial state:

b1 <> b2 <> b3

we load everyone and connect them all up

after deleting b2:

b1 <--------> b3
^ ^
---- b2 ----

b1 and b3 points each other, and b2 still refers to each side

after dropping b1:
b2 --> b3

now, here, when I do b2.getPreviousBuild(), it loads b1a and it sets b1a.nextBuildR to b2.

b1a <-> b2 --> b3

So I claim this is a proof that the fix is incomplete, even for the problem JENKINS-22395 has discovered.

I don't think that the problem is for the dropLinks call to fail to patch up references.
The problem is that b2.getPreviousBuild() forcing b1 to point to b2, because if b2 is deleted and assumed to be
invalid, then no matter what bX this method will find you never want bX.nextBuildR to point to b2.

(cherry picked from commit 53e95ee517259e1e10d7209bcbaa4314397c1f40)

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/3f0388379b06b31ebec1bf17cc976001fcd80b7b
Log:
JENKINS-22395 correcting the control test

As I step-executed the code, I discovered b2.getPreviousBuild() was getting invoked
between BRHF.drop(b1) and b2.dropLinks() call, in PeepholePermalink.RunListenerImpl.onDeleted()
because Run.delete() calls RunListener.fireDeleted(this).

Thus in effect the sequence of the call order was as follows:
assertEquals(1, BRHF.drop(b1));
b2.getPreviousBuild(); // happens indirectly in PeepholePermalink
b2.delete();
FreeStyleBuild b1a = b2.getPreviousBuild();

This defeats the purpose of the fix, because in this call sequence, by b2.dropLinks()
as implemented before f1430a2 will correctly fix up b1a.nextBuildR to b3.

For the test to work as intended, it is important that b2.previousBuildR.get()==null
during dropLinks(). That is what causes b2.getPreviousBuild() in the next line to go
load b1a, and sets up b1a.nextBuildR to b2, and trigger the assertion error.

Added a code to remove all RunListeners. With this change, the test now fails without
the fix in f1430a2.

(cherry picked from commit 7b1b50c85a6a4bdb75f1ae5b40885afa45f96b82)

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
core/src/main/java/jenkins/model/lazy/BuildReference.java
core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
http://jenkins-ci.org/commit/jenkins/4bfa16143705e219705ff667c6917a2d6a6a8939
Log:
[FIXED JENKINS-22395] redoing the fix in f1430a2

Based on the last few commits, I proved that the original fix in f1430a2
doesn't really address the problem.

That is, once b2 is deleted, and after sufficient garbage collection,
we can make b2.previousBuild.get() be null, and then
b2.getPreviousBuild().getNextBuild() ends up incorrectly returning b2.

In this commit, I roll back that part of f1430a2, and then fix the
problem differently.

I started thinking that the main problem we are trying to fix here
is that the deleted build object should be unreferenceable. That is,
it should behave almost as if the object has already been GCed.
The easiest way to do this is to clear a BuildReference object,
since we always use the same BuildReference object for all inbound
references.

This change allows us to clear BuildReference. Code like
b2.getPreviousBuild() will continue to try to update
b1.nextBuildR to b2, but it will only end up wiping out the field,
only to have b1.getNextBuild() recompute the correct value.

This fix makes both test cases pass in LazyBuildMixInTest.

(cherry picked from commit b6226ad2d1a332cb661ceb5c5f5b673771118e14)

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
http://jenkins-ci.org/commit/jenkins/ca156f730b9aedac9072f7243e3f1528330ca68b
Log:
JENKINS-22395 more diagnostics

Just in case the previous fix didn't address the root cause of ZD-11985 (and given the time it takes for changes like this to land in LTS), I'm adding additional diagnostics that let us track how previous/next builds are getting discovered

(cherry picked from commit aa8e0b4f0e312ba4ef9c647ccf2c093793a9bece)

timja commented 10 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
http://jenkins-ci.org/commit/jenkins/579753403c63886cf0feff177602d79254829780
Log:
JENKINS-22395 Taking advantage of BuildReference.clear (just introduced in b6226ad) to simplify test by not requiring a custom build reference holder just to simulate GC.
Confirmed that dropLinksAfterGC and dropLinksAfterGC2 both fail in the expected way (b1a.nextBuild == b2) after commenting out the call to createReference().clear() in dropLinks.
(Also that they fail as expected in assertNotSame if the reference is not cleared at all.)

(cherry picked from commit b7ec8578f938864af9c8dbfb73a54d67f972d694)

timja commented 7 years ago

svanoort:

Reproducible again with a high enough build rate and pipeline (lots of concurrent or fast builds):

Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
INFO: sample #8540 completed: SUCCESS
Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
WARNING: failed to save sample #8541 or perform log rotation
java.io.IOException: sample #8527: /Users/svanoort/Documents/jenkins-plugins-oss/random-job-builder-plugin/work/jobs/sample/builds/8527 looks to have already been deleted; siblings: [.8527, 4568, 4569, 4570, 4571, 8528, 8529, 8530, 8531, 8532, 8533, 8534, 8535, 8536, 8537, 8538, 8539, 8540, 8541, 8542, 8543, 8544, 8545, 8546, 8547, lastFailedBuild, lastStableBuild, lastSuccessfulBuild, lastUnstableBuild, lastUnsuccessfulBuild, legacyIds]
at hudson.model.Run.delete(Run.java:1480)
at hudson.tasks.LogRotator.perform(LogRotator.java:130)
at hudson.model.Job.logRotate(Job.java:478)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:597)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$1200(WorkflowRun.java:120)
at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:865)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1081)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:400)
at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

timja commented 7 years ago

jglick:

Original bug was indeed fixed, backported, etc. so what you are seeing svanoort is something else with similar symptoms. Better to file a fresh issue and link it to this one.

timja commented 2 years ago

[Originally related to: JENKINS-19377]

timja commented 2 years ago

[Originally depends on: JENKINS-25788]