[JENKINS-9913] Not obvious why some post-build tasks enforce serial behavior even when builds are concurrent

timja commented 13 years ago

We're experiencing an issue with concurrent builds where Jenkins appears to be associating separate builds (run on different machines) such that they won't be marked as completed until all jobs are completed. For example, if we kick off 5 concurrent builds on 5 different nodes, builds 1-4 won't be marked as completed if build #5 is still running, even though builds 1-4 are finished. I've seen a report of someone experiencing this issue elsewhere:

http://groups.google.com/group/jenkinsci-users/browse_thread/thread/e477e25910266d2a?fwc=1

but a solution wasn't posted. We do not have the batch plugin or the locks and latches plugin installed. We've disabled all post-build processing and switched between different containers (Glassfish/Tomcat), but the problem persists. I couldn't find an issue logged for this other than the aforementioned posting.

Originally reported by pomvr, imported from: Not obvious why some post-build tasks enforce serial behavior even when builds are concurrent

assignee: jglick
status: Resolved
priority: Major
resolution: Fixed
resolved: 2013-08-06T22:09:21+00:00
imported: 2022/01/10

timja commented 13 years ago

tiainpa:

We have a similar issue: The build which should finish first gets stuck due to a bug in our test framework (last line in build console is 'Recording test results'), and all the concurrent builds running at the same time get stuck on the same phase, finishing only when the first build is forcefully killed.

This might have something to do with the JUnit test result report publishing?

timja commented 12 years ago

ahawtho:

We see the same thing - we have a test job that can be triggered with a few different upstream jobs to specify different sets of tests to run with different test parameters. We use the parameterized trigger plugin to kick off multiple concurrent instances of the downstream job. One of our "trigger" jobs specifies a set of long running performance tests (about 10 hours). Another is a regression test suite, which only takes a few minutes. We've seen instances at least twice where one of the regression tests will start after the long-running performance tests. The short-lived regression test job will remain running until the performance test completes, which can be ten hours later. The console log shows the regression test job as completed, having archived and recorded fingerprints, but it's still marked as running. Furthermore, an attempt to stop the regression test job caused the long-running performance test job to stop. Several regression tests would normally be queued during this time. This seems to have started with a recent upgrade to 1.448. I'm not sure what we were running immediately before that.

timja commented 12 years ago

dannystaple:

We were hoping that disabling the JUnit post-build task would stop this, but it didn't. This has been killing us for some test jobs where we use a concurrent job to massively reduce duplication of configuration (and all the headaches that come with that).

timja commented 12 years ago

lneff:

I also saw this this evening. I use the throttle concurrent builds plugin and had four instances of the same parameterized job running. The three that finished weren't released and hogged their executors until the fourth and last finished. Post-build actions for all were: archive artifacts, Groovy postbuild script, fingerprint a file, set build description, build other projects (extended) where the trigger was not satisfied, and send email. No JUnit tests.
Jenkins 1.428, running on Windows 7, and the jobs in question were all running on the master.
Throttle Concurrent Builds plugin v. 1.6

timja commented 12 years ago

xavier_leprevost:

I have the same problem.
I am using a parameterized trigger where the parameter is an SVN branch.
I start:

job build 1 from branch1
job build 2 from branch2
job build 3 from branch3
job build 4 from branch4

They all run fine, but if job 2, 3, or 4 finishes before job 1, it is only really marked as finished once job 1 has finished.
The problem is that the node is then not available to start a new job, and I am left waiting for the result of the build.

timja commented 12 years ago

svs57:

Jenkins ver. 1.458
Still have the problem: concurrent builds are not marked as finished until the last build has ended.

timja commented 12 years ago

svs57:

Any progress on resolving this issue?

timja commented 12 years ago

lmattox:

We had a very similar issue, and found the problem to be in the email part of core somewhere.

When we disabled all plugins that extended email, and turned off build notification emails, our builds no longer block other concurrent builds that share the same name.

The workaround isn't ideal, as we then had to go add email code to several build scripts (that we are trying to simplify and remove maintenance from).

timja commented 12 years ago

svs57:

We have "Editable Email Notification".
It is difficult for us to use shell scripts for notifications.
It would be nice to have this bug fixed.

timja commented 12 years ago

svs57:

Any progress? Latest Jenkins has this bug.

timja commented 12 years ago

svs57:

Does anybody read this ticket?
This bug is slowing down our work.
Help!

timja commented 12 years ago

svs57:

Created: 08/Jun/11
Today: 08/Jun/12
Still no progress...

timja commented 12 years ago

soundrabbit:

This is a very unfortunate bug. I observed today that Jenkins finishes builds in the exact same order as they were started. So if you have builds #1, #2, #3 running, Jenkins will always let #1 finish first, then #2, then #3. It doesn't matter which of the builds finishes its actions first.

I wonder if there's an inherent limitation in Jenkins that a build #n cannot be running while #(n+x) has already completed.
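
The ordering described above is what one would expect if each build had to wait for its predecessor to complete before it could be marked complete itself. A toy model of that rule follows. It is only a sketch under that assumption, not Jenkins code: the class and method names are hypothetical, and times are arbitrary units.

```java
import java.util.Arrays;

class CheckpointModel {
    /**
     * workEnd[i] is the time at which build i (in start order) finishes its own steps.
     * Returns the time at which each build is marked complete.
     * With waitForPrevious set, a build cannot complete before its predecessor.
     */
    static long[] completionTimes(long[] workEnd, boolean waitForPrevious) {
        long[] done = new long[workEnd.length];
        for (int i = 0; i < workEnd.length; i++) {
            done[i] = workEnd[i];
            if (waitForPrevious && i > 0) {
                // Blocked until the previous build is complete.
                done[i] = Math.max(workEnd[i], done[i - 1]);
            }
        }
        return done;
    }

    public static void main(String[] args) {
        long[] workEnd = {600, 5, 7}; // build #1 is slow, #2 and #3 are quick
        System.out.println(Arrays.toString(completionTimes(workEnd, true)));  // [600, 600, 600]
        System.out.println(Arrays.toString(completionTimes(workEnd, false))); // [600, 5, 7]
    }
}
```

In this model the quick builds inherit the completion time of the slow one, which matches the executors staying busy long after the console output has ended.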

timja commented 12 years ago

soundrabbit:

A short test confirms what Logan Mattox observed: it's due to the e-mail notifications (both with the built-in "E-Mail Notification" post-build action, and the email-ext plugin).

Some other post-build actions don't seem to trigger the problem: I tried launching a downstream job, and this allows build #2 to complete before #1.

However, the "Jenkins Text Finder" plugin ALSO triggers the problem. Maybe it's related to plugins that read the console output?

timja commented 12 years ago

sma:

Any idea when this will be fixed? This one's a blocker for me.

timja commented 12 years ago

xavier_leprevost:

Is there any chance of having this fixed soon?

timja commented 12 years ago

svs57:

I attached two screenshots to explain again what is happening.
Jobs 537, 538, 539 have finished, as you can see on the second screenshot.
But the executors are still busy, as you can see on the first one.

timja commented 12 years ago

svs57:

Created: 08/Jun/11 9:43 PM

date
Mon Oct 22 16:15:51 MSK 2012

Resolution: Unresolved

timja commented 12 years ago

mlewicka:

A very unfortunate bug indeed, especially if one of the batch of jobs getting launched gets stuck for whatever reason (remote execution over ssh, etc.). The stuck job will cause the whole batch to get stuck, and the possibility of this happening is exactly why we're using concurrent executions in the first place...

timja commented 11 years ago

cbo:

Bitten by this issue also

timja commented 11 years ago

sma:

Do we have a public target date for this bug to be fixed?
It's getting to be kind of a blocker for me.

timja commented 11 years ago

svs57:

I am afraid that no one is working on investigating this bug.

timja commented 11 years ago

xavier_leprevost:

It would be nice to know if someone will investigate this problem.

timja commented 11 years ago

inbar_rose:

Same problem here. Total blocker. Task A starts, then task B starts. Task B reaches the 'Recording test results' stage and hangs until task A finishes. After testing simple timed builds with many plugins/options enabled and disabled, I concluded that JUnit is the problem. I found another issue like this here: https://issues.jenkins-ci.org/browse/JENKINS-10234

timja commented 11 years ago

kutzi:

Similar to ~~JENKINS-10234~~ - for some of the causes, even the same issue.

timja commented 11 years ago

kutzi:

There are several causes described here. Some are JUnit archiving, which is specifically handled in ~~JENKINS-10234~~. Some seem to be related to email notification.
I dare say that in almost all cases this is a feature and not a bug, as the logic for email notification and JUnit result archiving needs to wait until the previous builds are finished.

timja commented 11 years ago

svs57:

It is not a feature. All concurrent builds execute in separate workspaces and must archive their artifacts independently. There is no need to wait for other builds if the current one has finished.

timja commented 11 years ago

kutzi:

Sergey, do you mean 'artifact archiving' or 'JUnit result archiving'? In the former case, you would probably be right, but I've seen no comment here that artifact archiving is blocking, too.
In the case of JUnit: yes, it MUST block to calculate the diff (regressions et al.) against the previous test results. So it is a feature.

timja commented 11 years ago

svs57:

I mean all results of a concurrent job.
For example, I have a Jenkins job that executes E2E tests on different QA servers. I start 10 jobs. One job runs for 5 hours, but the others take 1 hour. I then can't see the results of the finished jobs and have to wait 4 hours to see them. We need a way to prevent this.

timja commented 11 years ago

kutzi:

For that you can e.g. use the xunit plugin as mentioned in ~~JENKINS-10234~~.

Generally, don't use any build steps which require blocking behaviour. Yes, very unfortunately it's not visible for end users which build steps do and which don't.

timja commented 11 years ago

mantengamoslacalma:

How is this a feature?

I do not use JUnit. I have a gerrit plugin that triggers a (parameterized) build+test job (a build script, and another script that runs some tests, all in a small shell snippet) whenever code is pushed to the repository. When the job is completed (successfully or not), an email notification needs to go out to the authors. It doesn't get much simpler than this.

Commits are independent from each other, and as a consequence, so are the build jobs. Why would you want the email plugin to sit in some checkpoint if the job is done?

Is there a workaround for this? This causes horrendous problems in our setup: aside from the resource waste (many nodes spend long periods of time waiting for slower builds), bugs in the test code will cause the whole cluster to deadlock.

timja commented 11 years ago

svs57:

Does anybody know how to solve this problem? Any patch, workaround...?
I can't remove the post-build steps from the job.

timja commented 11 years ago

teh444:

Happening in 1.515

timja commented 11 years ago

teh444:

Workaround is to remove all post-Build-Actions and have it trigger a project that will do the dirty work.
To ensure it runs on the same machine, use the "NodeLabel Parameter Plugin", and Add a "NodeLabel Parameter" with the textbox "name" = NODE_NAME and "node" = ${ENV,var="NODE_NAME"}

timja commented 11 years ago

svs57:

>Workaround is to remove all post-Build-Actions
That is not a workaround.

timja commented 11 years ago

teh444:

Sergey, I may not have made myself clear.

For example:
Project 'A' is the project you wish to run in parallel.

Project 'A' has a bunch of really long shell/bat commands and requires some sort of post-build action that is causing this defect.
That is: if Project 'A' is run in parallel and contains post-build actions, a finished build will not be released until the other concurrent builds complete.

The solution: Create a second project 'B' that contains all your post-build actions from 'A'.
Finally, remove all the post-build actions in 'A' (as they are all duplicated in project 'B') and have 'A' trigger 'B' as the final step.

Additional notes:
You will need to use the "NodeLabel Parameter Plugin" to ensure that this project is run on the same machine.
You may also need to give it the ${WORKSPACE} parameter (and any others) if your post-build actions need to manipulate the artefacts generated from Project 'A'.

timja commented 11 years ago

svs57:

Thank you, 4 4. Now it's clear.
This will cause new problems, because my tests execute a notification job in a post-build step. This job analyzes what was upstream and sends mail.
I will have to change the logic in the notify job.

timja commented 11 years ago

jglick:

Not a bug per se, but Jenkins needs to make sure that post-build actions which require the previous build to be complete (a) are documented to do so, e.g. in inline help; (b) print a helpful message to the build log when waiting for a previous build. And of course wherever feasible, offer an option to not block in this way, or to perform the processing dependent on the previous build asynchronously, e.g. in a RunListener.

timja commented 11 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
core/src/main/java/hudson/model/CheckPoint.java
core/src/main/java/hudson/model/Run.java
core/src/main/java/hudson/tasks/BuildStepMonitor.java
core/src/main/java/hudson/tasks/junit/JUnitResultArchiver.java
core/src/main/resources/hudson/model/Messages.properties
http://jenkins-ci.org/commit/jenkins/eec307511c80112274d27f2a840d9f96cda784d3
Log:
[FIXED JENKINS-9913] At least print a diagnostic to the build log if we are waiting on a checkpoint.

Compare: https://github.com/jenkinsci/jenkins/compare/e5f5402cc2fd...eec307511c80
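
For readers unfamiliar with what "print a diagnostic if we are waiting on a checkpoint" amounts to, here is a minimal self-contained sketch of the idea. It does not use the Jenkins API: `Build`, `waitForPrevious`, and the message wording are all hypothetical stand-ins, and the real change lives in the files listed above.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

class CheckpointWaitDemo {
    /** Hypothetical stand-in for a build with a single "completed" checkpoint. */
    static final class Build {
        final int number;
        final CountDownLatch completed = new CountDownLatch(1);
        final List<String> log = new CopyOnWriteArrayList<>();

        Build(int number) {
            this.number = number;
        }
    }

    /** Blocks until the previous build completes, logging the reason if it has to wait. */
    static void waitForPrevious(Build current, Build previous) throws InterruptedException {
        if (previous == null) {
            return; // first build: nothing to wait for
        }
        if (previous.completed.getCount() > 0) {
            // Illustrative wording only, not the actual Jenkins message.
            current.log.add("#" + current.number + " is waiting for a checkpoint on #" + previous.number);
        }
        previous.completed.await();
    }

    /** Runs build #2 while #1 is still running and returns the log of #2. */
    static List<String> demo() {
        Build first = new Build(1);
        Build second = new Build(2);
        Thread worker = new Thread(() -> {
            try {
                waitForPrevious(second, first);
                second.log.add("#2 finished");
                second.completed.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        try {
            while (second.log.isEmpty()) {
                Thread.sleep(1); // wait until #2 has reported that it is blocked
            }
            first.completed.countDown(); // build #1 completes, releasing #2
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return second.log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The behavior is unchanged by the message: build #2 still blocks. The only difference is that its log now says why.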

timja commented 11 years ago

dogfood:

Integrated in jenkins_main_trunk #2789
[FIXED JENKINS-9913] At least print a diagnostic to the build log if we are waiting on a checkpoint. (Revision eec307511c80112274d27f2a840d9f96cda784d3)

Result = SUCCESS
Jesse Glick : eec307511c80112274d27f2a840d9f96cda784d3
Files :

core/src/main/java/hudson/tasks/junit/JUnitResultArchiver.java
core/src/main/java/hudson/tasks/BuildStepMonitor.java
core/src/main/resources/hudson/model/Messages.properties
core/src/main/java/hudson/model/CheckPoint.java
core/src/main/java/hudson/model/Run.java

timja commented 11 years ago

surya548:

I am not sure how printing to logs resolves this. Can we reopen this?
Also, can we make Run.waitForCheckpoint an instance method instead of a static method?

timja commented 11 years ago

jglick:

@surya548:

> I am not sure how printing to logs resolves this.

As noted above, it just makes it clear why a given plugin is blocking the way it is. If there is a particular plugin which is blocking which you believe should not block, that should be filed in a separate issue.

> can we make Run.waitForCheckpoint an instance method instead of a static method

I see no reason why that would be necessary, but this is probably not the place to discuss it anyway.

timja commented 11 years ago

surya548:

> If there is a particular plugin which is blocking which you believe should not block

I don't want any plugin to block on checkpoint because it doesn't make sense to compare two arbitrary branches in the repo.

> that should be filed in a separate issue.

I have so many plugins that do this. Here are just a few examples: junitarchiver, tap, checkstyle, findbugs, cobertura

I would like a global disable checkpointing option.

timja commented 11 years ago

jglick:

> it doesn't make sense to compare two arbitrary branches in the repo

True, but this needs to be solved at a higher level in Jenkins, by creating a separate AbstractProject for each branch, so that each has a linear build history. There is some work TBA that accomplishes this.

> I would like a global disable checkpointing option.

Sorry, this needs to be implemented on a per-plugin basis, since plugins are not necessarily written to behave gracefully when their checkpoint expectations are unmet.

timja commented 11 years ago

surya548:

> so that each has a linear build history.
We don't care about any of this linear history and checkpointing stuff if builds hang for days, deadlocked on checkpoints.
There should be a way to turn this feature off. I think most people use git with lots of branches anyways these days and none of this is relevant/useful.

>plugins are not necessarily written to behave gracefully when their checkpoint expectations are unmet

if we just noop Run.waitForCheckpoint based on some global setting wouldn't plugins just assume there is nothing to compare against and behave gracefully?

timja commented 11 years ago

jglick:

> I think most people use git with lots of branches

Exactly why separate branch projects are needed: so that each branch does not need to be marked concurrent-capable, avoiding wasted builds and also avoiding waits on checkpoints as a corollary.

> if we just noop Run.waitForCheckpoint based on some global setting wouldn't plugins just assume there is nothing to compare against and behave gracefully?

Depends on the plugin. Some may behave fine, but others may behave erratically because they are still looking up the “previous build” (which they assume to be complete based on their stated checkpoint semantics). Disabling the checkpoint is incompatible.

timja commented 11 years ago

surya548:

>Exactly why separate branch projects are needed.

I don't understand what you suggest that I do now. Our Jenkins Instance is basically unusable because of these useless checkpoints.
I am not sure if it is practical for me to open bugs against 9 plugins (that I know of, after looking at the source code for all the plugins we use) and ask them to remove checkpoints.

Why can't we get rid of (or at least make optional) an obsolete feature designed for SVN/CVS?
As others have said in the comments above, we would rather have a stable Jenkins than some optional feature that no one cares about.

timja commented 11 years ago

jglick:

> I am not sure if it is practical for me to open bugs against 9 plugins

Why not?

> feature designed for SVN/CVS

Applies equally to Git as to SVN; both support branches, and for any such SCM it is preferable to have one AbstractProject per branch so that each has a sensible linear history (as previously mentioned there is ongoing work in this regard). If you do that, and suppress parallel builds within a branch (i.e. may have concurrent builds only of distinct branches), then there is no further issue.

Now even within a branch it is sometimes desirable to permit parallel builds, when there is a premium on quick feedback over build-to-build comparisons; in that case plugins should be configurable to do no comparison to the previous build (i.e. you are intentionally waiving your right to information such as whether a given test case failure was a regression), and should not use checkpoints either.

> some optional feature

The original behavior of build steps was to be unconditionally serialized. For compatibility, that default must remain. New plugins should consider their actual needs and override the relevant method to specify it (optionally introducing finer-grained checkpoints).
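
The compatibility argument can be illustrated with a small self-contained model. This is a sketch only: `Monitor`, `Step`, and the two step classes are hypothetical names, it models just two levels (no finer-grained checkpoints), and the assumption that the "relevant method" corresponds to something like a per-step monitor setting in `BuildStepMonitor` is not spelled out in this comment.

```java
class MonitorDefaultDemo {
    /** Hypothetical stand-in for the level of serialization a step requests. */
    enum Monitor { NONE, BUILD }

    /** Hypothetical stand-in for a build step. */
    interface Step {
        /** Compatibility default: serialize on the whole previous build. */
        default Monitor requiredMonitor() {
            return Monitor.BUILD;
        }
    }

    /** A step written before the option existed; it inherits the serialized default. */
    static final class LegacyNotifier implements Step { }

    /** A step that compares nothing with the previous build and declares that. */
    static final class IndependentArchiver implements Step {
        @Override
        public Monitor requiredMonitor() {
            return Monitor.NONE;
        }
    }

    /** True if the step has to wait while the previous build is still running. */
    static boolean mustWait(Step step, boolean previousBuildRunning) {
        return previousBuildRunning && step.requiredMonitor() != Monitor.NONE;
    }

    public static void main(String[] args) {
        System.out.println(mustWait(new LegacyNotifier(), true));      // true
        System.out.println(mustWait(new IndependentArchiver(), true)); // false
    }
}
```

Under this model, flipping the default to `NONE` would silently change the behavior of every step that never overrode the method, which is the incompatibility described above.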

timja commented 11 years ago

surya548:

>Applies equally to Git as to SVN; both support branches.

There is a big, big difference: SVN branches are long-lived, and most Git branches are typically short-lived (pull requests, for example). And having a linear history doesn't make any sense for a branch with only a few commits.
The same goes for calculating changesets, culprits, etc. None of those features are relevant for short-lived branches.

timja commented 11 years ago

jglick:

Changelogs and the like are relevant if there is more than one commit in the branch, which is common in Git pull requests. And SVN branches need not be long lived since it is just as cheap to create and dispose of them as it is in Git. (CVS is a different matter of course.) Whether you care about linear build history depends on your workflow, not the SCM per se.
