timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-45553] Parallel pipeline execution scales poorly #9247

Closed timja closed 7 years ago

timja commented 7 years ago

Execution of parallel blocks scales poorly for values of N > 100.  With ~50 nodes (each with 4 executors, for a total of ~200 slots), the following pipeline job takes extraordinarily long to execute:

 

def stepsForParallel = [:]
for (int i = 0; i < Integer.valueOf(params.SUB_JOBS); i++) {
  def s = "subjob_${i}" 
  stepsForParallel[s] = {
    node("darwin") {
      echo "hello"
    }
  }
}
parallel stepsForParallel

 

SUB_JOBS   Time (sec)
---------------------
 100 10
 200 40
 300 96
 400214
 500392
 600660
 700960
 800       1500
 900       2220
1000       gave up...

At no point does the underlying system become taxed (CPU utilization is very low, as this is a very beefy system – 28 cores, 128GB RAM, SSDs)

CPU and Thread CPU Time Sampling (via VisualVM) are attached for reference.

 

 

 

 

 


Originally reported by tskrainar, imported from: Parallel pipeline execution scales poorly
  • assignee: jglick
  • status: Closed
  • priority: Critical
  • resolution: Fixed
  • resolved: 2017-08-26T22:19:44+00:00
  • imported: 2022/01/10
timja commented 7 years ago

jglick:

Attach a textual thread dump for analysis please; the screenshots do not tell me anything.

timja commented 7 years ago

jglick:

And what is “Pipeline 2.5” supposed to mean? There are a couple dozen plugins. You need to specify versions.

timja commented 7 years ago

tskrainar:

Hi Jesse.  Tarball attached containing:

 

nodes/master/pipeline-thread-dump.txt

nodes/master/pipeline-timings.txt

nodes/master/thread-dump.txt

plugins/active.txt

 

'submit-tester' is the name of the job (Pipeline code for which is pasted in the original submission comment).

Dump was generated while the build was in flight (about 3 minutes into the run, or roughly halfway (SUB_JOBS==500)), which was triggered immediately after a fresh resart of Jenkins with all other jobs disabled (i.e. nothing else running in the system).

 

JENKINS-45553_20170725.tgz

 

 

 

timja commented 7 years ago

jglick:

nodes/master/pipeline-timings.txt shows most of the time being spent in run but the nodes/master/thread-dump.txt happens to have captured something in saveProgram which is not informative.

timja commented 7 years ago

jglick:

Reproducible at n=1000, with one mock-slave agent with 200 executors. Seems to be a grab-bag of issues. Well-known copyLogs overhead (JENKINS-38381 off the top of my head); related LogActionImpl.isRunning; some step or another still using Guice; core Queue management overhead; HMACConfidentialKey.createMac being too slow for repeated use from ConsoleNote.encodeTo; etc.

timja commented 7 years ago

jglick:

Also Launcher.kill from PlaceholderTask.finish can be slow.

timja commented 7 years ago

jglick:

Also Queue.maintain should not call String taskDisplayName = p.task.getFullDisplayName(); unless logging is enabled!

timja commented 7 years ago

jglick:

Filed PRs for the low-hanging fruit.

timja commented 7 years ago

manschwetus:

Just to note, this is a real PITA, currently our master runs fulltime 100% CPU on 4 cores and consumes close to 4,5G RAM as the scans keep the threads busy, blocking background GC.
It runs 4 builds with 4 active parallel branches each so 16 active branches in sum.

timja commented 7 years ago

scm_issue_link:

Code changed in jenkins
User: Oleg Nenashev
Path:
core/src/main/java/hudson/model/Queue.java
http://jenkins-ci.org/commit/jenkins/378199ff846b2b5b19b56a40b43289127f74f9cc
Log:
Merge pull request #2947 from jglick/Queue-opt

JENKINS-45553 Avoid calling Task.getFullDisplayName unless and until we need to

Compare: https://github.com/jenkinsci/jenkins/compare/4df5895b7eb2...378199ff846b

timja commented 7 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
core/src/main/java/jenkins/security/HMACConfidentialKey.java
http://jenkins-ci.org/commit/jenkins/564ca0b6d7f50c3b1164f3f1de8dad5c8c89a8cb
Log:
JENKINS-45553 - Cache the Mac so we do not need to constantly recreate it (#2948)

JENKINS-45553 - Cache the Mac so we do not need to constantly recreate it

timja commented 7 years ago

florian_meser:

It seems there are also connections to stages. Something at the end of an stage seems to kind of "reset" this exponential duration. I.e. if you would split SUB_JOBS=800 into four stages each =200 the duration would be far less.

See also JENKINS-45876

timja commented 7 years ago

manschwetus:

I could probably support this observation, as we reuse the parallel worker spawning subparallels construct for each test stage, I thought till now, leaving the enclosing parallel resets behavior.

If interested/relevant I could try to elaborate a bit more about our scheduling construct.

Also it looks like rolling back Supporting APIs to 2.13 reverts the false improvement in JENKINS-40934

timja commented 7 years ago

jglick:

https://github.com/jenkinsci/workflow-support-plugin/pull/38 pretty much fixed the problem for me, so no need to waste time testing the current code.

timja commented 7 years ago

manschwetus:

I get this messages in the log, do they matter?

Aug 01, 2017 1:32:47 PM SEVERE io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
nestedBranches size: 520 not equal to parallelBranchEndNodes: 521
Aug 01, 2017 1:32:47 PM SEVERE io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
nestedBranches size: 1 not equal to parallelBranchEndNodes: 2
timja commented 7 years ago

jglick:

Sounds like a blueocean-plugin bug, not necessarily related.

timja commented 7 years ago

svanoort:

manschwetus were they throwing that error before applying this change?  If not, that would actually make me nervous  jglick - the APIs that BlueOcean class depends on are pretty tightly coupled to the FlowNode APIs, and one should never see more parallel branch starts detected than ends.

timja commented 7 years ago

manschwetus:

Afaik yes, as I haven't installed anything beside BlueOcean since roughly a week, I just tried to downgrade pipeline supporting, but this seems to not have completed/worked as I expected, maybe I have to restart Jenkins to have the downgrade effective.

timja commented 7 years ago

manschwetus:

jglick, in which component and version, the will be in? As there seems to be an issue with 2.72 and you state your self, to not waste time with the first changes.

timja commented 7 years ago

jglick:

Certain fixes are indeed in core, but the most critical is in workflow-api + workflow-support; versions TBD.

timja commented 7 years ago

jglick:

Still have a hotspot related to JENKINS-26132

"AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#731]" #1011 daemon prio=5 os_prio=0 tid=0x00007f4dd0010000 nid=0x1a36 runnable [0x00007f4d6bafa000]
   java.lang.Thread.State: RUNNABLE
    at java.util.AbstractCollection.addAll(AbstractCollection.java:343)
    at java.util.LinkedHashSet.(LinkedHashSet.java:169)
    at org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.setup(AbstractFlowScanner.java:132)
    at org.jenkinsci.plugins.workflow.graphanalysis.LinearScanner.setup(LinearScanner.java:187)
    at org.jenkinsci.plugins.workflow.graphanalysis.LinearBlockHoppingScanner.setup(LinearBlockHoppingScanner.java:68)
    at org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.setup(AbstractFlowScanner.java:172)
    at org.jenkinsci.plugins.workflow.graphanalysis.FlowScanningUtils.fetchEnclosingBlocks(FlowScanningUtils.java:122)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.computeEnclosingLabel(ExecutorStepExecution.java:473)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getEnclosingLabel(ExecutorStepExecution.java:465)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getDisplayName(ExecutorStepExecution.java:422)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getFullDisplayName(ExecutorStepExecution.java:438)
    at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115)
    at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105)
    at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157)
    at hudson.model.Queue.maintain(Queue.java:1571)
    at …

which would best be solved by implementing JENKINS-36547.

timja commented 7 years ago

jglick:

The code slated to be deleted in JENKINS-38381 also remains a hotspot, as previously noted:

"Running CpsFlowExecution[Owner[…]]" …
   java.lang.Thread.State: RUNNABLE
    at java.util.HashMap.putVal(HashMap.java:628)
    at java.util.HashMap.put(HashMap.java:611)
    at java.util.HashSet.add(HashSet.java:219)
    at java.util.AbstractCollection.addAll(AbstractCollection.java:344)
    at java.util.LinkedHashSet.(LinkedHashSet.java:169)
    at org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.setup(AbstractFlowScanner.java:132)
    at org.jenkinsci.plugins.workflow.graphanalysis.LinearScanner.setup(LinearScanner.java:187)
    at org.jenkinsci.plugins.workflow.graphanalysis.LinearBlockHoppingScanner.setup(LinearBlockHoppingScanner.java:68)
    at org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.findFirstMatch(AbstractFlowScanner.java:251)
    at org.jenkinsci.plugins.workflow.graphanalysis.LinearScanner.findFirstMatch(LinearScanner.java:135)
    at org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.findFirstMatch(AbstractFlowScanner.java:274)
    at org.jenkinsci.plugins.workflow.graph.FlowNode.isActive(FlowNode.java:136)
    at org.jenkinsci.plugins.workflow.support.actions.LogActionImpl.getLogText(LogActionImpl.java:123)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.copyLogs(WorkflowRun.java:485)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$600(WorkflowRun.java:134)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:959)
    - locked <0x00000006ca550158> (a java.util.concurrent.atomic.AtomicBoolean)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1221)
    at …
timja commented 7 years ago

svanoort:

jglick RE the first hotspot, what about the suggestion I made last year to reduce the cost of this call? https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/2#r59458780

I'd anticipated nearly a year ago that this would potentially be a performance problem, and think now might be a reasonable time to implement the suggestions there.

RE the second hotspot – what about reusing a single instance of a LinearScanner for logging within a single flow, maybe caching in a transient field?  This reduces some of the setup work, and may play nicer with the CPU cache since the same object is getting hit often from a single thread.

I'd designed the scanners to be reusable for that sort of use case, so would be nice to see if it pays off like expected.

timja commented 7 years ago

jglick:

 what about the suggestion I made last year to reduce the cost of this call

getEnclosingLabel already does some caching. But queue maintenance calls are a different matter.

This reduces some of the setup work

Not sure how—you would still be calling setup with each head. Anyway feel free to play with this kind of micro-optimization later; the patches as they stand make a clear improvement.

timja commented 7 years ago

manschwetus:

Is there anything blocking progress in this matter?

As this is a regression, prior to latest supporting api change our pipeline has finished in an acceptable time frame (~1 day, now it takes ~6 days!!!)

I already tried to rollback supporting-apis, but the dependencies are too tight, it will require rolling back all things around pipeline one by one.

timja commented 7 years ago

florian_meser:

Same question. What is blocking the progress? This fix is crucial for us atm.

Can a statement be made when this issue will be released (at least approx.)? Last release was end of march.

timja commented 7 years ago

svanoort:

florian_meser The fix that was done earlier had a significant bug in it, and the attempt to work around to that caused another performance regression by undoing a fix we did earlier to prevent abusive performance if you have many nodes followed by a lot of parallel branches. 

I believe jglick has arrived at a hybrid solution that will satisfy the requirements in all cases but requires some additional work (not sure on ETA, maybe he can speak to it – might end up falling to someone else).  However this solution should generate far more significant performance improvements for all. 

timja commented 7 years ago

manschwetus:

As this seems to take much longer as expected, is it possible to respin supporting apis 2.14 without JENKINS-40934, as this change has made things seriously worse?

I'm currently checking the option of going back to supporting APIs 2.13, but this means rolling back ppl groovy to 2.29 and also all dependents on ppl groovy >2.29, only fiddling out what to roll back the version to go back to is a serious issue.

Found another solution, as the change making things worse is suspected to be https://github.com/jenkinsci/workflow-support-plugin/commit/4b48f17fc8525a9d2cfb63c763d7ec5070192712 , I cloned supporting apis 2.14 and reverted this merge and build the result as 2.14.1-sth and installed it, now I can verify that reverting this change solves the regression. Of course really fixing this is the best thing to do, but when this needs more time, probably revert this change in a patch release till the real fix is ready.

timja commented 7 years ago

jglick:

Comment about Task.fullDisplayName

timja commented 7 years ago

jglick:

manschwetus please be patient, fixes are under development / review.

timja commented 7 years ago

svanoort:

It looks generally good now, manschwetus – you will probably be able to run the SNAPSHOT build off Jesse's PR branch and see a big improvement, until we've finished dotting the last few i's and crossing t's to cut the ultimate release.

timja commented 7 years ago

jglick:

The pair of PRs has been approved; awaiting abayer or svanoort to cut releases.

timja commented 7 years ago

abayer:

I'm gonna defer to svanoort for those. =)

timja commented 7 years ago

svanoort:

abayer No, that's fine – feel free to cut a release with these.

timja commented 7 years ago

scm_issue_link:

Code changed in jenkins
User: Jesse Glick
Path:
pom.xml
src/main/java/org/jenkinsci/plugins/workflow/support/actions/LogActionImpl.java
http://jenkins-ci.org/commit/workflow-support-plugin/d78f301c4630ad3d9cbd408494ed1ad36e97b132
Log:
JENKINS-38223 Using FlowNode.isActive to eliminate the main overhead in JENKINS-45553.

timja commented 7 years ago

scm_issue_link:

Code changed in jenkins
User: Sam Van Oort
Path:
pom.xml
src/main/java/org/jenkinsci/plugins/workflow/support/actions/LogActionImpl.java
src/main/java/org/jenkinsci/plugins/workflow/support/visualization/table/FlowGraphTable.java
src/test/java/org/jenkinsci/plugins/workflow/support/actions/LogActionImplTest.java
http://jenkins-ci.org/commit/workflow-support-plugin/236f06ca40fc019ca4da2cadb4d0804971faa9db
Log:
Merge pull request #38 from jglick/FlowNode.isActive-JENKINS-38223

JENKINS-38223 Using FlowNode.isActive to improve JENKINS-45553

Compare: https://github.com/jenkinsci/workflow-support-plugin/compare/01f9538af15c...236f06ca40fc

timja commented 7 years ago

svanoort:

manschwetus I'd recommend consuming the SNAPSHOT build of this, since it should fully resolve your issues

timja commented 7 years ago

jglick:

I think all planned fixes have been merged, though some remain unreleased.

timja commented 7 years ago

olivergondza:

All core fixes went into 2.72 so it will be part of LTS line 2.73 naturally.

timja commented 7 years ago

manschwetus:

I'm still wating for a new release of supporting-apis.

timja commented 7 years ago

florian_meser:

I'm a little confused. A "clear improvement" was mentioned before. Which of the changes will reveal this "clear improvement" and are they released yet?

The core changes which are already released since 2.72? I've tried the new LTS 2.73.1 and there was no improvement so I guess there are changes in plugins that are not released yet.

Would be nice to somehow get an overview of the status like "which components are affected",  "what's released so far" and "what still needs to be released".

timja commented 7 years ago

scm_issue_link:

Code changed in jenkins
User: Sam Van Oort
Path:
pom.xml
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecution.java
src/main/java/org/jenkinsci/plugins/workflow/graph/BlockStartNode.java
src/main/java/org/jenkinsci/plugins/workflow/graph/FlowNode.java
src/main/java/org/jenkinsci/plugins/workflow/graph/GraphLookupView.java
src/main/java/org/jenkinsci/plugins/workflow/graph/StandardGraphLookupView.java
src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScanningUtils.java
src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/NodeStepNamePredicate.java
src/test/java/org/jenkinsci/plugins/workflow/graph/FlowNodeTest.java
src/test/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScannerTest.java
http://jenkins-ci.org/commit/workflow-api-plugin/4ba6b42b651d9a5853a32681e9582f9649ad8fa2
Log:
Merge pull request #50 from svanoort/jenkins-27395-block-structure-lookup

JENKINS-37573 / JENKINS-45553 Provide a fast view of block structures in the flow graph

Compare: https://github.com/jenkinsci/workflow-api-plugin/compare/63e8ad0c2715...4ba6b42b651d

timja commented 7 years ago

svanoort:

florian_meser You wait for the release coming over this weekend, or build and install the master branch of the plugins https://github.com/jenkinsci/workflow-api-plugin/commits/master  and https://github.com/jenkinsci/workflow-support-plugin – we had an additional enhancement that was lumped along with it (plus some further optimizations) and needed to be integrated.  It's ready now though and should be live shortly. 

timja commented 7 years ago

scm_issue_link:

Code changed in jenkins
User: Sam Van Oort
Path:
pom.xml
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecution.java
src/main/java/org/jenkinsci/plugins/workflow/graph/BlockStartNode.java
src/main/java/org/jenkinsci/plugins/workflow/graph/FlowNode.java
src/main/java/org/jenkinsci/plugins/workflow/graph/GraphLookupView.java
src/main/java/org/jenkinsci/plugins/workflow/graph/StandardGraphLookupView.java
src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScanningUtils.java
src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/NodeStepNamePredicate.java
src/test/java/org/jenkinsci/plugins/workflow/graph/FlowNodeTest.java
src/test/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScannerTest.java
http://jenkins-ci.org/commit/workflow-api-plugin/70a05dc4d4a330b797ce370fd587fccfe74355d9
Log:
Revert "Revert "JENKINS-37573 / JENKINS-45553 Provide a fast view of block structures in the flow graph""

This reverts commit 88ffdfc69c43bd4dde21a6578b5ac466999b4fd4.

timja commented 7 years ago

svanoort:

Benchmarks at ~10x as fast, by the way, more likely quite a bit more (hitting some limits on the test system). 

timja commented 7 years ago

svanoort:

Resolved by core + changes released in workflow-api v2.22 and workflow-support 2.15

timja commented 7 years ago

svanoort:

manschwetus / florian_meser We appreciate your patience - releases are now cut and this should reflect a fairly comprehensive improvement for your case plus several related one.  It requires Pipeline API Plugin v2.22 and Pipeline Supporting APIs version 2.15 to be installed to get the full benefits. 

Please give them a try and let us know how they work out for you – based on our testing you should see a tremendous performance improvement from these changes!

timja commented 7 years ago

puneeth_n:

Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 1 not equal to parallelBranchEndNodes: 0
Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 16 not equal to parallelBranchEndNodes: 15
Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 20 not equal to parallelBranchEndNodes: 19
Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 23 not equal to parallelBranchEndNodes: 22
Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 1 not equal to parallelBranchEndNodes: 0
Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 16 not equal to parallelBranchEndNodes: 15
Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 20 not equal to parallelBranchEndNodes: 19
Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 23 not equal to parallelBranchEndNodes: 22
Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 1 not equal to parallelBranchEndNodes: 0
Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
SEVERE: nestedBranches size: 16 not equal to parallelBranchEndNodes: 15

Jenkins is spamming the logs with this. Jenkins 2.73.1. Pipeline API Plugin v2.22 and Pipeline Supporting APIs version 2.15. How do I fix this?

timja commented 7 years ago

svanoort:

puneeth_n You'll need to open a new bug and provide the pipeline that triggers this error.  It could be an issue in one of two places and there's nothing in that log to permit me to reproduce this.

timja commented 7 years ago

scm_issue_link:

Code changed in jenkins
User: Sam Van Oort
Path:
pom.xml
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecution.java
src/main/java/org/jenkinsci/plugins/workflow/graph/BlockStartNode.java
src/main/java/org/jenkinsci/plugins/workflow/graph/FlowNode.java
src/main/java/org/jenkinsci/plugins/workflow/graph/GraphLookupView.java
src/main/java/org/jenkinsci/plugins/workflow/graph/StandardGraphLookupView.java
src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScanningUtils.java
src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/NodeStepNamePredicate.java
src/test/java/org/jenkinsci/plugins/workflow/graph/FlowNodeTest.java
src/test/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScannerTest.java
http://jenkins-ci.org/commit/workflow-cps-plugin/88ffdfc69c43bd4dde21a6578b5ac466999b4fd4
Log:
Revert "JENKINS-37573 / JENKINS-45553 Provide a fast view of block structures in the flow graph"