
[JENKINS-45553] Parallel pipeline execution scales poorly #9247

Closed timja closed 7 years ago

timja commented 7 years ago

Execution of parallel blocks scales poorly once the number of branches N exceeds ~100. With ~50 nodes (each with 4 executors, for a total of ~200 slots), the following pipeline job takes extraordinarily long to execute:

// Build a map of SUB_JOBS parallel branches; each branch only grabs a node
// and echoes, so nearly all of the measured time is Pipeline overhead.
def stepsForParallel = [:]
for (int i = 0; i < Integer.valueOf(params.SUB_JOBS); i++) {
  def s = "subjob_${i}"  // branch name, used as the map key
  stepsForParallel[s] = {
    node("darwin") {
      echo "hello"
    }
  }
}
parallel stepsForParallel

SUB_JOBS   Time (sec)
--------   ----------
     100           10
     200           40
     300           96
     400          214
     500          392
     600          660
     700          960
     800         1500
     900         2220
    1000   gave up...
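
A rough fit of the numbers above (a back-of-the-envelope estimate, not part of the original report) shows the growth is clearly superquadratic; taking the log-log slope between the endpoints:

    exponent ≈ ln(2220 / 10) / ln(900 / 100) ≈ 5.40 / 2.20 ≈ 2.5

so runtime grows roughly as N^2.5, meaning doubling the branch count costs about 5-6x the wall time (2^2.5 ≈ 5.7).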

At no point does the underlying system become taxed: CPU utilization stays very low, even though this is a very beefy system (28 cores, 128 GB RAM, SSDs).

CPU and Thread CPU Time Sampling (via VisualVM) are attached for reference.

Originally reported by tskrainar, imported from: Parallel pipeline execution scales poorly
  • assignee: jglick
  • status: Closed
  • priority: Critical
  • resolution: Fixed
  • resolved: 2017-08-26T22:19:44+00:00
  • imported: 2022/01/10

timja commented 7 years ago

scm_issue_link:

Code changed in jenkins
User: Sam Van Oort
Path:
pom.xml
src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecution.java
src/main/java/org/jenkinsci/plugins/workflow/graph/BlockStartNode.java
src/main/java/org/jenkinsci/plugins/workflow/graph/FlowNode.java
src/main/java/org/jenkinsci/plugins/workflow/graph/GraphLookupView.java
src/main/java/org/jenkinsci/plugins/workflow/graph/StandardGraphLookupView.java
src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScanningUtils.java
src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/NodeStepNamePredicate.java
src/test/java/org/jenkinsci/plugins/workflow/graph/FlowNodeTest.java
src/test/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScannerTest.java
http://jenkins-ci.org/commit/workflow-cps-plugin/c0daeb5ce9ba55e6f51cb6c8db903cc5fbba324b
Log:
Merge pull request #52 from jenkinsci/revert-50-jenkins-27395-block-structure-lookup

Revert "JENKINS-37573 / JENKINS-45553 Provide a fast view of block structures in the flow graph"

timja commented 6 years ago

florian_meser:

Hello svanoort, as you mentioned above, I just tested the new versions and there definitely is an improvement. I updated shortly after you wrote that comment and I'm still using those versions. We rely heavily on this feature, since our whole test infrastructure depends on deploying data onto nodes for many branches, so we essentially run a 24/7 Jenkins with up to 1-2k executors in the queue.

Nevertheless, the scaling cannot be considered stable. We have many tests that need ~2 min but wait ~10-15 min (worst case) to be processed by Jenkins. As mentioned in https://issues.jenkins-ci.org/browse/JENKINS-45876, there seems to be a roughly quadratic or exponential correlation. That means even with this big improvement, it reaches its limits once you cross that edge.

In my opinion there is still room for further improvement to ensure that large Jenkins environments also stay effective.

timja commented 6 years ago

svanoort:

florian_meser I agree completely that there is some room for further optimization of massively-parallel pipeline execution. The best place to follow the work and investigations is now https://issues.jenkins-ci.org/browse/JENKINS-47724. That ticket also includes some concrete advice that may help with your scenario.

If you'd like to add some quantitative scaling observations to help identify where the bottleneck is, that would be of some assistance. I also expect the work currently in beta release from JENKINS-47170 to help a bit: it significantly reduces the per-FlowNode overhead of pipelines, though that is only a small component of parallel execution.

Very likely you'll see a big improvement from the next phase of that work, https://issues.jenkins-ci.org/browse/JENKINS-38381, which was the culprit for a lot of the nonlinear behavior here. That's slated to be my next strategic push on performance, along with some tactical fixes that may help with your scenario.

timja commented 6 years ago

svanoort:

One other comment: the bottleneck appears only with massive parallels in a single pipeline. If you break your job into smaller ones with fewer parallel branches in each, the per-branch overhead will matter much less; see the sketch below.
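
A minimal sketch of that idea within a single job (illustrative only, not from this thread; the batch size and branch body are assumptions): run the branches in sequential batches, so that no single parallel step holds hundreds of branches at once.

def branches = [:]
def names = []
for (int i = 0; i < Integer.valueOf(params.SUB_JOBS); i++) {
  def s = "subjob_${i}"
  names << s
  branches[s] = {
    node("darwin") {
      echo "hello"
    }
  }
}

// Run at most batchSize branches per 'parallel' call, one batch at a time.
int batchSize = 100  // assumed value; tune to the available executor count
for (int start = 0; start < names.size(); start += batchSize) {
  def batch = [:]
  for (int j = start; j < Math.min(start + batchSize, names.size()); j++) {
    def name = names[j]
    batch[name] = branches[name]
  }
  parallel batch
}

The trade-off is that a slow branch in one batch delays the start of the next batch, so this exchanges some concurrency for a much smaller peak overhead per parallel step.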

Pipeline is also never going to achieve fully linear scale-out with large numbers of executors, because only some parts of the execution can take full advantage of parallelism, primarily the shell/batch/powershell steps that should be doing the bulk of the work. Our work is mostly focused on reducing the other overheads so Pipeline can spend more time executing those steps.

Amdahl's Law in spades, basically.
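
To put a number on that (a textbook Amdahl's Law illustration, not a measurement from this issue): if a fraction p of the total work parallelizes across n executors, the best possible speedup is

    speedup(n) = 1 / ((1 - p) + p / n)

With p = 0.95 and n = 200 executors, that is 1 / (0.05 + 0.95/200) ≈ 18, nowhere near 200, and no number of executors can push it past 1 / (1 - p) = 20.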

timja commented 6 years ago

florian_meser:

svanoort I'm currently trying to implement some time measurement to get quantitative scaling observations. I don't have much time to spend on that right now, though. As soon as I have something, I'll let you know.

I don't know if this is off-topic, but it seems another showstopper just came in. Therefore the question: are there any observations regarding the Meltdown/Spectre Windows 7 updates, which, again, seem to dramatically reduce the performance of our so-called "massive parallels in a single pipeline"?

I'm observing a dramatic loss of performance, although no changes were made to our Jenkins pipelines that would explain this symptom. KB4056894 was definitely a patch containing Meltdown/Spectre mitigations. I'm quite curious whether I'm the only one having this kind of trouble.

timja commented 6 years ago

svanoort:

florian_meser I'm not sure what the performance impact of the Meltdown/Spectre updates is on Windows (I'm not really set up for scaling tests on Windows), but it might be related to changes in I/O performance.

Please try out the advice I just added in the latest comment on https://issues.jenkins-ci.org/browse/JENKINS-47724; this should help considerably. The last few months have been heavily focused on performance improvements to Pipeline, and it should show in a big way.

timja commented 2 years ago

[Originally depends on: JENKINS-38381]

timja commented 2 years ago

[Originally depends on: JENKINS-36547]