scala / scala-jenkins-infra

A Chef cookbook that manages Scala's CI infrastructure.
https://scala-ci.typesafe.com
Apache License 2.0

Increase parallelism in CI #239

Closed: retronym closed this issue 6 years ago

retronym commented 6 years ago

Rather than having Jenkins run the monolithic scripts/job/validate/test script, let's have it run components of that in parallel. That would allow us to better exploit the breadth of our build cluster for a single, uncontended validation job, and hopefully get the best-case feedback time well under 30 minutes.

The recent changes to the definition of validate-main take a step in this direction. All scripts matching test,test/*,integrate/ide are turned into parallel stages of the Jenkins pipeline.

However, we also need to weigh up the extra cost of this approach: repeating the compile step in each of those tasks.

In https://github.com/retronym/hoarder-jenkins-demo, I demonstrate how to use an SBT plugin to reuse compiled classes across multiple, parallel Jenkins workspaces. It is harder than it sounds because of SBT's propensity to recompile when the locations of classfiles in the new workspace don't match the metadata ("Analysis report") it recorded in the previous compile. hoarder patches that metadata with the new paths.

There may be different/simpler ways to convince SBT not to recompile in our simple use case (we want the already-compiled projects to be locked/read-only, so maybe just adding a new compilation policy to SBT would suffice). But leaving the un-patched metadata in place might mess up subsequent SBT tasks that use it.

It may be worthwhile migrating to an in-repository Jenkinsfile pipeline definition before attempting to integrate hoarder. That will let us evolve our SBT build definition in concert with the pipeline definition.

Note: the approach of having the pipeline glob for test scripts and spawn parallel stages dynamically for each is really just a stepping stone towards #240, which would have the pipeline definition in scala/scala. We could then "hard code" the test granularity in the pipeline definition, and even just have it call sbt directly rather than indirecting through the scripts/job/**/* scripts.

Parallel stages in Jenkins can be defined either to fail fast or not. I propose that we configure the builds of everything except the tip of the PR to fail fast, to save resources.

retronym commented 6 years ago

A simpler alternative to Hoarder might be to use SBT 1.x's new IncOptions.withEnabled option to disable the incremental compiler. If that's not suitable, maybe we can change Zinc to add the compile mode we're after.
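A minimal sketch of what that setting might look like, assuming the stock incOptions key in an sbt 1.x build.sbt:

// build.sbt (sbt 1.x) -- sketch only; turns off Zinc's incremental compilation,
// so compiles no longer rely on the incremental Analysis metadata.
incOptions := incOptions.value.withEnabled(false)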

retronym commented 6 years ago

Even better, I found skip in compile := true, already available in SBT 0.13.x, which has exactly the semantics we need.

https://github.com/retronym/hoarder-jenkins-demo/blob/simpler-skip-compile/Jenkinsfile
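For reference, a minimal sketch of that setting as a generated .sbt file dropped into a workspace (the experimental Jenkinsfile later in this thread injects an equivalent __skipCompile.sbt):

// __skipCompile.sbt -- sketch; with skip := true the compile task is skipped, so the
// classfiles already on disk (unstashed from the earlier compile step) are used as-is.
skip in compile in ThisBuild := true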

retronym commented 6 years ago

I'm experimenting with a fully parallel Jenkins pipeline using the skip in compile trick.

https://scala-ci.typesafe.com/job/scala-validate-main-experimental/2/

It is probably too parallel at the moment, but I thought it would be useful to dial it up to 11 and see what we can learn.

There seems to be some sort of infrastructural problem whereby the .class files of the SBT build definition get out of sync:

[test-partest-presentation] error: error while loading PartestTestListener, /home/jenkins/workspace/scala-validate-main-experimental/project/target/scala-2.10/sbt-0.13/classes/scala/build/PartestTestListener.class (No such file or directory)
[test-partest-presentation] /home/jenkins/workspace/scala-validate-main-experimental/build.sbt:682: error: scala.build.PartestTestListener does not have a constructor
[test-partest-presentation]     testListeners in IntegrationTest += new PartestTestListener(target.value)
[test-partest-presentation]                                         ^
[test-partest-scaladoc] ok 72 - run/usecase-var-expansion.scala         
[test-partest-scaladoc] 
[test-partest-scaladoc] Exception in thread "Thread-3" java.lang.NoClassDefFoundError: scala/build/PartestTestListener$$anonfun$testEvent$1
[test-partest-scaladoc]     at scala.build.PartestTestListener.testEvent(PartestTestListener.scala:49)
[test-partest-scaladoc]     at sbt.React$$anonfun$react$7.apply(ForkTests.scala:138)
[test-partest-scaladoc]     at sbt.React$$anonfun$react$7.apply(ForkTests.scala:138)
[test-partest-scaladoc]     at scala.collection.immutable.List.foreach(List.scala:318)
[test-partest-scaladoc]     at sbt.React.react(ForkTests.scala:138)
[test-partest-scaladoc]     at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:76)
[test-partest-scaladoc]     at java.lang.Thread.run(Thread.java:748)
[test-partest-scaladoc] Caused by: java.lang.ClassNotFoundException: scala.build.PartestTestListener$$anonfun$testEvent$1
[test-partest-scaladoc]     at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[test-partest-scaladoc]     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[test-partest-scaladoc]     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[test-partest-scaladoc]     ... 7 more
[test-partest-scaladoc] Internal error when running tests: sbt.ForkMain$Run$RunAborted: java.net.SocketException: Broken pipe (Write failed)

Maybe we need to turn on skip in compile for the SBT build definition as well, to stop it from reaching back into the original workspace and deleting classfiles.
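An .sbt file under project/ applies to the build definition project, so a sketch of that would be the same setting in a second location (the experimental Jenkinsfile further down also copies __skipCompile.sbt into project/, which has this effect):

// project/__skipCompile.sbt -- sketch; scoped to the build definition project, so
// classes like scala.build.PartestTestListener are reused rather than recompiled.
skip in compile in ThisBuild := true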

retronym commented 6 years ago

A later iteration of the scripts seems to have cleared up those failures, but landed on:

sbt.ForkMain$ForkError: scala.tools.partest.sbt.TestFailedThrowable: output differs
% scalac doc/doc.scala
% <in process execution of presentation/doc> > doc-presentation.log
% diff /home/jenkins/workspace/scala-validate-main-experimental@2/test/files/presentation/doc-presentation.log /home/jenkins/workspace/scala-validate-main-experimental@2/test/files/presentation/doc.check
@@ -1,2 +1 @@
reload: Base.scala, Class.scala, Derived.scala
-Unexpected foo method comment:None
at scala.tools.partest.sbt.SBTRunner.makeStatus(SBTRunner.scala:90)
at scala.tools.partest.sbt.SBTRunner$$anon$1$$anon$2.<init>(SBTRunner.scala:77)
at scala.tools.partest.sbt.SBTRunner$$anon$1.onFinishTest(SBTRunner.scala:73)
at scala.tools.partest.nest.SuiteRunner.runTest(Runner.scala:866)
at scala.tools.partest.nest.SuiteRunner.$anonfun$runTestsForFiles$2(Runner.scala:873)
at scala.tools.partest.package$$anon$2.call(package.scala:134)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

It's possible that this is a latent bug; scaladoc support in the presentation compiler is a flaky area of a flaky component.

retronym commented 6 years ago

Okay, we've got a clean build which was run without contention from other jobs.

The entire pipeline duration was 23 minutes. That's not much better than just letting one executor perform the full job, which takes about 30 minutes.

We have a pretty slow sequential start to the build: 4 minutes to publish the core module as the new starr and 5 minutes to recompile everything with it, and only then can we start the parallel steps.

This creates some queued jobs, which triggers workers 2 and 3 to spin up, but worker startup takes 2-4 minutes. It would be nice if we could signal forthcoming demand so they are primed and ready by the time the sequential part finishes. We could schedule some dummy jobs, but that's pretty icky.

We still have to find the sweet spot for the number of executor slots on each worker. I think 3 is too high.

So maybe this mode is easier to reason about in a model with fast container startup, one build per container, etc.

retronym commented 6 years ago

In case we delete that -experimental Job, here's the config I used:

#!groovy

def testJobsPaths = []
def scalaVersion = ""

node("public") {
    stage('publish') {
        try {
            setDisplayName()
            scalaCheckout()
            ansiColor('xterm') {
                runScript("scripts/jobs/validate/publish-core")
            }
            def props = readProperties file: 'jenkins.properties'
            scalaVersion = props['maven.version.number']
            // NOTE: jobsGlob (like repo_user, repo_name, repo_ref, _scabot_pr, jvmFlavor
            // and jvmVersion used further down) is assumed to be a parameter of the
            // Jenkins job rather than defined in this script.
            testJobsPaths = findFiles(glob: jobsGlob)

        } finally {
            archiveArtifacts artifacts: 'hs_err_*.log,jenkins.properties', allowEmptyArchive: true
        }
    }
    stage('compile') {
        try {
            env['scalaVersion'] = scalaVersion
            ansiColor('xterm') {
                runScript("scripts/jobs/validate/compile-all")
            }
            stash name: "workspace-stash", includes: "**/*"
        } finally {
            archiveArtifacts artifacts: 'hs_err_*.log,jenkins.properties', allowEmptyArchive: true
        }
    }
}

def testStage(scalaVersion, scriptFile) {
    def name = (scriptFile.name == "test") ? "test" : "test-${scriptFile.name}"
    // We need to wrap what we return in a Groovy closure, or else it's invoked
    // when this method is called, not when we pass it to parallel.
    // To do this, you need to wrap the code below in { }, and either return
    // that explicitly, or use { -> } syntax.
    def action = { ->
        node("public") { stage(name) {
            try {
                println("Starting stage ${name} to run ${scriptFile} on ${env.NODE_NAME}@${env.EXECUTOR_NUMBER} in ${env.WORKSPACE}")
                unstash name: "workspace-stash"
                env['scalaVersion'] = scalaVersion
                sh """
                echo "skip in compile in ThisBuild := true" > __skipCompile.sbt
                cp __skipCompile.sbt ./project/
                """
                ansiColor('xterm') {
                    runScript(scriptFile)
                }
            }
            finally {
                println("Ending stage ${name} to run ${scriptFile} on ${env.NODE_NAME}@${env.EXECUTOR_NUMBER} in ${env.WORKSPACE}")

                archiveArtifacts artifacts: 'hs_err_*.log,**/test-reports/**/*.xml,build/junit/TEST-*,build/osgi/TEST-*', allowEmptyArchive: true
                junit allowEmptyResults: true, testResults: '**/test-reports/**/*.xml'
            }
        }}
    }
    [name, action]
}

parallel testJobsPaths.collectEntries{testStage(scalaVersion, it)}

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// END OF BUILD PROPER
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

def setDisplayName() {
    currentBuild.setDisplayName("[${currentBuild.number}] $repo_user/$repo_name#$_scabot_pr at ${repo_ref.take(6)}")
}

def scalaCheckout() {
    checkout changelog: false, poll: false, scm: [$class: 'GitSCM', branches: [[name: '${repo_ref}']], doGenerateSubmoduleConfigurations: false, extensions: [[$class: 'CleanCheckout']], submoduleCfg: [], userRemoteConfigs: [[name: '${repo_user}', refspec: '+refs/heads/*:refs/remotes/${repo_user}/* +refs/pull/*/head:refs/remotes/${repo_user}/pr/*/head', url: 'https://github.com/${repo_user}/${repo_name}.git']]]
}

def runScript(path) {
    sh """#!/bin/bash -ex
    if [ -f /usr/local/share/jvm/jvm-select ]; then
        . /usr/local/share/jvm/jvm-select;
        jvmSelect $jvmFlavor $jvmVersion;
    else
        echo 'WARNING: jvm-select not present. using system default Java';
    fi
    echo scalaVersion=\$scalaVersion
    echo BASH_VERSION="\$BASH_VERSION"
    . ${path}
    """
}

Together with these changes to scala/scala: https://github.com/retronym/scala/tree/topic/stash

adriaanm commented 6 years ago

Nice! On Travis, the startup cost for a VM is < 1 min, and we could reduce it on Jenkins as well by speculatively bringing up all workers whenever we launch one of them.