Parallelize aegir scripts

victorb commented 6 years ago

I made a quick test to see if it's viable to parallelize the different aegir scripts (branch here: https://github.com/ipfs/jenkins-libs/blob/parallize/vars/javascript.groovy). The reason is that ipfs/js-ipfs jobs take ~20 minutes from start to finish. The tests in ipfs/js-ipfs first run the nodejs tests, then browser, and after that webworker. The experiment was to run the build once for each platform + nodejs version, then reuse that build for each combination of test + OS + nodejs version.

Conclusion: doesn't actually speed up builds significantly

Because we need one job per OS + nodejs version + aegir command, we end up with 12 jobs for running tests.
This seems to slow down the Jenkins pipeline because stash/unstash (which we use to pass files around) does not work well with large directories, so each job spends ~2 minutes just transferring files first.
Since tests are not isolated, we can only run one job per worker, so this parallelization saturates the queue. We have 5 workers of each OS, while a parallel build of aegir scripts would require at least 6 of each OS (if we run one job). The queue gets full from just one test run.
When running 12 jobs at the same time in one stage, reporting back to the master node is delayed, so jobs that finish in 2 minutes are not marked as finished until 5 minutes after they actually finish.
Specific to ipfs/js-ipfs: npm run test:node is still the slowest one and delays the complete report after the pipeline has finished.
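To make the worker arithmetic above concrete, here is a rough sketch of the job matrix. The thread only gives the totals (12 jobs, 5 workers per OS, at least 6 needed per OS), so the split into 2 OSes x 2 nodejs versions x 3 aegir commands is an assumption, and the OS and version names are placeholders.

```javascript
// Assumed matrix: only the totals (12 jobs, 6 per OS) come from the thread.
const oses = ['os-a', 'os-b'] // placeholder names
const nodeVersions = ['node-x', 'node-y'] // placeholder names
const commands = ['test:node', 'test:browser', 'test:webworker']

// Build one job per OS + nodejs version + aegir command.
function buildJobs (oses, nodeVersions, commands) {
  const jobs = []
  for (const os of oses) {
    for (const node of nodeVersions) {
      for (const command of commands) {
        jobs.push({ os, node, command })
      }
    }
  }
  return jobs
}

// Count how many jobs (and so how many workers) each OS needs at once.
function jobsPerOs (jobs) {
  const counts = {}
  for (const job of jobs) counts[job.os] = (counts[job.os] || 0) + 1
  return counts
}

const jobs = buildJobs(oses, nodeVersions, commands)
const workersPerOs = 5 // from the thread

console.log(`total jobs: ${jobs.length}`)
for (const [os, needed] of Object.entries(jobsPerOs(jobs))) {
  console.log(`${os}: needs ${needed} workers, has ${workersPerOs}`)
}
```

With one job per worker, any OS that needs more workers than it has will leave jobs waiting in the queue.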

travisperson commented 6 years ago

I'm going to walk through some details of how tests are currently executed in AEgir, just to make sure nothing is missed. A lot of this is probably common knowledge, so sorry if this is riddled with details.

When it comes to parallelization, AEgir does not have a concept of test suites. The only concept it has parallelization around is targets, but this parallelization is currently turned off by a hard-coded concurrent execution limit of 1. Increasing this value, though, doesn't do anything different from this groovy script. It simply allows multiple targets to run concurrently.
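As an illustration of what a concurrent execution limit does, here is a minimal sketch of a task runner with a configurable limit. It is not AEgir's actual implementation; `runWithLimit` and the fake targets are hypothetical names made up for this example. With a limit of 1 the targets run one at a time, and raising the limit only lets whole targets overlap.

```javascript
// Run async tasks with at most `limit` of them in flight at once.
// Hypothetical helper, not part of AEgir.
async function runWithLimit (tasks, limit) {
  const results = new Array(tasks.length)
  let next = 0
  // Each worker keeps pulling the next unstarted task until none remain.
  async function worker () {
    while (next < tasks.length) {
      const index = next++
      results[index] = await tasks[index]()
    }
  }
  const workers = []
  for (let i = 0; i < Math.min(limit, tasks.length); i++) workers.push(worker())
  await Promise.all(workers)
  return results
}

const targets = ['test:node', 'test:browser', 'test:webworker']

// Fake target that records how many targets are running at the same time.
function makeFakeTarget (name, tracker) {
  return async () => {
    tracker.running++
    tracker.peak = Math.max(tracker.peak, tracker.running)
    await new Promise(resolve => setTimeout(resolve, 10))
    tracker.running--
    return name
  }
}

// Returns the highest number of targets observed running concurrently.
async function peakConcurrency (limit) {
  const tracker = { running: 0, peak: 0 }
  await runWithLimit(targets.map(t => makeFakeTarget(t, tracker)), limit)
  return tracker.peak
}

peakConcurrency(1).then(peak => console.log(`limit 1: peak concurrency ${peak}`))
peakConcurrency(3).then(peak => console.log(`limit 3: peak concurrency ${peak}`))
```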

In this js-ipfs project there are three named test targets, test:node, test:browser, test:webworker.

When running test:node, each of the separate test suites defined in the package.json (test:node:core, test:node:http, test:node:gateway, test:node:cli) is run serially. AEgir does not make a distinction between these, as it's not aware of them.

I ran the suites in parallel with each other (using a simple shell script); each row is a single run.

Test Run	core	http	gateway	cli	Total
1	24.40s	113.33s	3.88s	491.06s	632.67s
2	23.52s	114.36s	5.13s	501.95s	644.96s
3	24.39s	114.07s	6.98s	491.91s	637.35s
4	24.98s	112.91s	5.07s	491.04s	634.00s
5	23.86s	113.06s	5.59s	490.48s	632.99s
6	23.86s	114.57s	5.29s	492.69s	636.41s
7	22.40s	113.37s	8.79s	493.52s	638.08s
8	23.83s	113.34s	9.97s	498.46s	645.60s
9	21.07s	114.43s	5.66s	482.90s	624.06s
10	22.11s	114.49s	4.60s	506.10s	647.30s
Avg	23.44s	113.80s	6.10s	494.01s	637.34s

The test:node:cli suite takes the longest. This is probably in part because many tests are run in both online and offline modes. Halving the ~637s average total, the tests run ~318s in either mode on average.

So overall, there isn't a huge advantage to breaking these up and running them concurrently on the same worker. The cli tests currently dominate the time.
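The comparison can be sketched from the "Avg" row of the table. This assumes the Total column is the sum of the four suite times (which matches each row), and that a concurrent run is at best as fast as its slowest suite, ignoring any contention on the shared worker.

```javascript
// Average suite durations in seconds, copied from the "Avg" row of the table.
const averages = { core: 23.44, http: 113.80, gateway: 6.10, cli: 494.01 }

// Serial wall-clock time: suites run one after another, so durations add up.
function serialSeconds (durations) {
  return Object.values(durations).reduce((sum, s) => sum + s, 0)
}

// Idealized concurrent wall-clock time: bounded below by the slowest suite.
// Assumption: no slowdown from contention on the shared worker.
function concurrentSeconds (durations) {
  return Math.max(...Object.values(durations))
}

const serial = serialSeconds(averages)
const concurrent = concurrentSeconds(averages)
const savedPercent = ((serial - concurrent) / serial) * 100

console.log(`serial: ${serial.toFixed(2)}s`)
console.log(`concurrent (best case): ${concurrent.toFixed(2)}s`)
console.log(`saved: ${savedPercent.toFixed(1)}%`)
```

Even in this best case, the saving is limited to the roughly 143s taken by the three shorter suites, because the cli suite alone sets the floor.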

The rest of this post goes into some depth on why the cli tests are so slow.

The three longest cli tests take ~195s together, almost 30% of the ~637s total:

do not crash if Addresses.Swarm is empty (66827ms)
should handle SIGINT gracefully (65188ms)
should handle SIGTERM gracefully (63033ms)

If we remove these tests, the total drops from ~637s to ~442s, or ~221s in a single mode on average.

A lot of the cli tests (even after the daemon is running) appear to take upwards of 800ms on average. This appears to be mostly due to the startup time of the cli.

I ran a quick test, and a full run of a command takes ~850ms (matching the cli test speed). The time from the start of code execution to process exit averaged ~250ms, which means that ~600ms is spent just parsing and loading modules.

I was able to measure this by simply wrapping the main require statements of src/cli/bin.js.

diff --git a/src/cli/bin.js b/src/cli/bin.js
index 1d53444..72a6878 100755
--- a/src/cli/bin.js
+++ b/src/cli/bin.js
@@ -2,11 +2,13 @@

 'use strict'

+const st = (new Date).getTime()
 const yargs = require('yargs')
 const updateNotifier = require('update-notifier')
 const readPkgUp = require('read-pkg-up')
 const utils = require('./utils')
 const print = utils.print
+console.log(((new Date).getTime() - st) / 1000)

 const pkg = readPkgUp.sync({cwd: __dirname}).pkg
 updateNotifier({

The test:node:cli tests spawn the cli 201 times. At ~600ms of module loading per spawn, this results in an overhead of ~120s for the full test run.
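A small sketch of the same measurement and estimate follows. It swaps the `Date`-based timing from the diff for `process.hrtime.bigint()`, which is monotonic. The helper names are made up for this example, and the built-in `http` module is only a stand-in for the real cli dependencies.

```javascript
// Time how long a single require() takes, in milliseconds.
// Note: a module that is already in the require cache loads almost instantly,
// so this only reflects startup cost on the first load in a fresh process.
function timeRequire (moduleName) {
  const start = process.hrtime.bigint()
  require(moduleName)
  const end = process.hrtime.bigint()
  return Number(end - start) / 1e6 // nanoseconds to milliseconds
}

// Estimate total startup overhead for a test run that spawns the cli
// `spawnCount` times, each paying `loadMsPerSpawn` to load modules.
function spawnOverheadSeconds (spawnCount, loadMsPerSpawn) {
  return (spawnCount * loadMsPerSpawn) / 1000
}

// 'http' is a stand-in module; the thread measured the cli's own requires.
console.log(`loading 'http' took ${timeRequire('http').toFixed(2)}ms`)

// Figures from the thread: 201 spawns, ~600ms of module loading each.
console.log(`estimated overhead: ${spawnOverheadSeconds(201, 600).toFixed(1)}s`)
```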

victorb commented 6 years ago

Currently working on this. I will add a new npm run test:ci script that runs all tests in parallel.

Todo:

[ ] Make it possible to run test:browser and test:webworker simultaneously; requires a fix in aegir to have dynamic ports in Karma, current issue is port collision
[ ] Make junit test reports have a timestamp or something unique, so we can have many test reports for the same area of tests
[ ] Add test:ci script to js-ipfs and make sure it's working properly and is faster than the current setup
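For the junit item above, one possible way to get unique report names is to combine the target name, a timestamp, and the process id. This is only a sketch: `reportPath` and the `junit-...xml` naming scheme are hypothetical and not something aegir provides.

```javascript
const path = require('path')

// Build a report path that is unique per target, time, and process.
// Hypothetical naming scheme, not an aegir convention.
function reportPath (dir, target, now = new Date(), pid = process.pid) {
  // ISO timestamps contain ':' and '.', which are awkward in file names.
  const stamp = now.toISOString().replace(/[:.]/g, '-')
  // Target names such as "test:browser" contain ':' as well.
  const safeTarget = target.replace(/[^a-zA-Z0-9_-]/g, '-')
  return path.join(dir, `junit-${safeTarget}-${stamp}-${pid}.xml`)
}

console.log(reportPath('reports', 'test:browser'))
console.log(reportPath('reports', 'test:webworker'))
```

Including the process id keeps names distinct even when two targets start within the same millisecond in different processes.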

travisperson commented 6 years ago

Make it possible to run test:browser and test:webworker simultaneously; requires a fix in aegir to have dynamic ports in Karma, current issue is port collision

I don't believe this is an issue with Karma itself; I believe Karma can handle a port that is already in use. When I was looking into some of this parallel work, I found that the ipfsd-ctl server was the issue. Both the browser and webworker tests of aegir use the same browser hooks, which causes two ipfsd-ctl servers to be started.

Aegir should possibly have two hooks, one for the browser and another for webworker. For js-ipfs itself we can either start two different ipfsd-ctl servers, or share a single instance and keep a ref count.
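The ref-count option could look roughly like the sketch below. It does not use the real ipfsd-ctl API: `createSharedServer`, `start`, and `stop` are hypothetical, and the start/stop functions are synchronous fakes, whereas a real server would start and stop asynchronously.

```javascript
// Share one server between several users and stop it when the last one leaves.
// `start` and `stop` are injected so this sketch stays independent of ipfsd-ctl.
function createSharedServer (start, stop) {
  let refs = 0
  let server = null
  return {
    // First caller starts the server; later callers reuse it.
    acquire () {
      if (refs === 0) server = start()
      refs++
      return server
    },
    // Last caller to release stops the server.
    release () {
      if (refs === 0) throw new Error('release() called without a matching acquire()')
      refs--
      if (refs === 0) {
        stop(server)
        server = null
      }
    },
    get refs () { return refs }
  }
}

// Fake start/stop functions that only count how often they are called.
let starts = 0
let stops = 0
const shared = createSharedServer(
  () => { starts++; return { name: 'fake-server' } },
  () => { stops++ }
)

const browserServer = shared.acquire() // e.g. from the browser pre hook
const webworkerServer = shared.acquire() // e.g. from the webworker pre hook
shared.release() // browser post hook: server keeps running
shared.release() // webworker post hook: server is stopped
console.log(`starts: ${starts}, stops: ${stops}`)
```

The alternative from the thread, starting two separate servers, avoids the bookkeeping but needs each server to listen on its own port.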

dryajov commented 6 years ago

Aegir should possibly have two hooks, one for the browser and another for webworker. For js-ipfs itself we can either start two different ipfsd-ctl servers, or share a single instance and keep a ref count.

Agreed, we need to start two ipfsd-ctl servers if we want parallel browser and webworker runs.

victorb commented 6 years ago

This issue was moved to ipfs/testing#102

ipfs-inactive / jenkins

Parallelize aegir scripts #93