stefantalpalaru commented 5 years ago

Problem

The test suite keeps getting bigger and slower, because we need to add more tests for more features. It will keep getting bigger and slower for as long as the software is in active development.

Assumptions

test suite runtime is always limited by something:
- CI service limits (50 minutes in the Travis CI free tier)
- number of CI jobs running in parallel (2 for Travis, 6 for Azure, 10 Linux jobs + 4 macOS jobs for the Status Jenkins instance - shared with other Status projects)
- developer's willingness to wait while the CI job is queued and executed, in a scenario in which only PRs passing all tests can be merged
we will be adding tests at a higher rate than we can optimise run times by throwing more hardware at the problem, by parallelising tests or optimising the software.

Proposed solutions

Split the test suite

Have a very small and very fast "sanity checks" subset of the test suite running and passing for each PR before merging it.

The bulk of the test suite gets to run out-of-band. When that fails, we get a notification pinpointing the broken commit/PR. By that time, the broken commit and some new ones have been merged.

Only works in a devel branch that can be broken and mended without consequences for the end user. Brings shorter wait times for people making PRs and, at the same time, more complications if an older commit turns to be broken and can't be reverted easily because newer commits have built on top of it while the complete test suite was running.

So the trade-off is between development speed in the common case and developer frustration in some corner cases.

Unexpected benefit: a reduction of PR-skipping, since the time gained by committing directly to devel is now 5 minutes instead of 50.

mratsim commented 5 years ago

Regarding scope

Regarding splitting I'm OK with it but for the following:

continuous fuzzing: have a fuzzer that runs in the night
long-running testnet
measuring performance or energy efficiency

I.e. I think anything that is related to spec, consensus or interoperability should be well-tested before merging.

Regarding productivity

The costly thing about a broken devel are the context switches.

For example, I was in the middle of SSZ debugging and suddenly I have an isDigit proc not existing in chronicles (https://github.com/status-im/nim-chronicles/commit/f403e28f042b1c6a246ef8af36f218981abb4e24), I drop everything and investigate, well here it was my fault as I was on Nim devel during development but then I have to go back to SSZ, put everything back in memory.

Basically if a commit breaks something, I want it to be nicely contained in a single branch so that only one person has to investigate instead of everyone else of the team being forced to context switch.

While you are refining your PR and commits having sudden interference from others is very disrupting, that was the reason why we choose to pin Nim.

I.e. I think breaking the workflow from those that want to merge during the day is more costly than having the one who worked on it wait for 30-60 min

Regarding cost

New parallel builds seems to be the cheapest at Azure-Pipelines at $40 per parallel job + no compromise on Windows/Max/Linux coverage. That will probably be needed as the team grows.

Another thing using all the cores of the machines we build on.

And lastly, we need to actually measure the time where we spend CI time: https://github.com/status-im/nim-beacon-chain/issues/450

stefantalpalaru commented 5 years ago

I was on Nim devel during development

So how is that related to CI testing?

I have another example: I broke the master branch on macOS by enabling --debugger:native. That didn't stop people from pushing other commits or merging PRs, because they saw the macOS CI job as something they didn't break so they didn't feel the need to stop everything and fix it.

Basically if a commit breaks something, I want it to be nicely contained in a single branch so that only one person has to investigate instead of everyone else of the team being forced to context switch.

Even if the price is something like a maximum of 10 PR merges being possible in a day, due to CI waits?

Git is supposed to facilitate fast development. This CI testing latency we impose on it is artificial and, in the medium to long run, detrimental. The kind of stability we ask now from individual PRs/commits is only needed before merging devel into master.

Further more, if people start sticking to their feature branches and delay merging them into devel, we're going to see more merge conflicts like this one: https://github.com/status-im/nim-beacon-chain/pull/191

mratsim commented 5 years ago

I was on Nim devel during development

So how is that related to CI testing?

That's an example of how breakage unrelated to your work is disruptive. I had the same with NimYAML last week and needed to bump it, same thing with Nim 1.0 and shifts/mod changes.

Even if the price is something like a maximum of 10 PR merges being possible in a day, due to CI waits?

Git is supposed to facilitate fast development. This CI testing latency we impose on it is artificial and, in the medium to long run, detrimental. The kind of stability we ask now from individual PRs/commits is only needed before merging devel into master.

The price is that we pay for more parallel jobs or for our own hardware.

Further more, if people start sticking to their feature branches and delay merging them into devel, we're going to see more merge conflicts like this one: #191

That was discussed in Berlin, we don't keep branches alive anymore.

stefantalpalaru commented 5 years ago

The price is that we pay for more parallel jobs or for our own hardware.

https://en.wikipedia.org/wiki/Amdahl%27s_law

Most tests are of the embarrassingly parallel variety, but there's still a significant obligatory serial part, like fetching Git submodules, compiling the test runner, (downloading test fixtures, until recently), gathering test results, etc. This serial part is going to limit the minimum run time.

Then we have outliers - individual tests taking minutes, like some fixture-based tests, or tens of minutes, like a future local sim run for 4 epochs until finalisation.

We're not the first to have this problem. Look at Chromium: https://www.chromium.org/developers/testing/isolated-testing/infrastructure

They first parallelised the load by allowing any contributor to test a local commit on multiple remote test servers: "Before the Try Server, team members were not testing on other platforms than the one they were developing on, causing constant breakage. This helped getting at 50 commits/day."

But 50 commits a day was not enough, so they introduced a commit queue, with automated merging upon test success: "Before the Commit Queue, the overhead of manually triggered proper tests on all the important configuration was becoming increasingly cumbersome. This could be automated and was done. This helped sustain 200 commits/day."

That's not enough either, so they have automated "sharding" of the tests in an AppEngine cluster of test runners, controlled by a queue manager that distributes the jobs and collects the results.

Isn't it easier to deal with breakage in a devel branch?

we don't keep branches alive anymore

You can't just declare it. You have to facilitate a workflow where long-lived feature branches are not necessary.

zah commented 4 years ago

I think we have a lot more room to grow before we ran into the problems that Chromium is facing. We should actively pursue breaking up the test suite into smaller pieces that can run in parallel. Jenkins has some features specifically designed for this. The sequential parts can be optimised with the stash/unstash feature for example and the parallel execution can scale across nodes. So, let's first prepare ourselves for such scaling and consider the more extreme measures only when it's clear that throwing more worker nodes at the problem is not a solution.

mratsim commented 4 years ago

Also Chromium main issue is C++.

stefantalpalaru commented 4 years ago

Also Chromium main issue is C++.

You'd be surprised how similar Nim templates and macros are to C++ templates, when it comes to compile times. On an overclocked FX-8320E@4.3GHz, performance governor:

$ rm -rf nimcache; time make beacon_node
Building: build/beacon_node

real    1m9.906s
user    4m56.011s
sys 0m39.485s

mratsim commented 4 years ago

Closing stale. Though here is a current proposal: https://github.com/status-im/nim-beacon-chain/pull/1693#issuecomment-696637389

CI proposal:

We test minimal testnet in Github Action

We only test mainnet in Jenkins, reducing Jenkins waiting time by ~15min.

We deactivate Azure Windows for now as their machine are too slow at the moment.

We add back ARM64 and maybe PowerPC64 to Travis if possible.

status-im / nimbus-eth2

[RFC] CI testing strategies #548

Problem

Assumptions

Proposed solutions

Split the test suite

Regarding scope

Regarding productivity

Regarding cost