Closed victorb closed 5 years ago
To try to isolate the flaky tests, I started by looking at the master branch of the js-ipfs project. I created a tool that gathers the failure report data and renders a simple chart to help highlight failures. For each failure I calculate the standard deviation of the job run numbers it occurred in, and only include failures with a stdev over 1; this eliminates failures that happen in consecutive runs. The more sporadic the failures, the higher the stdev.
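For illustration, the stdev filter described above could be sketched like this (my reconstruction in plain JavaScript, not the actual tool's code): consecutive failing runs produce a low stdev (a persistent breakage), while scattered failures across the run range produce a high one (flakiness).

```javascript
// Sketch of the stdev-based flakiness filter (a reconstruction, not the
// tool's real implementation). Input: the CI run numbers in which a
// given test failed.
function stdev(runNumbers) {
  const mean = runNumbers.reduce((a, b) => a + b, 0) / runNumbers.length;
  const variance =
    runNumbers.reduce((sum, n) => sum + (n - mean) ** 2, 0) / runNumbers.length;
  return Math.sqrt(variance);
}

function likelyFlaky(runNumbers, threshold = 1) {
  return stdev(runNumbers) > threshold;
}

// Three consecutive failures: stdev ≈ 0.82, so this looks like a real
// breakage rather than flakiness.
console.log(likelyFlaky([201, 202, 203])); // false
// Failures scattered across the run range: stdev ≈ 35.6 → likely flaky.
console.log(likelyFlaky([150, 190, 237])); // true
```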
The rows are ordered by the total number of failures for the given error / test.
This is the chart for js-ipfs/master for runs 150 (early July) to 237 (now).
Chart: https://gateway.ipfs.io/ipfs/QmdkRvAuDbktNJzEF3W6Sm9GuqRcsTFKMGCExZ3bYUuyoA/
Raw data: https://ipfs.io/ipfs/QmdkRvAuDbktNJzEF3W6Sm9GuqRcsTFKMGCExZ3bYUuyoA/js-ipfs.master.150.237.json
The following failures seem the most likely candidates for being flaky tests. Some of these only fail on certain platforms / versions of Node.js, though.
| Count | Platform | Node.js | Test | Suite |
|---|---|---|---|---|
| x18 | macos | 8.11.3 | should import an exported key | interface-ipfs-core tests .key.import |
| x5 | macos | 8.11.3 | should get repo stats | interface-ipfs-core tests .repo.stat |
| x4 | macos | 8.11.3 | "before all" hook | interface-ipfs-core tests .swarm.localAddrs |
| x4 | macos | 8.11.3 | "before all" hook | interface-ipfs-core tests .swarm.peers |
| x3 | macos | 8.11.3 | "before all" hook | interface-ipfs-core tests .swarm.disconnect |
| x3 | macos | 10.4.1 | 3 peers | bitswap transfer a block between |
| x3 | windows | 8.11.3 | add alias | cli files daemon off (directly to core) |
| x3 | windows | 8.11.3 | add recursively test | cli files daemon on (through http-api) |
| x3 | windows | 10.4.1 | handles multiple hashes | cli pin daemon off (directly to core) ls |
| x3 | windows | 10.4.1 | lists all pins when no hash is passed | cli pin daemon off (directly to core) ls |
| x3 | windows | 10.4.1 | recursively (default) | cli pin daemon off (directly to core) add |
| x2 | macos | 8.11.3 | "after all" hook | interface-ipfs-core tests .repo.stat |
| x2 | macos | 8.11.3 | "after all" hook | interface-ipfs-core tests .stats.bw |
| x2 | macos | 8.11.3 | "before all" hook | interface-ipfs-core tests .pingReadableStream |
| x2 | macos | 8.11.3 | "before all" hook | interface-ipfs-core tests .stats.bitswap |
| x2 | macos | 8.11.3 | "before all" hook | interface-ipfs-core tests .stats.bwPullStream |
| x2 | macos | 8.11.3 | "before all" hook | interface-ipfs-core tests .stats.repo |
| x2 | macos | 8.11.3 | "before all" hook | interface-ipfs-core tests .swarm.connect |
| x2 | macos | 10.4.1 | 2 peers | bitswap transfer a block between |
| x2 | windows | 8.11.3 | add --silent | cli files daemon off (directly to core) |
| x2 | windows | 8.11.3 | add directory with trailing slash test | cli files daemon off (directly to core) |
| x2 | macos | 8.11.3 | should get repo stats (promised) | interface-ipfs-core tests .repo.stat |

I'm going to take this information and try to pull out some of the tests that I think we should apply retry logic to. I will also push up the tools I used to generate this information so that we can use it for other projects.
This is so awesome, really great to know where to focus our energy.
Agree with Alan, awesome work @travisperson!
We should be able to separate test failures caused by timeouts (which I think many of these are) from "normal" test failures and "exceptional" test failures. A normal test failure would be a test case with an assertion that is failing; an exceptional test failure would be something like `yarn install` not finishing because of a 404 response, outside of the test suites. I'll make sure the pipelines can handle it (go-ipfs already does this), then it'll be a bit easier to show in the table.
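That separation could be sketched roughly as follows. The category names come from the comment above; the matching patterns are my assumptions (mocha reports exceeded timeouts as "Timeout of Nms exceeded"), and they would need adjusting to whatever the Jenkins reports actually contain.

```javascript
// Hypothetical failure classifier, assuming mocha-style error messages.
// The regexes are guesses, not the real Jenkins report format.
function classifyFailure(message) {
  // Mocha reports an exceeded timeout as "Timeout of <n>ms exceeded".
  if (/timeout of \d+ms exceeded/i.test(message)) return 'timeout';
  // Chai/Node assertion failures are prefixed with "AssertionError".
  if (/AssertionError/.test(message)) return 'normal';
  // Everything else (e.g. a 404 during `yarn install`) is infrastructure.
  return 'exceptional';
}

console.log(classifyFailure('Error: Timeout of 2000ms exceeded.'));        // 'timeout'
console.log(classifyFailure('AssertionError: expected 404 to equal 200')); // 'normal'
console.log(classifyFailure('error An unexpected error occurred: 404'));   // 'exceptional'
```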
It'll be very useful to show the output of the failure when hovering/clicking on a cell in the table, so we could see directly what's going wrong.
@travisperson can you publish the code for generating this somewhere?
Published https://github.com/travisperson/jenkins-flake-report
Are these all timeout errors?
I think most of them are.
> We should be able to separate test failures because of timeouts
Ya, the data I'm getting from Jenkins has some information we can test for to see if it's a timeout.
> It'll be very useful to show the output of the failure when hovering/clicking on a cell in the table, so we could see directly what's going wrong.
Ya, I think that would be great. Currently the script is just a Go HTML template, but we could at least use a `title` attribute or something to quickly show some information, and extend it further to be more detailed. I originally wrote it as a simple React table but didn't want to deal with all the dependencies, so I converted it to plain HTML.
js-ipfs and js-libp2p are the biggest JS codebases we have, and their test suites not only take a long time to run but are also flaky, meaning they fail randomly.
We should make use of whatever tools we have (mocha's `.retries()`, for example) to make them not flaky. Nothing is worse than seeing, after a 40-minute test run, that one test timed out.

We can consider this solved once you can run the tests 10 times for the same commit and always get a successful run (if it was successful the first time) on CI.
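Mocha's built-in mechanism is `this.retries(n)` inside an `it` or `describe` block. For code outside mocha, the same idea can be sketched as a standalone wrapper (an illustration only, not part of any of these projects):

```javascript
// Sketch of retry logic for flaky async operations. Mocha has this built
// in via `this.retries(n)`; this helper just illustrates the same idea.
async function withRetries(fn, retries = 3) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn(); // success: return immediately
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError; // all attempts exhausted
}

// Example: a hypothetical operation that fails twice before succeeding.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error('transient failure');
  return 'ok';
};

withRetries(flaky).then((result) => console.log(result)); // prints "ok"
```

In a mocha suite itself you would instead write `it('transfers a block', function () { this.retries(4); ... })`, letting the runner re-execute only the failing test.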