ipfs-inactive / dev-team-enablement

[ARCHIVED] Dev Team Enablement Working Group

Make test suites for js-ipfs and js-ipfs-api be flaky-free #127

Closed victorb closed 5 years ago

victorb commented 6 years ago

js-ipfs and js-libp2p are the biggest JS codebases we have, and their test suites not only take a long time to run but are also flaky, meaning they fail randomly.

We should make use of whatever tools we have (mocha's `retries()`, for example) to make them not flaky. Nothing is worse than seeing a single test time out after a 40-minute test run.

We can consider this solved once you can run the tests 10 times on CI for the same commit and always get a successful run (given that the first run was successful).
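The retry idea mentioned above can be sketched outside of mocha as well. Below is a minimal, hypothetical retry helper (not part of any of the projects discussed) that re-runs a failing async test function a bounded number of extra times, which is essentially what mocha's `retries()` does per test:

```javascript
// Minimal sketch of retry logic in the spirit of mocha's `retries(n)`:
// re-run a failing async function up to `retries` extra times before
// giving up and re-throwing the last error. Hypothetical helper.
async function withRetries (fn, retries) {
  let lastErr
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn(attempt)
    } catch (err) {
      lastErr = err
    }
  }
  throw lastErr
}

// Usage: a test that fails twice before passing still succeeds
// when given 3 retries.
let calls = 0
withRetries(async () => {
  calls++
  if (calls < 3) throw new Error('flaky failure')
  return 'ok'
}, 3).then(result => console.log(result, calls)) // prints "ok 3"
```

The trade-off, of course, is that retries hide genuine intermittent bugs, so this should be applied selectively to known-flaky tests rather than globally.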

travisperson commented 6 years ago

To isolate which tests are flaky, I started looking at the master branch of the js-ipfs project. I created a tool that gathers the failure report data and creates a simple chart to help highlight failures. I calculate a standard deviation over the job run numbers and only include failures with a stdev over 1; this eliminates failures that happen consecutively. The more sporadic the failures, the higher the stdev.
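The filtering described above can be sketched as follows. This is a hypothetical reconstruction (the data shape and function names are assumptions, not the published tool): for each failing test, take the job numbers it failed in and keep it only when their standard deviation exceeds 1. Consecutive job numbers (a test broken for a stretch of commits) give a low stdev; failures scattered across many runs give a high one.

```javascript
// Hypothetical sketch of the stdev-based flakiness filter described above.
function stdev (xs) {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length
  const variance = xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length
  return Math.sqrt(variance)
}

function flakyCandidates (failures) {
  // failures: { testName: [jobNumber, ...] } -- assumed shape
  return Object.entries(failures)
    .filter(([, jobs]) => stdev(jobs) > 1)
    .map(([name]) => name)
}

// A test failing in consecutive runs 150-152 is filtered out (stdev ~0.82);
// one failing sporadically in runs 150, 190 and 230 is kept (stdev ~32.7).
console.log(flakyCandidates({
  'broken for a while': [150, 151, 152],
  'sporadic timeout': [150, 190, 230]
})) // prints [ 'sporadic timeout' ]
```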

The rows are ordered by the total number of failures for the given error / test.

This is the chart for js-ipfs/master for runs 150 (early July) to 237 (now):

[screenshot 2018-09-06: js-ipfs - master failure chart]

Chart: https://gateway.ipfs.io/ipfs/QmdkRvAuDbktNJzEF3W6Sm9GuqRcsTFKMGCExZ3bYUuyoA/

Raw Data: https://ipfs.io/ipfs/QmdkRvAuDbktNJzEF3W6Sm9GuqRcsTFKMGCExZ3bYUuyoA/js-ipfs.master.150.237.json

The following failures seem the most likely candidates for being flaky tests.

Some of these will only fail on certain platforms / versions of Node.js, though.

Raw Data: https://ipfs.io/ipfs/QmdkRvAuDbktNJzEF3W6Sm9GuqRcsTFKMGCExZ3bYUuyoA/js-ipfs.master.single.150.237.json

I'm going to take this information and try to pull out some of the tests that I think we should apply retry logic to. I will also push up the tools I used for generating this information so that we can use it for other projects.

alanshaw commented 6 years ago

This is so awesome, really great to know where to focus our energy.

victorb commented 6 years ago

Agree with Alan, awesome work @travisperson!

We should be able to separate test failures caused by timeouts (which I think many of these are) from "normal" test failures and "exceptional" test failures. A normal test failure would be a test case with an assertion that fails; an exceptional test failure would be something like `yarn install` not finishing because of a 404 response, i.e. outside of the test suites themselves. I'll make sure the pipelines can handle it (go-ipfs already does this); then it'll be a bit easier to show in the table.
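The three-way split described above could be sketched as a simple classifier over the failure message. The message patterns here are assumptions for illustration (mocha's timeout message and chai-style assertion errors), not the actual pipeline logic:

```javascript
// Hypothetical sketch: classify a failure message as 'timeout',
// 'normal' (assertion failure), or 'exceptional' (infra problem).
function classifyFailure (message) {
  if (/timeout of \d+ms exceeded/i.test(message)) return 'timeout'
  if (/AssertionError|expected .* to /i.test(message)) return 'normal'
  return 'exceptional' // e.g. a 404 during `yarn install`
}

console.log(classifyFailure('Error: Timeout of 2000ms exceeded.'))      // timeout
console.log(classifyFailure('AssertionError: expected 1 to equal 2'))   // normal
console.log(classifyFailure('error An unexpected error occurred: 404')) // exceptional
```

Once each failure carries one of these labels, the chart could color or group cells by category.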

It'll be very useful to show the output of the failure when hovering/clicking on a cell in the table, so we could see directly what's going wrong.

@travisperson can you publish the code for generating this somewhere?

travisperson commented 6 years ago

Published https://github.com/travisperson/jenkins-flake-report

> Are these all timeout errors?

I think most of them are.

> We should be able to separate test failures because of timeouts

Ya, the data I'm getting from Jenkins has some information we can test for to see if it's a timeout.

> It'll be very useful to show the output of the failure when hovering/clicking on a cell in the table, so we could see directly what's going wrong.

Ya, I think that would be great. Currently the script is just a Go HTML template, but we could at least use a `title` attribute or something to quickly view some information, and extend it further later to be more detailed. I originally wrote it as a simple React table but didn't want to deal with all the dependencies, so I converted it to plain HTML.
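The `title`-attribute idea amounts to embedding the (escaped) failure output in each cell so it shows as a browser tooltip on hover. A minimal sketch, with hypothetical helper names and shown in JavaScript rather than the Go template the tool actually uses:

```javascript
// Hypothetical sketch: render a table cell whose `title` attribute
// carries the failure output, so hovering reveals it as a tooltip.
function escapeHtml (s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;')
          .replace(/>/g, '&gt;').replace(/"/g, '&quot;')
}

function failureCell (count, output) {
  return `<td title="${escapeHtml(output)}">${count}</td>`
}

console.log(failureCell(3, 'Error: timeout of 2000ms exceeded'))
// prints <td title="Error: timeout of 2000ms exceeded">3</td>
```

A click-to-expand view would need a bit of JS on the page, but the `title` attribute needs none, which fits the plain-HTML approach.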