nodejs / build

Better build and test infra for Node.

Automating Flaky Test Detection #3575

Open siddhantdange opened 10 months ago

siddhantdange commented 10 months ago

Flaky tests unnecessarily waste developer and CI time, impact product stability, and reduce confidence in tests. It seems that the https://github.com/nodejs/node repository suffers from flaky tests, as shown by the context below.

Context:

BuildPulse is a platform for detecting flaky tests, reporting on their impact, mitigating that impact, and ultimately fixing flaky tests. We rely on open source ourselves, so in an effort to give back, we offer BuildPulse to open-source projects for free (up to a certain volume of test results per day).

I'd like to propose using BuildPulse to replace the current flaky test workflow, which seems to be through https://github.com/nodejs/reliability. We can replace this repository by making it easy to search for known flaky tests and build context (links to test flakiness occurrences, error logs, commits, etc).

We can also enhance your current workflow through ncu-ci integrations, GitHub Issues integrations, pull request comments, reports, notifications, badges, and more.

I'm willing to do the integration myself, which largely consists of 2 steps:

  1. Configuring the test framework to export test results to JUnit XML
  2. Uploading results to BuildPulse using our buildpulse-action (for GitHub Actions) or test-reporter (for Jenkins)
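To illustrate step 1: conceptually, the conversion wraps each TAP result line in a JUnit `<testcase>` element. Below is a minimal, illustrative Python sketch of that idea only; in Node's CI the real conversion is done by the `tap2junit` tool mentioned later in this thread, which handles far more of the TAP format.

```python
# Illustrative sketch of TAP -> JUnit XML conversion (not tap2junit itself).
import re
import xml.etree.ElementTree as ET

def tap_to_junit(tap_text, suite_name="node-tests"):
    """Convert simple 'ok N - name' / 'not ok N - name' TAP lines to JUnit XML."""
    tests = re.findall(r"^(ok|not ok)\s+\d+\s+-?\s*(.+)$", tap_text, re.M)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(tests)),
        failures=str(sum(1 for status, _ in tests if status == "not ok")),
    )
    for status, name in tests:
        case = ET.SubElement(suite, "testcase", name=name.strip())
        if status == "not ok":
            ET.SubElement(case, "failure")  # failing TAP line -> <failure/> child
    return ET.tostring(suite, encoding="unicode")
```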

If the integration is helpful, BuildPulse would appreciate a testimonial down the road :)

Happy to answer any questions and interested in what next steps might look like!

benjamingr commented 10 months ago

We track flaky tests in https://github.com/nodejs/reliability/

tniessen commented 10 months ago

FWIW, I believe that detecting flaky tests hasn't been a big issue. "setting them as flaky to make our CI green again", as the referenced tweet put it, is the easy part. Finding people who are willing to fix them, on the other hand, has been rather difficult.

Also, this was merged only recently: https://github.com/nodejs/build/pull/3056

anonrig commented 10 months ago

We track flaky tests in https://github.com/nodejs/reliability/

IMHO, GitHub issues are not a good way of tracking issues that persist across different days. For example, I'd be super interested in learning how long it takes for a flaky test to appear, or to be fixed, once it is introduced into the codebase. I don't know whether the proposed service supports that, but I believe there is no harm in trying a new solution. Maybe a new product might create the dopamine effect we need to motivate fixing the issues.

mcollina commented 10 months ago

I would not replace the current infrastructure with a 3rd party system straight away, but I'm not opposed to add it in a non-destructive way, evaluate it, and see how it reacts.

I'm not sure having that data available would give us anything more useful than our current system, but I'd be happy to be proven wrong.

Something important to consider is that sending data to the 3rd party must be disabled during security releases while CI is locked down.

joyeecheung commented 10 months ago
  1. Configuring the test framework to export test results to JUnit XML

I think for Jenkins this does not need additional work; at the end of every test run we already run `tap2junit -i test.tap -o out/junit/test.xml`.

Upload result to BuildPulse using our buildpulse-action (for GitHub Actions) or test-reporter (Jenkins)

I think that can be done in parallel to the reliability report generation, which is done by a daily job crawling data from Jenkins. I agree with @tniessen that the difficult part is actually fixing the flakes. We already have more than enough open issues about flaky tests that contain whatever data a service like BuildPulse can collect, but from what I've observed, that doesn't help much in terms of actually getting them fixed. Some of the flaky test issues in the core repo are years old and still reproduce from time to time, whether or not the tests are marked flaky, and we never know if a flake is caused by a real bug or by platform issues, because we just don't have enough volunteers to actually look into them.

I think when marking tests as flaky, we need to actually spend some brain energy figuring out why a test is flaky, why an immediate solution is difficult to come up with, and whether the impact is small enough to justify putting on this band-aid for now. We also need to check that it is not actually a regression (e.g. https://github.com/nodejs/node/issues/49852); IMO the best response to a potential regression is to revert ASAP and only reland once the flake is addressed. That part probably can't be accomplished by automation.

joyeecheung commented 10 months ago

Also, I think what would be really useful is automation that bisects regressions for newly appeared flakes (ideally on affected platforms only) to find the commit that introduced a flake. (Maybe old ones too; they can be bisected just fine.) Even better if it could somehow associate flakes with infra updates, in case a flake is caused by one. This would be very helpful in addressing flakes like https://github.com/nodejs/node/issues/49852 (for that one it took some human intelligence to find the suspicious commit, and even then we did not verify it via bisect and just got a lucky guess). I worked with similar automation in V8 before and found it amazing. I am not sure if BuildPulse supports that, though. If it can only collect information based on past test failures but cannot do actual bisects, that's still not much of an improvement over what we already have.
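As a sketch of what such bisection automation could look like: `git bisect run` accepts any command whose exit status classifies a commit (0 = good, 1 = bad), so a small repeat-runner that executes the flaky test many times per commit is enough to drive it. Everything below (the script name, run count, and test command) is hypothetical.

```python
# flaky_check.py -- hypothetical repeat-runner for `git bisect run`.
# Exits 1 if the given test command fails in any of `runs` attempts
# (flake reproduces, commit is "bad"), 0 otherwise ("good").
import subprocess
import sys

def retry_check(cmd, runs=30):
    """Run cmd up to `runs` times; return 1 on the first failure, else 0.

    `runs` should be tuned to the flake's observed failure rate so a
    flaky commit is unlikely to pass every repetition by chance.
    """
    for _ in range(runs):
        if subprocess.run(cmd).returncode != 0:
            return 1
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   git bisect start <bad-sha> <good-sha>
    #   git bisect run python flaky_check.py <test command...>
    sys.exit(retry_check(sys.argv[1:]))
```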

siddhantdange commented 10 months ago

Hi everyone - read through the comments so far, and would like to address some concerns.

I agree that detection on its own is not sufficient to solve the underlying problem; your GitHub issues already track a subset of known flaky tests. So what additional data can we provide?

We can show, among other things, the following (these directly address the concerns identified above):

These metrics, alerting, and PR comments make sure flaky tests are tracked, prioritized correctly, and fixed.

BuildPulse determines that a test is flaky when we see both a passing and a failing result for that test even though no code has changed; we determine whether the code has changed by comparing git tree SHAs. Because of this continuous monitoring, we can identify flaky tests as they come up.
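The rule described above is simple enough to sketch. This is an illustrative reimplementation of the stated heuristic, not BuildPulse's actual code:

```python
# Flag a test as flaky if the same git tree SHA has produced both a
# passing and a failing result for it (outcome changed, code did not).
from collections import defaultdict

def flaky_tests(results):
    """results: iterable of (test_name, tree_sha, passed) tuples."""
    outcomes = defaultdict(set)
    for name, sha, passed in results:
        outcomes[(name, sha)].add(bool(passed))
    # Both True and False observed for the same (test, tree) pair -> flaky.
    return {name for (name, _sha), seen in outcomes.items() if len(seen) == 2}
```

Note that a test whose outcome flips only across *different* tree SHAs is not flagged, since that change could be a genuine regression or fix rather than flakiness.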

mhdawson commented 9 months ago

I would not replace the current infrastructure with a 3rd party system straight away, but I'm not opposed to add it in a non-destructive way, evaluate it, and see how it reacts.

+1 from me on that approach.

I think the main question is whether we have somebody from Build, or somebody with Jenkins access, who will volunteer to help with the effort, as my guess is that we'll need to update build jobs to send the data.

joyeecheung commented 9 months ago

So far I think no one objects to trying it out in addition to what we already have. So we just need volunteers to integrate it with Jenkins and/or GitHub Actions and see how it goes.

shnooshnoo commented 8 months ago

Hey all, I'd like to improve the Windows builds in the Node.js CI and fix flaky tests where possible, and I'm really curious how BuildPulse can help here. So I was just wondering: what's the current status of this?

shnooshnoo commented 7 months ago

@siddhantdange what would we need to have some kind of MVP of this? Anything I can help you with here?

siddhantdange commented 7 months ago

@shnooshnoo 3 steps!

  1. Sign up to generate keys
  2. Configure the tests to export a JUnit XML file
  3. Use our GitHub Action or binary to send results to BuildPulse using the keys from Step 1
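For step 3 on GitHub Actions, the upload step would look roughly like the fragment below. This is a hypothetical sketch: the action reference, input names, and secret names are placeholders that should be checked against BuildPulse's onboarding docs.

```yaml
# Hypothetical workflow step -- inputs and secret names are placeholders.
- name: Upload test results to BuildPulse
  if: ${{ always() }}               # upload results even when tests failed
  uses: buildpulse/buildpulse-action@main
  with:
    account: <github-account-id>    # numeric ID of the org/owner
    repository: <repository-id>    # numeric ID of the repository
    path: out/junit                # directory of JUnit XML files (step 2)
    key: ${{ secrets.BUILDPULSE_ACCESS_KEY_ID }}
    secret: ${{ secrets.BUILDPULSE_SECRET_ACCESS_KEY }}
```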

Additionally, our onboarding docs can be found here.

Happy to help here as well, if you can point me to the correct workflow!

StefanStojanovic commented 7 months ago

FYI @nodejs/build I will meet with @siddhantdange on Tuesday, February 6 at 16:00 CET. Anyone who wants is welcome to join. I'll post a meeting summary here afterward.

StefanStojanovic commented 7 months ago

A quick summary after the meeting:

StefanStojanovic commented 5 months ago

@nodejs/build a PR enabling BuildPulse integration on a subset of the Windows machines, in PRs and daily builds, is open here.