
[rush] Flaky Test Quarantine #4465

Open elliot-nelson opened 8 months ago

elliot-nelson commented 8 months ago

Summary

As a Rush monorepo maintainer, I wish I could offer developers a comprehensive strategy for tracking and addressing flaky tests across all projects.

This could be a feature coded in Rush, or just a strategy documented as part of Rush Stack that companies could implement individually... either is better than nothing!

Details

I'm not married to any particular implementation here, but ideally the strategy would work across unit tests (written in Jest) and integration tests (the kind you'd write with Cypress or Playwright). This suggests to me that it's not a "Jest feature", but something at a higher level.

Determining when a test is flaky

A comprehensive strategy must be able to determine flakiness without human intervention. Here's a possible approach:

The idea is to build up a list of "known flaky tests": if, during the test phase of a project, you manage to get a specific unit test to both pass and fail on the same compiled code, you've proven it is flaky. If a test is proven flaky enough times (e.g. on N builds, where N = 1, N = 10, or some number in between), it is put in quarantine.
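For illustration, here's a rough sketch of what that detection step could look like as a standalone script. This is purely hypothetical, not an existing Rush feature -- the file name, the run count, and shelling out to the Jest CLI are all assumptions:

```ts
// flaky-detector.ts -- hypothetical sketch, not an existing Rush feature
import { spawnSync } from 'child_process';

interface IFlakinessVerdict {
  testFile: string;
  passes: number;
  failures: number;
  isFlaky: boolean;
}

// Re-run a single test file repeatedly against the SAME compiled code.
// If it both passes and fails across those runs, flakiness is proven.
function detectFlakiness(testFile: string, maxRuns: number = 10): IFlakinessVerdict {
  let passes: number = 0;
  let failures: number = 0;
  for (let i = 0; i < maxRuns; i++) {
    const run = spawnSync('npx', ['jest', testFile, '--ci'], { encoding: 'utf8' });
    if (run.status === 0) {
      passes++;
    } else {
      failures++;
    }
    // Stop early once flakiness is proven; further runs add no information.
    if (passes > 0 && failures > 0) {
      break;
    }
  }
  return { testFile, passes, failures, isFlaky: passes > 0 && failures > 0 };
}
```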

Putting tests in quarantine

Once a test has been determined to be flaky, one approach for dealing with the situation is to put it in quarantine. "Quarantined" tests are tests that CI still runs, but their failures do not fail the build.
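A minimal sketch of how a CI "quarantine gate" could enforce this, assuming a hypothetical quarantine.json file containing an array of full test names (Jest's --json/--outputFile flags are real; everything else here is made up):

```ts
// quarantine-gate.ts -- hypothetical sketch of a CI wrapper around Jest
import { spawnSync } from 'child_process';
import * as fs from 'fs';

// Assumed format: quarantine.json is an array of quarantined full test names.
const quarantined: Set<string> = new Set<string>(
  JSON.parse(fs.readFileSync('quarantine.json', 'utf8'))
);

// Run Jest and capture machine-readable results.
spawnSync('npx', ['jest', '--ci', '--json', '--outputFile=jest-results.json'], {
  stdio: 'inherit'
});

const results = JSON.parse(fs.readFileSync('jest-results.json', 'utf8'));

// Only failures that are NOT quarantined should fail the build.
const realFailures: string[] = [];
for (const suite of results.testResults) {
  for (const assertion of suite.assertionResults) {
    if (assertion.status === 'failed' && !quarantined.has(assertion.fullName)) {
      realFailures.push(assertion.fullName);
    }
  }
}

if (realFailures.length > 0) {
  console.error('Non-quarantined test failures:\n' + realFailures.join('\n'));
  process.exit(1);
}
console.log('Quarantined failures (if any) were reported but did not fail the build.');
```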

A quarantined test is a big deal for a development team, as (even though it's not deleted) it no longer provides a quality gate. The list of currently quarantined tests could be kept highly visible, for example by using Danger to present it in a comment on each PR.
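For the visibility piece, a dangerfile could surface the list on every PR. Danger's warn() and markdown() helpers are real APIs; the quarantine.json file is the same assumed format as above:

```ts
// dangerfile.ts -- hypothetical sketch using Danger's warn()/markdown() helpers
import { warn, markdown } from 'danger';
import * as fs from 'fs';

const quarantined: string[] = JSON.parse(fs.readFileSync('quarantine.json', 'utf8'));

if (quarantined.length > 0) {
  warn(`This repo currently has ${quarantined.length} quarantined (flaky) test(s).`);
  markdown(
    '### Quarantined tests\n' +
      quarantined.map((name) => `- \`${name}\``).join('\n')
  );
}
```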

Escaping from quarantine

For a unit test in quarantine to "escape", it must succeed multiple times in a row. You could run a periodic job on the main branch to achieve this, although that offers no benefit to a developer actively trying to fix a quarantined test.

From the perspective of a developer tasked with fixing a unit test, an ideal approach would be similar to "Determining when a test is flaky" above, but in reverse: run the quarantined test repeatedly against the same compiled code, and treat it as "fixed" once it passes every time.

The "fixed" state here could vary depending on context -- if a "fixed" test arrives in main, we can remove it from quarantine. If it's a PR build, perhaps a helpful message in a PR comment is more appropriate.

Quarantine implementation

The implementation of a test quarantine is its own topic. I believe such a thing must be tracked outside the main git history of the monorepo to be successful, but that implies deciding exactly how quarantined tests are stored (does the database track which branch each test was detected in, does it do commit analysis to determine where a fix was introduced, etc.).
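To make the storage question concrete, here's one hypothetical shape for an externally-stored quarantine record. Every field name below is an assumption, intended only to illustrate what such a database might need to track:

```ts
// Hypothetical schema for a quarantine database stored outside the monorepo's git history
interface IQuarantinedTest {
  fullTestName: string;      // e.g. Jest "describe > it" path, or an e2e test id
  projectName: string;       // which Rush project the test belongs to
  detectedInBranch: string;  // branch where the pass+fail evidence was observed
  detectedAtCommit: string;  // SHA of the compiled code that both passed and failed
  firstDetected: string;     // ISO 8601 timestamp
  flakyBuildCount: number;   // how many builds have independently proven it flaky
  consecutivePasses: number; // progress toward escaping quarantine
}

interface IQuarantineDatabase {
  schemaVersion: 1;
  tests: IQuarantinedTest[];
}
```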

Standard questions

Please answer these questions to help us investigate your issue more quickly:

| Question | Answer |
| --- | --- |
| `@microsoft/rush` globally installed version? | |
| `rushVersion` from `rush.json`? | |
| `useWorkspaces` from `rush.json`? | |
| Operating system? | |
| Would you consider contributing a PR? | Maybe! |
| Node.js version (`node -v`)? | |
octogonz commented 8 months ago

In the old days, Microsoft Office had an end-to-end test automation system called Big Button (BB) that performed this sort of analysis. It distinguished between tests that were only invoked manually and tests that blocked merging of a branch (so-called Branch Validation Tests, or BVTs). As I remember it, in order to enable a test as a BVT, the test had to prove its stability by completing 500 consecutive runs without any failures. The system would also automatically remove a BVT if it was detected to be "flaky", based on criteria such as failing a certain number of times in the main branch.

> if, during the test phase of a project, you manage to get a specific unit test to both pass and fail on the same compiled code, you've proven it is flaky

IIRC the BVT flakiness detection did not consider failures in a feature branch, only in the main branch. The rationale is that half-baked source code can cause nondeterministic behavior that is not the fault of the test.

These same flakiness principles probably apply to all kinds of tests; however, it's unclear whether a single implementation can handle both unit tests and non-unit tests. (For this topic, the typical non-unit tests would be integration tests, end-to-end tests, and screen diff tests.) While Jest tests are always invoked by Rush and/or Heft, the launching of non-unit tests seems to vary widely across monorepos, and even across projects within a single monorepo; I've seen a variety of approaches.

Rather than trying to design a universal framework, would it make sense to start by solving the problem for Jest only?
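If starting with Jest only, one possible building block is jest.retryTimes() (a real jest-circus API, and jest-circus has been Jest's default runner since Jest 27): a test that ultimately passes but has retry errors recorded necessarily both failed and passed on the same code. A custom reporter could flag those tests; this sketch assumes retryReasons is populated on assertion results when retries are enabled, which is my reading of the Jest docs:

```ts
// flaky-reporter.ts -- hypothetical sketch of a custom Jest reporter
import type { TestResult } from '@jest/test-result';

export default class FlakyReporter {
  // Jest calls this hook after each test file completes.
  public onTestResult(_test: unknown, testResult: TestResult): void {
    for (const assertion of testResult.testResults) {
      // A passing test with recorded retryReasons failed at least once
      // against the same compiled code, i.e. it is provably flaky.
      const retries: string[] = assertion.retryReasons ?? [];
      if (assertion.status === 'passed' && retries.length > 0) {
        console.warn(`FLAKY: ${assertion.fullName} (${retries.length} failed attempt(s))`);
      }
    }
  }
}
```

Wiring it up would presumably mean adding the compiled file to the `reporters` array in the Jest config and calling `jest.retryTimes(2)` in a setup file; since Heft drives Jest through an ordinary Jest configuration, something like this could plausibly slot into the existing Rush/Heft pipeline.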