This repo is used for tracking flaky tests on the Node.js CI and fixing them.
Current status: work in progress. Please go to the issue tracker to discuss!
Updates should be merged as soon as possible. We can revert or modify afterwards. This repo is mostly for coordination so we need to move fast and reduce the noise.
Make the CI green again.
Taking the last 100 runs, the green rate at any given time is calculated as follows:
SUCCESS / (100 - RUNNING - ABORTED)
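As a minimal sketch of the formula above: the function name `green_rate` and its parameters are hypothetical, chosen only for illustration, and the counts in the example are made up. Running and aborted builds are excluded from the denominator, so only finished runs count.

```python
def green_rate(success: int, running: int, aborted: int, total: int = 100) -> float:
    """Return SUCCESS / (total - RUNNING - ABORTED) as a fraction between 0 and 1."""
    finished = total - running - aborted
    if finished <= 0:
        # Every run is still running or was aborted, so the rate is undefined.
        raise ValueError("no finished runs to compute a green rate from")
    return success / finished


# Made-up counts: 60 successes, 5 still running, 15 aborted.
print(f"{green_rate(60, 5, 15):.2%}")  # 60 / 80 -> 75.00%
```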
A GitHub workflow runs every day to produce reliability reports for the node-test-pull-request CI and post them to the issue tracker.
Most work starts with opening the issue tracker of this repository and reading the latest report. If the report is missing, see the actions page for details. GitHub's API restricts the length of issue messages, so when the report is too long the workflow can fail to post the issue, but it should still leave a summary on the actions page.
Check out the JSTest Failure section of the latest reliability report. It contains information about the JS tests that failed in more than one pull request in the last 100 node-test-pull-request CI runs. The more pull requests a test fails in, the higher it is ranked, and the more likely it is to be a flake.
To look for a possible cause, browse the commits that landed since a given date at https://github.com/nodejs/node/commits?since=YYYY-MM-DD
and see if there is any pull request that looks related. If one or more related pull requests can be found, ping the author or the reviewer of the pull request, or the team in charge of the related subsystem, in the tracking issue or in private to see if they can come up with a fix to just deflake the test.
If the test has been flaky for more than a month and no one is actively working on it, it is unlikely to go away on its own, and it's time to mark it as flaky. For example, if parallel/some-flaky-test.js has been flaky on Windows in the CI, after making sure that there is an issue tracking it, open a pull request to add the following entry to test/parallel/parallel.status:
[$system==win32]
# https://github.com/nodejs/node/issues/<TRACKING_ISSUE_ID>
some-flaky-test: PASS,FLAKY
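To avoid typos when preparing such a pull request, the entry above could be rendered by a small helper. This is only a sketch: `status_entry` is a hypothetical name, not a tool that exists in the Node.js repository, and the issue ID is left as the same placeholder used above.

```python
def status_entry(system: str, issue_url: str, test_name: str) -> str:
    """Render a status-file entry marking test_name as flaky on the given system."""
    return (
        f"[$system=={system}]\n"
        f"# {issue_url}\n"
        f"{test_name}: PASS,FLAKY\n"
    )


# Reproduces the example entry; the issue ID stays a placeholder.
print(
    status_entry(
        "win32",
        "https://github.com/nodejs/node/issues/<TRACKING_ISSUE_ID>",
        "some-flaky-test",
    ),
    end="",
)
```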
In the reliability reports, Jenkins Failure, Git Failure and Build Failure are generally infrastructure issues and can be handled by the nodejs/build team. Typical infrastructure issues include:
Sometimes infrastructure issues can show up in the tests too. For example, tests can fail with ENOSPC (No space left on device), in which case the machine needs to be cleaned up to release disk space.
Some infrastructure issues can go away on their own, but if the same kind of infrastructure issue has been failing multiple pull requests and persists for more than a day, it's time to take action.
Check out the Node.js build issue tracker to see if there is any open issue about this. If there isn't, open a new issue about it or ask around in the #nodejs-build channel in the OpenJS Slack.
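The rule of thumb above (more than one pull request affected, and the problem lasting more than a day) can be sketched as a small check. The function name `should_escalate` and the shape of its input are assumptions made for illustration; the reliability reports do not necessarily expose the data in this form, and the timestamps below are made up.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def should_escalate(failures: list[tuple[int, datetime]]) -> bool:
    """Decide whether one kind of infrastructure issue warrants action.

    failures holds (pull request number, failure time) pairs, all for the
    same kind of infrastructure issue.
    """
    if not failures:
        return False
    pull_requests = {pr for pr, _ in failures}
    times = [when for _, when in failures]
    # Multiple pull requests affected AND the issue spans more than a day.
    return len(pull_requests) > 1 and max(times) - min(times) > timedelta(days=1)


t0 = datetime(2024, 1, 1, 12, 0)  # made-up timestamp
print(should_escalate([(101, t0), (102, t0 + timedelta(hours=30))]))  # True
print(should_escalate([(101, t0), (101, t0 + timedelta(hours=30))]))  # False: one pull request only
```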
When reporting infrastructure issues, it's important to include information about the particular machines where the issues happen. On the Jenkins job page of the failed CI build where the infrastructure issue is reported in the logs (not to be confused with the parent build that triggers the sub build that has the issues), there is normally a line in the top-right corner similar to
Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1
In this case, test-equinix-ubuntu2004_container-armv7l-1 is the machine having infrastructure issues, and it's important to include this information in the report.
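If the text of the job page is already at hand, the machine name can be pulled out with a regular expression. This is a sketch that assumes the line always has the shape "Took <duration> on <machine>", as in the example above; the actual wording on a Jenkins page may differ, and `machine_from_took_line` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed shape: "Took <duration> on <machine>", with no spaces in the machine name.
TOOK_LINE = re.compile(r"Took\s+.+?\s+on\s+(?P<machine>\S+)")


def machine_from_took_line(text: str) -> Optional[str]:
    """Return the machine name from a 'Took ... on ...' line, or None if absent."""
    match = TOOK_LINE.search(text)
    return match.group("machine") if match else None


print(machine_from_took_line("Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1"))
# test-equinix-ubuntu2004_container-armv7l-1
```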