opensearch-project / opensearch-build

🧰 OpenSearch / OpenSearch-Dashboards Build Systems
Apache License 2.0

[FEATURE] Introduce commit queue on Jenkins (`main` branch only) to proactively spot flaky tests #4810

Open · reta opened this issue 1 month ago

reta commented 1 month ago

Is your feature request related to a problem? Please describe

Flaky tests are a painful problem with no solution in sight (besides just fixing all of them). At least the idea we worked out together with @andrross and @dblock is to spot them proactively and revert the changes that introduce more of them.

Describe the solution you'd like

Here is what we would like to try out:

  • baseline the current list of known flaky tests (those that are marked for automatic retries could be skipped, since they don't fail the check), which we could take from GitHub issues
  • for every commit on the main branch, do 10 gradle check runs (for now); if that takes too long, consider splitting them into gradle internalClusterTest && gradle check -x internalClusterTest
  • if any of the runs fails with new tests, link the failure to the offending pull request
  • mark the offending pull request for potential revert

Describe alternatives you've considered

N/A

Additional context

Also discussed today at the 2.15 retro call.

reta commented 1 month ago

@peterzhuamazon FYI, would be great to have it implemented, ready to help as well

peterzhuamazon commented 1 month ago

Adding an initial change to switch gradle check to run every 2 hours and increase concurrent runs from 20 to 30.

Will need to monitor it tomorrow and see if it is stable.
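For reference, a periodic trigger of this kind in a declarative Jenkinsfile might look roughly like the sketch below. This is only an illustration, not the actual gradle-check job definition in opensearch-build; the agent label is a placeholder, and the concurrency bump from 20 to 30 would live in the agent/cloud configuration rather than in the pipeline itself.

```groovy
// Minimal sketch of a scheduled gradle-check pipeline (illustrative only).
pipeline {
    agent none
    triggers {
        // Run roughly every 2 hours; 'H' lets Jenkins spread the start minute.
        cron('H */2 * * *')
    }
    stages {
        stage('gradle-check') {
            agent { label 'gradle-check-agent' } // placeholder label
            steps {
                sh './gradlew check --no-daemon'
            }
        }
    }
}
```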

Thanks.

peterzhuamazon commented 1 month ago

Would need some input from @prudhvigodithi on how to link the gradle check failure metrics to the PR of a specific commit once it is merged. I think we can currently use the autocut issues created for gradle check failures and link them to specific PRs if certain failures cross a threshold. We can then gather this PR information and display it on a dashboard.

Also need some help from @reta on how to run specific tests given an autocut issue, so that I can implement the new parameters in the Jenkins lib and allow the gradle check workflow to run in full vs. partial mode. If the partial mode is quick enough, we can increase the number of runs.
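A rough sketch of how such a full-vs-partial switch could be exposed as pipeline parameters is below; the parameter names and the way test classes are filtered are assumptions for illustration, not the actual jenkins lib implementation.

```groovy
// Hypothetical full/partial switch for the gradle check workflow (names are illustrative).
pipeline {
    agent { label 'gradle-check-agent' } // placeholder label
    parameters {
        choice(name: 'TEST_MODE', choices: ['full', 'partial'],
               description: 'full = entire gradle check, partial = only tests taken from an autocut issue')
        string(name: 'TEST_FILTER', defaultValue: '',
               description: 'Comma-separated test classes extracted from the autocut issue (partial mode)')
    }
    stages {
        stage('gradle-check') {
            steps {
                script {
                    if (params.TEST_MODE == 'partial' && params.TEST_FILTER) {
                        // Run only the named test classes instead of the full check.
                        def filters = params.TEST_FILTER.split(',').collect { "--tests ${it.trim()}" }.join(' ')
                        sh "./gradlew test ${filters}"
                    } else {
                        sh './gradlew check'
                    }
                }
            }
        }
    }
}
```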

Thanks.

prudhvigodithi commented 1 month ago

Hey @reta and @peterzhuamazon

  • baseline the current list of known flaky tests (those that are marked for automatic retries could be skipped, since they don't fail the check), which we could take from GitHub issues

For this we have the GitHub issues that are created and marked as flaky tests (failing for a commit that was merged). Irrespective of whether they pass upon retry or not, they are flagged as failures and an issue is created.

Also, regarding "those that are marked for automatic retries could be skipped, since they don't fail the check": are we sure they actually pass upon continuous retries?

  • for every commit on the main branch, do 10 gradle check runs (for now); if that takes too long, consider splitting them into gradle internalClusterTest && gradle check -x internalClusterTest

This is possible: take the HEAD commit, run the checks periodically, and find the tests that fail on the main branch.

  • if any of the runs fails with new tests, link the failure to the offending pull request

Coming to this: let's say the periodic runs find a flaky failure on a commit, and we find the PR for that commit. We still can't flag that PR as offending, because the test failure could have been caused by some commit far back in history that only now failed randomly (flaky) on this PR's commit. There may or may not be a correlation between a failed test and the PR's changes.

  • mark the offending pull request for potential revert

Yes, once we find the right commit that introduced the flaky test, then it's easy to raise a revert PR or create an issue to revert the offending commit.

From the issue title, is all we want to find the flaky tests? If so, we already have the automation to find the flaky tests and create/update the issues. Or are we looking for a mechanism to find the exact PR that introduced the flakiness?

Thank you @getsaurabh02 @dblock @andrross

peterzhuamazon commented 1 month ago

If so, we already have the automation to find the flaky tests and create/update the issues. Or are we looking for a mechanism to find the exact PR that introduced the flakiness?

I think that, starting from the baseline, if we find a flaky test linked to a specific PR, we can mark it for revert. The point is indeed to try to find the PRs that cause specific flaky tests from this point on, so that no new flakiness gets introduced while we fix the old ones.

Thanks.

prudhvigodithi commented 1 month ago

Also, what we can think of is: if a test fails once within the last X runs, we can mark it as unstable and flaky. A test is considered stable if it passes N times in a row. Ignoring flaky tests (those marked as unstable and flaky) during gradle check helps reduce the impact on PRs, though it can affect reliability, since the functionality covered by a flaky test remains untested while it is classified as flaky. We can make fixing such a test an entry criterion for the upcoming release: once the test is fixed, it passes the release entry criteria and the release moves forward.

With this mechanism we force the flaky tests to be fixed. Also, for whatever code is merged in the future, we need to make sure the commit passes N times in a row (after it is merged); if it fails once within the last X runs, we plan for a revert. Having this, we can ensure the main branch only has tests that have passed N times in a row.
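As a rough illustration of that rule, the classification could look like the Groovy sketch below; the thresholds X and N and the shape of the run history are assumptions.

```groovy
// Sketch of the stability rule described above (thresholds and data shape are assumptions).
// history: most-recent-first list of booleans, true = the test passed in that gradle check run.
String classify(List<Boolean> history, int lastXRuns = 10, int stableStreakN = 20) {
    if (history.take(lastXRuns).contains(false)) {
        return 'UNSTABLE_FLAKY'               // failed at least once within the last X runs
    }
    def streak = history.takeWhile { it }     // consecutive passes starting from the most recent run
    return streak.size() >= stableStreakN ? 'STABLE' : 'UNKNOWN'
}

assert classify([true, true, false, true]) == 'UNSTABLE_FLAKY'
assert classify((1..25).collect { true }) == 'STABLE'
```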

reta commented 1 month ago

Thanks @prudhvigodithi

Coming to this: let's say the periodic runs find a flaky failure on a commit, and we find the PR for that commit. We still can't flag that PR as offending, because the test failure could have been caused by some commit far back in history that only now failed randomly (flaky) on this PR's commit. There may or may not be a correlation between a failed test and the PR's changes.

This is correct; given where we are now, false positives are possible, and it will take some time before we gain any confidence in the potential offending changes here.

Also, regarding "those that are marked for automatic retries could be skipped, since they don't fail the check": are we sure they actually pass upon continuous retries?

The way it works: some tests are marked as "flaky" in the Gradle build and will be retried by the Gradle test-retry plugin. The retried tests do not fail the build (but mark it as UNSTABLE).
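For context, this retry behaviour is typically wired up with the Gradle test-retry plugin, roughly as in the generic sketch below; this is not OpenSearch's exact build configuration, and the version and counts are placeholders.

```groovy
// Generic test-retry configuration (illustrative; version and counts are placeholders).
plugins {
    id 'java'
    id 'org.gradle.test-retry' version '1.5.0'
}

test {
    retry {
        maxRetries = 3                    // re-run a failed test up to 3 times
        maxFailures = 10                  // stop retrying if too many tests fail (likely a real breakage)
        failOnPassedAfterRetry = false    // a pass-on-retry does not fail the build, it only flags it
    }
}
```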

From the issue title, is all we want to find the flaky tests? If so, we already have the automation to find the flaky tests and create/update the issues. Or are we looking for a mechanism to find the exact PR that introduced the flakiness?

"No" to first and "Yes" to second - we want to find the offender

Also, what we can think of is: if a test fails once within the last X runs, we can mark it as unstable and flaky.

This is what @andrross diligently does now: the tests are moved from "flaky" to "unstable" manually, in an ad-hoc fashion. I would prefer to keep it this way for now; otherwise we would be moving tests between buckets without any fixes in sight.

peterzhuamazon commented 1 month ago

I think one thing we can do now to quickly establish a baseline in a hybrid fashion is:

  1. Either comment out the flaky tests based on the autocut issues.
  2. Or mark those flaky tests to use the retry plugin.

At least once we have the baseline, we can decide that any new flaky test introduced after the baseline has to be either blocked or reverted.

And in the meantime we can gather resources to fix those flaky tests if possible.

prudhvigodithi commented 1 month ago

Yeah, today we already track the known Flaky Tests as GitHub issues.

  • We can first save the existing tests in a list.
  • The revert will not happen if the failing test is already part of the list.
  • We can also keep updating the list with new failures based on our call, or remove an entry from the list once the test is fixed.
  • Now, only for the tests that we saved in the list, we first need to find whether they pass N times in a row (flaky) or fail once in every X runs (unstable); for this we can leverage the metrics dashboard for trends.

WDYT? Please let me know @reta @peterzhuamazon @andrross @dblock @getsaurabh02

reta commented 1 month ago

Thanks @prudhvigodithi

  • We can first save the existing tests in a list.

Why not collect them from the GitHub issues when scheduling the job (once a day)?

The revert will not happen if the failing test is already part of the list.

:+1:

We can also keep updating the list with new failures based on our call, or remove an entry from the list once the test is fixed.

:100:

Now, only for the tests that we saved in the list, we first need to find whether they pass N times in a row (flaky) or fail once in every X runs (unstable); for this we can leverage the metrics dashboard for trends.

I think those would be naturally phased out by https://github.com/opensearch-project/OpenSearch/issues/14475#issuecomment-2206874037 if we construct the list every time we schedule the job.
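As a rough illustration, rebuilding that list from the GitHub issues each time the job is scheduled could look like the step below; the repository, label, and issue-title format are assumptions.

```groovy
// Hypothetical pipeline step that rebuilds the flaky-test list from open GitHub issues.
stage('collect-known-flaky-tests') {
    steps {
        script {
            def titles = sh(
                script: 'gh issue list --repo opensearch-project/OpenSearch ' +
                        '--label "flaky-test" --state open --limit 500 --json title --jq ".[].title"',
                returnStdout: true
            ).trim()
            // Assumes the issue titles contain the failing test class names.
            env.KNOWN_FLAKY_TESTS = titles.readLines().join(',')
        }
    }
}
```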

prudhvigodithi commented 1 month ago

Why not collect them from the GitHub issues when scheduling the job (once a day)?

Thanks @reta, we can do that, or alternatively just run the same query on the Metrics Cluster to get the list of tests to exclude 👍.
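A rough sketch of what such a query against the Metrics Cluster could look like is below; the endpoint, index name, and field names are assumptions rather than the actual schema.

```groovy
// Hypothetical query for test classes that failed gradle check in the last 30 days.
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

def METRICS_ENDPOINT = 'https://metrics.example.org'         // placeholder endpoint
def query = JsonOutput.toJson([
    size : 0,
    query: [bool: [filter: [
        [term : [test_status: 'FAILED']],                    // assumed field names
        [range: [build_start_time: [gte: 'now-30d']]]
    ]]],
    aggs : [failed_tests: [terms: [field: 'test_class', size: 500]]]
])

def conn = new URL("${METRICS_ENDPOINT}/gradle-check-test-results/_search").openConnection()
conn.requestMethod = 'POST'
conn.doOutput = true
conn.setRequestProperty('Content-Type', 'application/json')
conn.outputStream.withWriter { it << query }

def buckets = new JsonSlurper().parse(conn.inputStream).aggregations.failed_tests.buckets
def testsToExclude = buckets.collect { it.key }              // list of test classes to exclude
println testsToExclude
```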