Re-triggering failed gradle build should run only failed tests

sachinpkale commented 2 months ago

Is your feature request related to a problem? Please describe

Currently, majority of the build failures for OpenSearch repo are due to flaky tests.
Re-triggering the build runs all the tests again (40+ mins) and it is possible that next time build fails due to another flaky test.
This impacts contributor's productivity and slows down the feature work.
Build re-trigger also adds extra stress on the build infra and this gets amplified as we get closer to code freeze date for a release.

Describe the solution you'd like

If a build fails due to test failure, re-triggering the build should run only the failed tests from the previous run.
It would be safe to re-run only failed tests. If the change in PR is actually causing a test to fail, it would always fail.
This will help in reducing the build re-trigger time considerably.
The incremental test run should only be applicable to re-trigger of failed build. Any new change pushed to PR should run the entire build.

Describe alternatives you've considered

Alternative solution to the mentioned issue is to bring down number of flaky tests to 0.
Even though we continuously try to reduce flaky tests, with each new feature, new tests are added and it is possible that new flaky tests are introduced (this happens even with multiple runs of tests on local setup)
Tests with random wait (or assertBusy) are more susceptible to being flaky as most of the these tests do not show any symptom on local and fail when running on GitHub build system (mostly due to overloaded build servers).
We will continue our efforts to reduce flaky tests, but aiming for 0 flaky tests may take months and does not guarantee that a new flaky test will not be introduced.

Additional context

No response

rishabh6788 commented 2 months ago

Thanks for creating this issue @sachinpkale, I was thinking about creating a similar issue in OpenSearch repo to gather feedback from all the contributors of OS repo on how to tackle this problem. While refactoring gradle-check and break it into parallel ci runs seems to be most optimal solution in the long run, we do need some mechanism to reduce the churn due to flakey test failures.

The solution I was thinking about is as follows:

Approach 1

The solution could be we catch the failure status in the existing gradle-check yml workflow file and add new steps to only retry the failing tests, if the retry passes the gradle-check would show success. This will not require any custom action from developer side. This will not help in genuine test failures and will extend gradle-check execution time.

Approach 2

gradle-check fails on the PR.
The developer determines whether it is genuine failure or flakey test failures.
If flakey test, the dev adds a comment or label that says re-run failed tests or something similar.
This triggers a workflow (combination of GHA and Jenkins) which gets failing test list from the last failed gradle-check run.
Runs those tests only and publishes result back on the PR.

There will be other checks like if a new commit was published which caused a new gradle-check run and then a comment/label was added then it wouldn't run stating that a gradle-check is already in progress.

But before we go ahead and start talking about solutions, I have a question on what happens to the last failed gradle-check ci rule on the PR. For now a passing gradle-check is a must for the PR to be merged. In the solution proposed, even if we re-run the failing tests and update on the PR that they passed, the status of gradle-check would still show failed.

Do we then relax this rule? Not required if we go with approach 1.

@reta @getsaurabh02 @prudhvigodithi @gaiksaya @dblock @peterzhuamazon Thoughts? Both approaches are in theory and need to be tested for feasibility. Would like to hear any other ideas to tackle this problem more efficiently.

shiv0408 commented 2 months ago

+1 for breaking the gradle-check into multiple parallel ci runs

reta commented 2 months ago

Thanks @sachinpkale @rishabh6788 , I would also agree with @shiv0408 that breaking the Gradle check into separate tasks / phases (which could be run in parallel and retried individually) looks like a good first step, if I remember correctly, @andrross also brought this idea some time ago. Unrelated to flaky tests, it would also help with getting better test coverage reports.

If a build fails due to test failure, re-triggering the build should run only the failed tests from the previous run.

This will not work as of today (monolithic Gradle check), for example when there are failures in unit tests in any module, the build fails right away (example here [1]):

* What went wrong:
Execution failed for task ':modules:reindex:test'.
> There were failing tests. See the report at: file:///var/jenkins/workspace/gradle-check/search/modules/reindex/build/reports/tests/test/index.html

Retrying such tests would not be useful since the majority of tests weren't even run. To reliable implement such feature we need to make sure none of the test related tasks / phases were skipped.

[1] https://build.ci.opensearch.org/job/gradle-check/47638/console

prudhvigodithi commented 2 months ago

Thanks @rishabh6788 @reta @sachinpkale @shiv0408, here are some issues from past to Optimize the Gradle check https://github.com/opensearch-project/OpenSearch/issues/1975 https://github.com/opensearch-project/OpenSearch/issues/4053 https://github.com/opensearch-project/OpenSearch/issues/12410 https://github.com/opensearch-project/OpenSearch/issues/2496 https://github.com/opensearch-project/opensearch-build/issues/4810 https://github.com/opensearch-project/opensearch-build/issues/1572

I would vote for breaking the Gradle check into separate tasks / phases which will improve the developer productivity and eventually improve the Core contributions.

I'm also ok with other approaches to run in incremental or just re-run the failed Gradle tests, but again with Gradle task graph dependencies we might eventually trigger more tests that just the targeted tests part of re-try.

As one quick fix we can update the gradle-check workflow to just run for the latest head commit and cancel all the other running jobs for the same PR, this will reduce some noise on the PR (with failing gradle check comments) and stress on infra (by cancelling the long running old commit gradle runs).

Once we split the gradle check and optimize it, down the lane (or in parallel) we can tackle the existing and take a call if we have to mute them for short term and make them the entry criteria for the upcoming release and get them fixed. Some more thoughts here https://github.com/opensearch-project/opensearch-build/issues/4810#issuecomment-2206798742.

@cwperks You may be interested in this conversation as well.

rishabh6788 commented 2 months ago

As one quick fix we can update the gradle-check workflow to just run for the latest head commit and cancel all the other running jobs for the same PR, this will reduce some noise on the PR (with failing gradle check comments) and stress on infra (by cancelling the long running old commit gradle runs).

I have created an issue for the same to discuss potential solutions in https://github.com/opensearch-project/opensearch-build/issues/5008. @prudhvigodithi @reta appreciate your feedback and comments.

opensearch-project / opensearch-build