Open sachinpkale opened 2 months ago
Thanks for creating this issue @sachinpkale, I was thinking about creating a similar issue in OpenSearch repo to gather feedback from all the contributors of OS repo on how to tackle this problem.
While refactoring gradle-check
and break it into parallel ci runs seems to be most optimal solution in the long run, we do need some mechanism to reduce the churn due to flakey test failures.
The solution I was thinking about is as follows:
The solution could be we catch the failure status in the existing gradle-check
yml workflow file and add new steps to only retry the failing tests, if the retry passes the gradle-check
would show success
. This will not require any custom action from developer side.
This will not help in genuine test failures and will extend gradle-check
execution time.
gradle-check
fails on the PR. re-run failed tests
or something similar. There will be other checks like if a new commit was published which caused a new gradle-check run and then a comment/label was added then it wouldn't run stating that a gradle-check
is already in progress.
But before we go ahead and start talking about solutions, I have a question on what happens to the last failed gradle-check
ci rule on the PR. For now a passing gradle-check is a must for the PR to be merged.
In the solution proposed, even if we re-run the failing tests and update on the PR that they passed, the status of gradle-check
would still show failed
.
Do we then relax this rule? Not required if we go with approach 1.
@reta @getsaurabh02 @prudhvigodithi @gaiksaya @dblock @peterzhuamazon Thoughts? Both approaches are in theory and need to be tested for feasibility. Would like to hear any other ideas to tackle this problem more efficiently.
+1 for breaking the gradle-check
into multiple parallel ci runs
Thanks @sachinpkale @rishabh6788 , I would also agree with @shiv0408 that breaking the Gradle check into separate tasks / phases (which could be run in parallel and retried individually) looks like a good first step, if I remember correctly, @andrross also brought this idea some time ago. Unrelated to flaky tests, it would also help with getting better test coverage reports.
If a build fails due to test failure, re-triggering the build should run only the failed tests from the previous run.
This will not work as of today (monolithic Gradle check), for example when there are failures in unit tests in any module, the build fails right away (example here [1]):
* What went wrong:
Execution failed for task ':modules:reindex:test'.
> There were failing tests. See the report at: file:///var/jenkins/workspace/gradle-check/search/modules/reindex/build/reports/tests/test/index.html
Retrying such tests would not be useful since the majority of tests weren't even run. To reliable implement such feature we need to make sure none of the test related tasks / phases were skipped.
[1] https://build.ci.opensearch.org/job/gradle-check/47638/console
Thanks @rishabh6788 @reta @sachinpkale @shiv0408, here are some issues from past to Optimize the Gradle check https://github.com/opensearch-project/OpenSearch/issues/1975 https://github.com/opensearch-project/OpenSearch/issues/4053 https://github.com/opensearch-project/OpenSearch/issues/12410 https://github.com/opensearch-project/OpenSearch/issues/2496 https://github.com/opensearch-project/opensearch-build/issues/4810 https://github.com/opensearch-project/opensearch-build/issues/1572
I would vote for breaking the Gradle check into separate tasks / phases which will improve the developer productivity and eventually improve the Core contributions.
I'm also ok with other approaches to run in incremental or just re-run the failed Gradle tests, but again with Gradle task graph dependencies we might eventually trigger more tests that just the targeted tests part of re-try.
As one quick fix we can update the gradle-check workflow to just run for the latest head commit and cancel all the other running jobs for the same PR, this will reduce some noise on the PR (with failing gradle check comments) and stress on infra (by cancelling the long running old commit gradle runs).
Once we split the gradle check and optimize it, down the lane (or in parallel) we can tackle the existing and take a call if we have to mute them for short term and make them the entry criteria for the upcoming release and get them fixed. Some more thoughts here https://github.com/opensearch-project/opensearch-build/issues/4810#issuecomment-2206798742.
@cwperks You may be interested in this conversation as well.
As one quick fix we can update the gradle-check workflow to just run for the latest head commit and cancel all the other running jobs for the same PR, this will reduce some noise on the PR (with failing gradle check comments) and stress on infra (by cancelling the long running old commit gradle runs).
I have created an issue for the same to discuss potential solutions in https://github.com/opensearch-project/opensearch-build/issues/5008. @prudhvigodithi @reta appreciate your feedback and comments.
Is your feature request related to a problem? Please describe
Describe the solution you'd like
Describe alternatives you've considered
assertBusy
) are more susceptible to being flaky as most of the these tests do not show any symptom on local and fail when running on GitHub build system (mostly due to overloaded build servers).Additional context
No response