Better visibility into test failures over time

andrross commented 6 months ago

I've created a script that crawls the OpenSearch Jenkins builds to find test failures, but only for the Gradle checks that run on code after it is pushed to the main branch. This filters out failures that are due to unmerged code in work-in-progress PRs.

I've included below the output after crawling 2000 recent builds (approx. Oct 16 - Nov 14). This data is very hard to follow, but one thing in particular stands out: SearchQueryIT.testCommonTermsQuery is a frequently failing test, but only since build 29184 (Oct 28). There are no failures before that, which strongly suggests something was changed around Oct 28 that introduced the flakiness. ~I haven't started to look but I suspect we'll be able to find the cause pretty quickly given that there is a point in time to start looking at.~ Update Nov 16: the root cause was an unrelated change for concurrent search that randomly increased the number of deleted documents and exposed some underlying brittleness in this test: #11233. Diagnosing the root cause was a bit tricky and required diving into the specifics of how the common terms query works, but it was indeed much simpler once the flakiness was correlated to a small date range and then a specific commit.

Surely there are better tools for visualizing test reports over time, perhaps already built into Jenkins? Also, we don't push that many commits so the sample size on builds after pushes to main isn't that large. Something like a nightly job to run the test suite 10 or 50 or 100 times and create a report on failures would help to quickly surface newly introduced flakiness.

$ ruby ~/flaky-test-finder-push-trigger-main.rb -s 27990 -e 29990

24 org.opensearch.indices.replication.SegmentReplicationIT.testSendCorruptBytesToReplica (28239,28239,28239,28239,28645,28645,28645,28645,28702,28702,28702,28702,28875,28875,28875,28875,28894,28894,28894,28894,28897,28897,28897,28897)
17 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/20_response_filtering/Nodes Stats with response filtering} (28276,28276,28276,28276,28278,28278,28278,28278,28765,28962,28962,28962,28962,28989,28989,28989,28989)
16 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testRequestStats (28259,28259,28259,28259,28276,28276,28276,28276,28316,28316,28316,28316,28368,28368,28368,28368)
12 org.opensearch.search.aggregations.metrics.CardinalityWithRequestBreakerIT.testRequestBreaker {p0={"search.concurrent_segment_search.enabled":"true"}} (28051,28184,28251,28481,28502,28576,28727,28765,28766,28797,28841,28894)
9 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock (28051,28576,28702,28713,28875,28897,29428,29666,29846)
9 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=cat.nodes/10_basic/Test cat nodes output} (28276,28276,28276,28276,28278,28278,28278,28278,28765)
9 org.opensearch.index.shard.RemoteIndexShardTests.classMethod (28716,28716,28897,28897,28966,28966,29666,29666,29666)
8 org.opensearch.search.aggregations.metrics.CardinalityWithRequestBreakerIT.testRequestBreaker {p0={"search.concurrent_segment_search.enabled":"false"}} (28051,28481,28576,28765,28766,28797,28841,28894)
7 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=cat.nodes/10_basic/Additional disk information} (28276,28276,28276,28276,28278,28278,28765)
7 org.opensearch.search.query.SearchQueryIT.testCommonTermsQuery {p0={"search.concurrent_segment_search.enabled":"true"}} (29184,29324,29343,29378,29506,29846,29954)
7 org.opensearch.search.query.SearchQueryIT.testCommonTermsQuery {p0={"search.concurrent_segment_search.enabled":"false"}} (29184,29324,29343,29378,29506,29846,29954)
6 org.opensearch.search.aggregations.metrics.CardinalityWithRequestBreakerIT.classMethod (28797,28797,28797,28841,28841,28841)
6 org.opensearch.cluster.service.MasterServiceTests.testClusterStateBatchedUpdates (28899,28905,28966,28989,28994,29003)
5 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/11_indices_metrics/Metric - _all} (28765,28989,28989,28989,28989)
5 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/20_response_filtering/Nodes Stats filtered using both includes and excludes filters} (28278,28278,28278,28278,28989)
5 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/30_discovery/Discovery stats} (28765,28962,28966,28989,28989)
5 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=cat.allocation/10_basic/Node ID} (28276,28276,28276,28276,28278)
4 org.opensearch.cluster.MinimumClusterManagerNodesIT.classMethod (28897,28897,28897,28897)
4 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation (28765,28766,29432,29508)
3 org.opensearch.index.shard.RemoteIndexShardTests.testSegRepSucceedsOnPreviousCopiedFiles (28716,28897,28966)
3 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata (29070,29132,29274)
3 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.classMethod (29070,29132,29378)
3 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/11_indices_metrics/Metric - blank} (28278,28765,28962)
3 org.opensearch.remotestore.RemoteIndexRecoveryIT.testSnapshotRecovery (28481,29432,29655)
3 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled (28502,29561,29666)
3 org.opensearch.indices.replication.SegmentReplicationSuiteIT.testFullRestartDuringReplication (28671,28716,29561)
3 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (28702,28875,29132)
3 org.opensearch.search.aggregations.bucket.DiversifiedSamplerIT.testNestedDiversity {p0={"search.concurrent_segment_search.enabled":"true"}} (28706,28727,29343)
3 org.opensearch.search.aggregations.bucket.DiversifiedSamplerIT.testSimpleDiversity {p0={"search.concurrent_segment_search.enabled":"true"}} (28706,28727,29343)
2 org.opensearch.remotestore.RemoteStoreClusterStateRestoreIT.testFullClusterRestoreGlobalMetadata (29595,29655)
2 org.opensearch.index.shard.RemoteIndexShardTests.testRepicaCleansUpOldCommitsWhenReceivingNew (28239,29293)
2 org.opensearch.indices.replication.SegmentReplicationSuiteIT.classMethod (28716,29561)
2 org.opensearch.search.nested.SimpleNestedIT.testSimpleNestedSortingWithNestedFilterMissing {p0={"search.concurrent_segment_search.enabled":"true"}} (28682,29508)
1 org.opensearch.search.profile.query.QueryProfilerTests.testBasic {p0=5} (29044)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (29132)
1 org.opensearch.remotestore.RemoteStoreStatsIT.testDownloadStatsCorrectnessSinglePrimaryMultipleReplicaShards (29132)
1 org.opensearch.remotestore.RemoteStoreStatsIT.testNonZeroPrimaryStatsOnNewlyCreatedIndexWithZeroDocs (29132)
1 org.opensearch.index.reindex.ReindexBasicTests.testMultipleSources (29177)
1 org.opensearch.index.reindex.ReindexBasicTests.testFiltering (29177)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadNonexistentBlobThrowsNoSuchFileException (29184)
1 org.opensearch.action.admin.indices.create.RemoteShrinkIndexIT.testCreateShrinkIndex (29279)
1 org.opensearch.action.admin.indices.create.RemoteShrinkIndexIT.classMethod (29279)
1 org.opensearch.discovery.ClusterDisruptionIT.classMethod (29293)
1 org.opensearch.search.SearchWeightedRoutingIT.testSearchAggregationWithNetworkDisruption_FailOpenEnabled (29293)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadRangeBlobWithRetries (29324)
1 org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (29324)
1 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileSyncingSegments (29378)
1 org.opensearch.search.query.QueryProfilePhaseTests.testTerminateAfterEarlyTermination {p0=5 p1=org.opensearch.search.query.ConcurrentQueryPhaseSearcher@521ba38f} (29417)
1 org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.testCreateSplitIndex (29536)
1 org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.testCreateSplitIndexToN (29536)
1 org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.testSplitFromOneToN (29536)
1 org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.testSplitIndexPrimaryTerm (29536)
1 org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.classMethod (29536)
1 org.opensearch.search.SearchWeightedRoutingIT.testShardRoutingWithNetworkDisruption_FailOpenEnabled (29595)
1 org.opensearch.index.shard.RemoteIndexShardTests.testSegmentReplication_With_EngineClosedConcurrently (29666)
1 org.opensearch.index.shard.IndexShardTests.testCommitLevelRestoreShardFromRemoteStore (29729)
1 org.opensearch.index.translog.RemoteFsTranslogTests.testMetadataFileDeletion (28027)
1 org.opensearch.search.query.QueryProfilePhaseTests.testTerminateAfterEarlyTermination {p0=5 p1=org.opensearch.search.query.ConcurrentQueryPhaseSearcher@1d1c37d5} (29821)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteLargeBlob (28051)
1 org.opensearch.search.query.QueryProfilePhaseTests.testTerminateAfterEarlyTermination {p0=5 p1=org.opensearch.search.query.ConcurrentQueryPhaseSearcher@c83ed77} (28521)
1 org.opensearch.search.SearchTimeoutIT.testSimpleTimeout {p0={"search.concurrent_segment_search.enabled":"false"}} (28576)
1 org.opensearch.remotestore.RemoteStoreStatsIT.testDownloadStatsCorrectnessSinglePrimarySingleReplica (28671)
1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testRestoreSnapshotToIndexWithSameNameDifferentUUID (28706)
1 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.testRandomDirectoryIOExceptions {p0={"search.concurrent_segment_search.enabled":"true"}} (28706)
1 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.classMethod (28706)
1 org.opensearch.indices.replication.SegmentReplicationSuiteIT.testBasicReplication (28716)
1 org.opensearch.indices.replication.SegmentReplicationSuiteIT.testDeleteIndexWhileReplicating (28716)
1 org.opensearch.remotestore.RemoteStoreClusterStateRestoreIT.testFullClusterStateRestore (28727)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/11_indices_metrics/Metric - indexing doc_status} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/50_indexing_pressure/Indexing pressure stats} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/11_indices_metrics/Metric - recovery} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/10_basic/Nodes stats level} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/50_indexing_pressure/Indexing pressure memory limit} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/11_indices_metrics/Metric - _all include_segment_file_sizes} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/11_indices_metrics/Metric - multi} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/11_indices_metrics/Metric - indices _all} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/11_indices_metrics/Metric - one} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/40_store_stats/Store stats} (28765)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=cat.fielddata/10_basic/Test cat fielddata output} (28765)
1 org.opensearch.test.rest.ClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (28765)
1 org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing (28797)
1 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (28797)
1 org.opensearch.action.bulk.BulkIntegrationIT.testDocIdTooLong (28797)
1 org.opensearch.action.bulk.BulkIntegrationIT.testBulkIndexCreatesMapping (28797)
1 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (28797)
1 org.opensearch.search.functionscore.DecayFunctionScoreIT.classMethod (28813)
1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (28841)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testMultipleSnapshotAndRollback (28875)
1 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (28899)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testContainerCreationAndDeletion (29044)
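
The kind of pattern noticed by eye for SearchQueryIT.testCommonTermsQuery could also be pulled out of this output mechanically. Below is a minimal sketch, assuming the line format shown above (count, test name, comma-separated build numbers in parentheses). The function names, the `first_build` cutoff and the `min_builds` threshold are illustrative assumptions, not part of the original script.

```python
import re
from typing import NamedTuple

# One report line: "<count> <test name> (<build>,<build>,...)"
LINE_RE = re.compile(r"^(\d+)\s+(.+)\s+\(([\d,]+)\)$")


class FailureRecord(NamedTuple):
    name: str
    count: int
    builds: list  # sorted, de-duplicated build numbers


def parse_report(text):
    """Parse the crawler output into FailureRecord entries, skipping other lines."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        builds = sorted({int(b) for b in match.group(3).split(",")})
        records.append(FailureRecord(match.group(2), int(match.group(1)), builds))
    return records


def newly_flaky(records, first_build, min_builds=3):
    """Tests with no failure before first_build that failed in at least min_builds distinct builds."""
    return [r for r in records if r.builds[0] >= first_build and len(r.builds) >= min_builds]


sample = """\
9 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock (28051,28576,28702,28713,28875,28897,29428,29666,29846)
7 org.opensearch.search.query.SearchQueryIT.testCommonTermsQuery {p0={"search.concurrent_segment_search.enabled":"true"}} (29184,29324,29343,29378,29506,29846,29954)
3 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileFetchingMetadata (29070,29132,29274)
"""

flagged = newly_flaky(parse_report(sample), first_build=29000, min_builds=5)
for record in flagged:
    print(record.builds[0], record.name)
```

Picking the cutoff by hand is a simplification; a real report would probably want to scan for the cutoff or plot failures per build instead.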

andrross commented 4 months ago

The request here is very similar to this older issue: #3713

peternied commented 2 months ago

@andrross I created a repo [1] that collects project health information and publishes reports to the repo every day. See the latest reports at https://github.com/peternied/contribution-rate?tab=readme-ov-file#reports

One such report lists the top failing tests over the last 30 days; here is the March 8 report. It will keep updating every day. Feel free to contribute any other reports you'd like to see.

[1] https://github.com/peternied/contribution-rate

prudhvigodithi commented 3 weeks ago

Hey @andrross @peternied, we now have the gradle metrics published to the OpenSearch Gradle Check Metrics dashboard, as part of surfacing opensearch-metrics to the community. Please check the currently supported metrics. How about we expand this and add more metrics as required? We could also use this data to create triggers such as GitHub issues, comments, etc. @bbarani @Pallavi-AWS @dblock

andrross commented 2 weeks ago

@prudhvigodithi Do you think it would make sense to add some details about using the gradle check metrics dashboard to help investigate and fix flaky failures in either TESTING.md or DEVELOPER_GUIDE.md?

@dreamer-89 created a great list in #3713:

Identify top hitter for prioritization.
Identify the commit that introduced a flaky test or increased the frequency of an existing test failure.
Build failure trend to identify health of software.
Developers impacted due to flaky tests.
Test history.

I think if we document somewhere how to use the new dashboard to solve those problems then we can close both of these issues as completed.

prudhvigodithi commented 2 days ago

Thanks @andrross and @dreamer-89. Based on your list, I have modified the gradle check workflow with new fields and created some new visualizations from the indexed data (thanks to @rishabh6788 for setting up the initial flow). Please check the link: OpenSearch Gradle Check Metrics.

Identify top hitter for prioritization

For this I have created a pie chart showing the test_class values with the most failures, along with the top failing tests within each test_class. We can slice the data further with filters, for example to list the PRs and owners, or only the post-merge runs, that have the top failing tests.
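
As a rough illustration of the aggregation behind such a chart, here is a minimal sketch that ranks test classes by failure count and lists the failing tests within each. It assumes each failure is available as a (test_class, test_name) pair; the function name is hypothetical, and the sample pairs reuse test names from the report earlier in this issue with made-up counts.

```python
from collections import Counter, defaultdict


def top_failing_classes(failures, n=3):
    """Rank test classes by failure count.

    failures: iterable of (test_class, test_name) pairs, one per failed test run.
    Returns a list of (test_class, total_failures, [(test_name, count), ...]).
    """
    by_class = Counter()
    by_test = defaultdict(Counter)
    for test_class, test_name in failures:
        by_class[test_class] += 1
        by_test[test_class][test_name] += 1
    return [
        (cls, total, by_test[cls].most_common())
        for cls, total in by_class.most_common(n)
    ]


# Illustrative input only; the counts are made up.
failures = (
    [("SegmentReplicationIT", "testSendCorruptBytesToReplica")] * 4
    + [("SearchQueryIT", "testCommonTermsQuery")] * 2
    + [("RemoteIndexShardTests", "classMethod")] * 2
    + [("RemoteIndexShardTests", "testSegRepSucceedsOnPreviousCopiedFiles")]
)

ranking = top_failing_classes(failures, n=2)
for test_class, total, tests in ranking:
    print(test_class, total, tests)
```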

Identify the commit that introduced a flaky test or increased the frequency of an existing test failure

The following data tables show the git commit, the associated PR and the PR owner along with all the failing test details; we can filter by PR or commit to get the details of the failed tests. The new visualization Gradle Check - Top test class failures with Post Merge also shows the flaky test information together with the commit ID and the PR (with owner) that was merged with that commit, for post-merge runs (the gradle check that runs after a PR is merged). We should be able to drill down further by test name or test class name for more details.

Build failure trend to identify health of software

For this the dashboard has a TSVB and a line chart showing the trend of failing tests. These can again be filtered by test name, test class, commit ID, PR, and post-merge executions.

Developers impacted due to flaky tests

All of the visualizations can be filtered by PR owner, PR number or commit ID. The results include hyperlinks to the GitHub PR or commit, where one can see the comments and the other users involved. The dashboards also show the PR owner, so the impacted user can be identified. The visualizations also link to the Jenkins build data, where one can see the full stack traces for the failed tests (example 39487).

Test history

All the visualizations in the dashboard can be filtered by date range; using OpenSearch we get this out of the box :) With this we can go back, see the trends and draw conclusions from them.

Adding @peternied @getsaurabh02 @dblock @Pallavi-aws @reta

reta commented 1 day ago

@prudhvigodithi @rishabh6788 it looks great, thank you so much folks for putting it all together

andrross commented 1 day ago

> @prudhvigodithi @rishabh6788 it looks great, thank you so much folks for putting it all together

Agreed, this is awesome!

prudhvigodithi commented 1 day ago

Thanks @reta and @andrross, I have created a PR that adds some details about this dashboard to DEVELOPER_GUIDE.md: https://github.com/opensearch-project/OpenSearch/pull/13919, please check.

prudhvigodithi commented 1 day ago

As a next step for surfacing test failures as GitHub issues: instead of creating a very generic issue like https://github.com/opensearch-project/OpenSearch/issues/13893 (coming from https://github.com/opensearch-project/OpenSearch/blob/main/.github/workflows/gradle-check.yml#L161-L168), a step which sometimes fails to execute (https://github.com/opensearch-project/OpenSearch/actions/runs/9320653340/job/25657907035), how about we use the following data table information to create a GitHub issue?

Here is an example, after finding the failed tests from post-merge actions:

We should start by creating an issue at the test class level, for example NestedQueryBuilderTests, then link and keep adding all the commit and PR information to the issue created for NestedQueryBuilderTests.

1st, to the issue created for NestedQueryBuilderTests, we can link all the post-merge failures and commits.

2nd, on the same issue for NestedQueryBuilderTests, we can add the failed tests that are part of NestedQueryBuilderTests and the Jenkins build information for the stack traces.

3rd, on the same issue, we can add information about other PRs where this test class has failed or is still failing.
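
The three steps above could be sketched roughly as follows. This is only an illustration of grouping failure rows by test class and rendering one issue body per class; the row field names (`test_class`, `test_name`, `commit`, `pr_number`, `build_number`, `post_merge`) and the body layout are assumptions, not the actual index schema, and the sample rows are made up.

```python
from collections import defaultdict


def build_issue_reports(rows):
    """Group failure rows by test class and render one issue body per class."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["test_class"]].append(row)

    reports = {}
    for test_class, items in grouped.items():
        # Step 1: post-merge failures with their commits and PRs.
        post_merge = sorted({(r["commit"], r["pr_number"]) for r in items if r["post_merge"]})
        # Step 2: failed tests in this class with the Jenkins build for the stack trace.
        tests = sorted({(r["test_name"], r["build_number"]) for r in items})
        # Step 3: other (not yet merged) PRs where the class failed.
        other_prs = sorted({r["pr_number"] for r in items if not r["post_merge"]})

        lines = [f"Flaky test report for {test_class}", "", "Post-merge failures (commit, PR):"]
        lines += [f"- {commit} (PR #{pr})" for commit, pr in post_merge]
        lines += ["", "Failed tests (Jenkins build):"]
        lines += [f"- {name} (build {build})" for name, build in tests]
        lines += ["", "Other PRs where this class failed:"]
        lines += [f"- #{pr}" for pr in other_prs]
        reports[test_class] = "\n".join(lines)
    return reports


# Made-up rows for illustration; test names, commits, PRs and builds are hypothetical.
rows = [
    {"test_class": "NestedQueryBuilderTests", "test_name": "testExampleA",
     "commit": "abc1234", "pr_number": 101, "build_number": 5001, "post_merge": True},
    {"test_class": "NestedQueryBuilderTests", "test_name": "testExampleB",
     "commit": "def5678", "pr_number": 102, "build_number": 5002, "post_merge": False},
    {"test_class": "OtherTests", "test_name": "testExampleC",
     "commit": "0a1b2c3", "pr_number": 103, "build_number": 5003, "post_merge": True},
]

reports = build_issue_reports(rows)
print(reports["NestedQueryBuilderTests"])
```

Keeping one issue per test class means the same rendering can be re-run on every new failure and used to update the existing issue instead of opening a new one.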

I'm open to ideas on whom to assign the created issue to. Should we just keep it open without any assignee, since each issue will carry information about multiple PRs and commits? Later, during triage, the maintainer should be able to identify the right team or user and add them as assignee.

Moving forward we can add logic to auto-close the created issue if no failure for the test class (NestedQueryBuilderTests in the above example) is found in post-merge Gradle Check builds in the last 30 days, and reopen it as required.
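
The auto-close and reopen rule could look something like the sketch below. It only decides what to do for one test class; the function name, its parameters and the returned action strings are hypothetical, and looking up the last post-merge failure and calling GitHub are left out.

```python
from datetime import date, timedelta


def issue_action(is_open, last_failure, today, quiet_days=30):
    """Decide what to do with the issue tracking one test class.

    is_open:      whether the tracking issue is currently open
    last_failure: date of the most recent post-merge failure, or None if none is known
    Returns "close", "reopen" or "none".
    """
    recently_failed = (
        last_failure is not None
        and today - last_failure <= timedelta(days=quiet_days)
    )
    if is_open and not recently_failed:
        return "close"
    if not is_open and recently_failed:
        return "reopen"
    return "none"


today = date(2024, 5, 31)
print(issue_action(True, date(2024, 4, 1), today))    # open, quiet for 60 days
print(issue_action(False, date(2024, 5, 30), today))  # closed, failed yesterday
print(issue_action(True, date(2024, 5, 30), today))   # open, still failing
```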

@andrross @reta @dblock @getsaurabh02 @peternied let me know your thoughts on this.

Thank you

andrross commented 1 day ago

Now that we have the metrics and the updated developer guide, I'm going to close this and issue #3713. If anyone thinks there is more to do here please reopen or open a new issue. Thanks!