opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Meta] Fix random test failures #1715

Closed anasalkouz closed 10 months ago

anasalkouz commented 2 years ago

PRs were blocked by transient gradle check errors multiple times. Provide a plan to stabilize the tests.

andrross commented 2 years ago

I did a quick experiment overnight on my dev machine where I ran the internalClusterTest all night in a loop:

for i in $(seq 0 1000) ; do echo "Iteration: $i" && ./gradlew ':server:internalClusterTest' >> test-output.txt 2>&1 ; done

Results:

$ egrep 'BUILD (SUCCESSFUL|FAILED)' test-output.txt | wc -l
152
$ egrep 'BUILD FAILED' test-output.txt | wc -l
3
$ egrep '^REPRODUCE' test-output.txt | less -S | uniq
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate" -Dtests.seed=7B8B067879F3C91F -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en -Dtests.timezone=Brazil/West -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=9F8306D99E2C2EF1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=id -Dtests.timezone=Asia/Aqtau -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate" -Dtests.seed=6D39D8439C254FF0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-VE -Dtests.timezone=Pacific/Honolulu -Druntime.java=17

All 3 failures were caused by "Suite timeout exceeded (>= 1200000 msec)."

From this I'll make a couple hypotheses:

  1. There is a bug in the logic of ShardIndexingPressureSettingsIT that sometimes causes it to hang and fail with the overall test timeout. See previous issue where this same failure occurred.
  2. While failures we see in the PR workflows that run ./gradlew check often manifest as a failure somewhere in :server:internalClusterTest, they are not the result of buggy logic within the tests themselves, but instead are the result of interference between gradle tasks running concurrently, or some other problem with the CI environment. (I make this claim because the ~2% failure rate observed in my experiment seems much lower than the failure rate we're observing in the PR checks)

I'm going to repeat my experiment but run the full check task instead of just :server:internalClusterTest. If hypothesis 2 is correct then I should see a higher failure rate than 3 out of 152 observed in this first experiment.
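
The egrep tally above can also be sketched in Python. This is a minimal sketch, assuming the log contains the `BUILD SUCCESSFUL` / `BUILD FAILED` and `REPRODUCE WITH:` lines in the shape quoted in this comment; `summarize_log` and the sample log text are hypothetical and not part of the OpenSearch tooling.

```python
import re

# Line shapes assumed from the Gradle output quoted above.
BUILD_RE = re.compile(r"BUILD (SUCCESSFUL|FAILED)")
REPRO_RE = re.compile(r'^REPRODUCE WITH: .*--tests "([^"]+)"')


def summarize_log(text):
    """Return (total builds, failed builds, {test name: distinct failures})."""
    total = 0
    failed = 0
    seen = set()  # identical REPRODUCE lines are counted only once
    tests = {}
    for line in text.splitlines():
        build = BUILD_RE.search(line)
        if build:
            total += 1
            if build.group(1) == "FAILED":
                failed += 1
        repro = REPRO_RE.match(line)
        if repro and line not in seen:
            seen.add(line)
            name = repro.group(1)
            tests[name] = tests.get(name, 0) + 1
    return total, failed, tests


# Made-up sample log, for illustration only.
sample = "\n".join([
    "BUILD SUCCESSFUL in 41m 2s",
    "REPRODUCE WITH: ./gradlew ':server:internalClusterTest' "
    '--tests "pkg.ExampleIT.testA" -Dtests.seed=ABC',
    "BUILD FAILED in 1h 3m",
])
total, failed, tests = summarize_log(sample)
print(f"{failed} of {total} builds failed ({failed / total:.1%})")
for name, count in sorted(tests.items()):
    print(f"{count:3d}  {name}")
```

Grouping by test name rather than by raw line makes it easier to see when the same test fails under different seeds, as with ShardIndexingPressureSettingsIT above.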

Dev environment:

saratvemulapalli commented 2 years ago

Another flaky test: Coming from: https://github.com/opensearch-project/OpenSearch/pull/1725

* What went wrong:
Execution failed for task ':qa:rolling-upgrade:v1.3.0#oldClusterTest'.
> `node{:qa:rolling-upgrade:v1.3.0-0}` failed to wait for ports files after 120000 MILLISECONDS

dreamer-89 commented 2 years ago

Looking into it.

dreamer-89 commented 2 years ago

A simple plan to begin with could involve the steps below:

  1. Analyze. Analyze the last X failed Jenkins builds (X=20), identify the failed tests, and count how often each one fails. This will help in prioritizing the right failures.

  2. Reproduce. The failures identified above may need a deeper dive to find root causes, as well as the ability to reproduce them locally. The expectation from this step is a dev setup where the failures can be replicated. Begin with the targeted test (fast); if that does not help, run the entire test suite (slow). Failures may not happen on every run, so the tests need to be repeated multiple times, as done by @andrross above. Replication may need a setup similar to the one used in Jenkins (worst case: a Jenkins setup). Add logs wherever necessary to deep dive into the issue. Replication may uncover new bugs in the tests; these should be documented and fixed as well, to increase overall test stability.

  3. Fix. The fix depends on the type of failure, which can broadly be classified into the categories below. This step may run in sequence after step 2 or in parallel with it, depending on the failures identified in step 1.
    a. True transient failures. Failures that happen randomly and are out of our control, e.g. node connection timeouts caused by a bad node or a networking issue. The only fix in this case is to either increase the corresponding parameter (timeout) or skip the test until a proper fix is identified.
    b. Setup related. There may be a class of failures related to misconfiguration (backward compatibility tests etc.); these are the easiest to identify. These tests may need minor configuration changes.
    c. Bug fix. The remaining class of failures are corner cases whose root causes are trickier and may need specific areas of expertise. Based on the area of failure, the relevant engineer needs to be involved to debug the issue further.
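
The Analyze step in the plan above could be sketched roughly as follows. This is only an illustration: the `rank_failures` helper and the build data are hypothetical, and how the failed test names are actually collected from Jenkins is not specified in this thread.

```python
from collections import Counter


def rank_failures(failed_tests_by_build):
    """Count in how many builds each test failed, most frequent first."""
    counts = Counter()
    for tests in failed_tests_by_build.values():
        counts.update(set(tests))  # count a test at most once per build
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical data: build number -> failed test classes (names from this thread).
builds = {
    101: ["ShardIndexingPressureSettingsIT", "ClusterHealthIT"],
    102: ["ShardIndexingPressureSettingsIT"],
    103: [],
    104: ["StableMasterDisruptionIT", "ShardIndexingPressureSettingsIT"],
}

for test, count in rank_failures(builds):
    print(f"{count}  {test}")
```

Counting each test once per build keeps a single noisy build from dominating the ranking.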

andrross commented 2 years ago

  1. Analyze last X failed Jenkins builds (X=20)

I think it is a good idea to collect this data. It might be a bit hard to separate out the failures that were caused by the change in the PR that triggered the build. Setting up a test machine to run checks continually should be able to get similar data, and will have the benefit of running against a static code base.

  2. Reproduce

We've probably seen enough of these to know they aren't reproducible when re-run in isolation. We have open issues with quite a few errors, and none of them can be reproduced even when re-running the individual test many, many times. I think running the entire test suite is the way to go, but we probably don't need to worry about the Jenkins stuff and can just trigger the ./gradlew check command directly.

saratvemulapalli commented 2 years ago

Another one, coming from: https://github.com/opensearch-project/OpenSearch/pull/1766

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority" -Dtests.seed=28AD28E1A3FF50C7 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-PH -Dtests.timezone=Etc/GMT+8 -Druntime.java=15

org.opensearch.discovery.StableMasterDisruptionIT > testStaleMasterNotHijackingMajority FAILED
    java.lang.AssertionError: node_t1: [Tuple [v1=node_t2, v2=null]]
        at __randomizedtesting.SeedInfo.seed([28AD28E1A3FF50C7:77AB65EE82248FCB]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.opensearch.discovery.StableMasterDisruptionIT.lambda$testStaleMasterNotHijackingMajority$5(StableMasterDisruptionIT.java:253)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1048)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1021)
        at org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority(StableMasterDisruptionIT.java:250)

andrross commented 2 years ago

I ran another experiment over the weekend, the theory being that maybe :qa:mixed-cluster:v1.2.2#mixedClusterTest was interfering with :server:internalClusterTest:

for i in $(seq 0 1000) ; do echo "Iteration: $i" && ./gradlew clean > /dev/null 2>&1 && ./gradlew :server:internalClusterTest :qa:mixed-cluster:v1.2.2#mixedClusterTest >> ../build-failure-tests/test-output-2021-12-17_2.txt 2>&1 ; done

but the results were 7 failures out of 330, which is in line with the ~2% failure rate of the integ tests in isolation. The failures were:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=60436199814D8A58 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sr-CS -Dtests.timezone=Etc/GMT+5 -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=8EC37C710AA42BCE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=no-NO -Dtests.timezone=EET -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=B4175006736B7460 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-US -Dtests.timezone=Africa/Casablanca -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites" -Dtests.seed=6AF32DFBEB864CEE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hant-TW -Dtests.timezone=PRC -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites" -Dtests.seed=D921821394B6DBAA -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-GB -Dtests.timezone=America/Nipigon -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=FA529FAA49915455 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-SY -Dtests.timezone=AET -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=FC550CFC70BBB318 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hans-CN -Dtests.timezone=America/Knox_IN -Druntime.java=17

There are likely bugs within ClusterHealthIT, ShardIndexingPressureIT, and ShardIndexingPressureSettingsIT that cause rare failures. But it remains a mystery what is causing ./gradlew check to fail at a much higher rate in the CI workflow than in these experiments.
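
Whether 7 of 330 is really "in line with" 3 of 152 can be checked with a rough interval estimate. This is a minimal sketch using the Wilson score interval, assuming every loop iteration is an independent trial (an assumption; runs on one machine may well be correlated). `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(failures, runs, z=1.96):
    """Wilson score interval (roughly 95% for z=1.96) for a failure proportion."""
    p = failures / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half


# Failure counts reported in the two experiments in this thread.
experiments = {
    "internalClusterTest alone": (3, 152),
    "internalClusterTest + mixedClusterTest": (7, 330),
}
for label, (failures, runs) in experiments.items():
    low, high = wilson_interval(failures, runs)
    print(f"{label}: {failures / runs:.2%} (interval {low:.2%} to {high:.2%})")
```

With samples this small the two intervals overlap widely, so the experiments cannot distinguish the two setups; a CI failure rate would need to sit well above both intervals to count as clearly different.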

dblock commented 2 years ago

#1725

I opened https://github.com/opensearch-project/OpenSearch/issues/1793 for this one specifically.

nknize commented 2 years ago

/cc @getsaurabh02

ShardIndexingPressureSettingsIT is a problem child. Can y'all investigate the recurring "Suite timeout exceeded (>= 1200000 msec)." and see if this is either a real issue with the Indexing Pressure implementation or simply a test cluster resourcing issue when run in the context of the entire check suite?

andrross commented 2 years ago

Suraj @dreamer-89 has been digging into the ShardIndexingPressureSettingsIT failures, tracked in #1843

nknize commented 2 years ago

Suraj @dreamer-89 has been digging into the ShardIndexingPressureSettingsIT failures, tracked in #1843

:+1: Also note open PR #1592

reta commented 2 years ago

Few more flaky tests:

dblock commented 2 years ago

I copied some links into the body of this issue... it's quite a list.

penghuo commented 2 years ago

another one #2176.

dblock commented 1 year ago

Between gradle check 6786 and 6688 (100 builds) the following tests failed more than once:

org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials}: 12
org.opensearch.test.rest.ClientYamlTestSuiteIT/test {p0=search/30_limits/Regexp length limit}: 6
org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT/test {yaml=search/30_limits/Regexp length limit}: 6
org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests/testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections: 5
org.opensearch.action.support.AutoCreateIndexTests/testParseFailed: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfReplicasIsNonNegative: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfShardsIsNotZero: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfShardsIsNotNegative: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfRoutingShards: 2
org.opensearch.cluster.routing.allocation.DiskThresholdSettingsTests/testInvalidHighDiskThreshold: 2
org.opensearch.cluster.allocation.AwarenessAllocationIT/testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness: 2
org.opensearch.common.settings.ScopedSettingsTests/testLoggingUpdates: 2
org.opensearch.cluster.coordination.NoClusterManagerBlockServiceTests/testRejectsInvalidSetting: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=search/320_disallow_queries/Test disallow expensive queries}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=cluster.put_settings/10_basic/Test put and reset persistent settings}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=search.aggregation/240_max_buckets/Max bucket}: 2
org.opensearch.action.support.AutoCreateIndexTests/testParseFailedMissingIndex: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=cluster.put_settings/10_basic/Test put and reset transient settings}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Add transient remote cluster in sniff mode with invalid proxy settings}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Switch connection mode for configured cluster}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Add transient remote cluster in proxy mode with invalid sniff settings}: 2
org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT/testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=scroll/20_keep_alive/Max keep alive}: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client}: 2
org.opensearch.cluster.coordination.ElectionSchedulerFactoryTests/testSettingsValidation: 2
org.opensearch.common.settings.ScopedSettingsTests/testValidate: 2
org.opensearch.repositories.gcs.GoogleCloudStorageBlobStoreRepositoryTests/testChunkSize: 2
org.opensearch.action.admin.cluster.settings.SettingsUpdaterTests/testUpdateOfValidationDependentSettings: 2
org.opensearch.cluster.routing.OperationRoutingTests/testWeightedOperationRoutingWeightUndefinedForOneZone: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client}: 2
org.opensearch.action.admin.cluster.settings.SettingsUpdaterTests/testAllOrNothing: 2
org.opensearch.cluster.metadata.AutoExpandReplicasTests/testInvalidValues: 2

Another ~100 failed once.

anasalkouz commented 1 year ago

I am aiming to bring these flaky tests down to zero by Dec 30, 2022. If anyone wants to help with this effort, please feel free to pick one of the flaky test issues in this list.

anasalkouz commented 1 year ago

I have added the following 2 issues as proactive mechanisms to detect flaky test failures and prevent newly introduced flaky tests: https://github.com/opensearch-project/OpenSearch/issues/5226 https://github.com/opensearch-project/OpenSearch/issues/5227

Poojita-Raj commented 1 year ago

| Url | Status | Group | Owner | Reproducible | Note |
| --- | --- | --- | --- | --- | --- |
| [BUG] DecommissionControllerTests.testTimesOut f... | Closed | decommission | andrross | | |
| [BUG] AwarenessAttributeDecommissionIT.testNodes... | assigned | decommission | pranikum | no, 100 passing tests on local | |
| [BUG] Failed Integ test testDecommissionStatusUp... | assigned | decommission | pranikum | | @imRishN opened and merged a fix - https://github.com/opensearch-project/OpenSearch/pull/4822 - doesn't resolve issue since it's seen since then |
| [Meta] Fix random test failures | untriaged | meta issue | | | |
| [BUG] testCoordinatingPrimaryThreadedUpdateToSha... | pending | shardIndexing | | | |
| [BUG] ShardIndexingPressureIT.testShardIndexingP... | pending | shardIndexing | | | |
| [BUG] org.opensearch.action.bulk.BulkIntegration... | pending | | | yes, failed 2/100 tests | |
| [BUG] org.opensearch.persistent.PersistentTasksE... | pending | | | no, 200 tests passing on local | |
| [BUG] Failures with org.opensearch.smoketest.Smo... | pending | | | yes | |
| [BUG] DedicatedClusterSnapshotRestoreIT.testInde... | assigned | | xuezhou | yes, failed 3/100 tests | Xue wrote original test |
| [BUG] Deterministic failure of AggregationsTests... | pending | | | yes | |
| [BUG] flaky test index/80_geo_point/Single point... | Closed | MixedClusterClientYamlTestSuiteIT | | | |
| [BUG] Fix flaky test org.opensearch.index.ShardI... | assigned | shardIndexing | rrpasham | yes | |
| [CI] flaky test failure - o.o.indices.stats.Inde... | pending | | | yes, failed 3/100 tests | off by 1 error |
| [CI] Test Failure org.opensearch.cluster.allocat... | pending | | | | @imRishN worked on original PR, had a fix out and merged in (https://github.com/opensearch-project/OpenSearch/pull/3646), still seeing failures after that |
| [BUG] org.opensearch.gateway.QuorumGatewayIT > t... | pending | | | no, passing 100 tests | |
| [BUG] org.opensearch.repositories.s3.RepositoryS... | untriaged | RepositoryS3ClientYamlTestSuiteIT | | | |
| [BUG] Intermittent test failure - Snapshot and R... | untriaged | RepositoryS3ClientYamlTestSuiteIT | | | |
| [BUG] OperationRoutingTests.testWeightedOperatio... | pending | | | yes | There's one PR out for a fix currently - https://github.com/opensearch-project/OpenSearch/pull/4980 - not sure if it resolves issue |
| [BUG] org.opensearch.search.aggregations.metrics... | pending | | | yes | There's one PR out for this - https://github.com/opensearch-project/OpenSearch/pull/4850 |
| [BUG] Fix flaky org.opensearch.search.PitMultiNo... | pending | PitMultiNode | | yes, failed 1/100 | |
| [CI] o.o.aliases.IndexAliasesIT.testSameAlias fa... | pending | AcknowledgedResponse failed | | no | |
| [CI] o.o.gateway.RecoveryFromGatewayIT.testReuse... | untriaged | | | | No occurrences since April, can be closed out? |
| [BUG] Fix new flaky test org.opensearch.search.D... | pending | PitMultiNode | | yes, failed 1/100 times | |
| [CI] o.o.cluster.remote.test.RemoteClustersIT.te... | untriaged | | | | No occurrences since June, can be closed out? |
| [TEST] Failures in IndexingMemoryControllerTests... | untriaged | | | no | Not seen since Jan, can be closed out? |
| [BUG] org.opensearch.discovery.DiscoveryDisrupti... | untriaged | | | | Only 1 occurrence in May |
| [BUG] org.opensearch.action.admin.cluster.tasks.... | untriaged | timeout issue | | | |
| [BUG] :test:logger-usage:test failure flakey tes... | untriaged | | | | |
| [BUG] o.o.search.SearchCancellationIT.testCancel... | pending | SearchCancellationIT | | no, passed 100 tests | |
| [BUG] node drop on o.o.cluster.routing.allocatio... | pending | | | no, passed 100 tests | |
| [CI] o.o.blocks.SimpleBlocksIT.testAddBlockWhile... | pending | | | no, passed 100 tests | also documented in issue - https://github.com/opensearch-project/OpenSearch/issues/2442 |
| [CI] o.o.versioning.ConcurrentSeqNoVersioningIT.... | pending | | | no, passed 100 tests | |
| [CI] flaky test faiure - o.o.indices.recovery.In... | pending | | | no, passed 100 tests | |
| [CI] o.o.discovery.SnapshotDisruptionIT.testDisr... | pending | SnapshotDisruptionIT | | no, passed 100 times | |
| [BUG] testCancellationDuringQueryPhaseUsingReque... | pending | SearchCancellationIT | | no, passed 150 times | |
| [BUG] cluster.routing.PrimaryAllocationIT.testPr... | pending | | | no, passed 100 times | |
| [BUG] org.opensearch.search.SearchCancellationIT... | pending | SearchCancellationIT | | no, passed 100 times | |
| [BUG] StableMasterDisruptionIT.testStaleMasterNo... | pending | | | no, passed 100 times | |
| [CI] flaky test faiure - o.o.upgrades.IndexingIT... | untriaged | | | | |
| [BUG] Flaky test failure - v1.2.5#mixedClusterTe... | untriaged | MixedClusterClientYamlTestSuiteIT | | | |
| [BUG] Master bootstrap takes time causing interm... | pending | | | no, passed 100 tests | renamed test |
| [BUG] ClusterRerouteIT.testDelayWithALargeAmount... | untriaged | AcknowledgedResponse failed | | no, passed 100 times | |
| [BUG] Flaky test failure - org.opensearch.blocks... | closed | | | | Same as #33 - https://github.com/opensearch-project/OpenSearch/issues/2472 |
| [BUG] org.opensearch.snapshots.ConcurrentSnapsho... | pending | | | no, passed 100 times | |
| [BUG] testRestartIndexCreationAfterFullClusterRe... | pending | | | no, passed 100 times | |
| [BUG] org.opensearch.cluster.routing.allocation.... | untriaged | | | | |
| [BUG] org.opensearch.discovery.SnapshotDisruptio... | untriaged | SnapshotDisruptionIT | | | |
| [CI] Test failure in "org.opensearch.cluster.coo... | untriaged | | | | |
| [BUG] Upgrade cli test failure while detecting e... | untriaged | | | | |
| [CI] oldClusterTest fails intermittently | untriaged | | | | |
| [BUG] Netty Transport test failing with large re... | pending | | | No | |
| [BUG] InstallPluginCommandTests.testOfficialPlug... | pending | | | No | |
| [BUG] :distribution:packages:rpm:checkExtraction... | pending | | | No | |
| [BUG] Transport NIO test intermittently failing ... | pending | | | No | |
| [BUG] :rest-api-spec:yamlRestTest org.opensearch... | pending | | | No | |
| [BUG] MinimumMasterNodesIT.testThreeNodesNoMaste... | pending | | | No | Test doesn't exist? renamed to MinimumClusterManagerNodesIT |
| [BUG] SharedClusterSnapshotRestoreIT.testSnapsho... | pending | | | No | |

andrross commented 1 year ago

I wrote a script to crawl the Jenkins output for unstable builds: https://gist.github.com/andrross/ee07a8a05beb63f1173bcb98523918b9

Below are the results for the last 1000 builds. There is a long tail of tests with a few failures, but the top 4 failures have issues already (#5219, #4212, #5157, #3603).

41 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} (6561,6561,6561,6577,6587,6591,6591,6598,6645,6709,6711,6711,6717,6750,6751,6766,6778,6778,6779,6779,6779,6782,6879,6879,6880,6880,6952,6953,6953,7074,7074,7074,7080,7082,7082,7177,7200,7201,7224,7277,7310)
23 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections (6585,6681,6962,7046,7090,7095,7149,7149,7149,7158,7188,7206,7206,7253,7253,7253,7274,7274,7274,7327,7463,7483,7492)
22 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections (6607,6616,6628,6700,6700,6720,6759,6759,6762,6828,6887,6971,6971,6975,7027,7112,7115,7168,7168,7202,7315,7315)
17 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness (6562,6601,6627,6717,6741,6908,6921,6925,7036,7047,7112,7149,7422,7447,7495,7517,7555)
11 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testTimeoutWhileThrottling (6556,6593,6594,6594,6598,6599,6601,6602,6602,6602,6742)
9 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (6790,6828,6965,7220,7256,7315,7361,7447,7543)
8 org.opensearch.cluster.service.MasterServiceTests.classMethod (6894,6894,6894,6894,7074,7074,7177,7177)
8 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client} (6589,6709,6952,6952,6953,6953,7200,7277)
7 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex (6769,7062,7077,7207,7453,7464,7517)
7 org.opensearch.indices.stats.IndexStatsIT.testFilterCacheStats (6585,7154,7183,7255,7292,7300,7551)
4 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone (6599,6602,6731,6771)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing bucket} (6952,6953,7077,7320)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client} (6711,6711,6711,6952)
4 org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing (6624,6635,6723,6979)
4 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (7185,7212,7231,7342)
4 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes (6894,6894,7074,7177)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing client} (6591,6591,6952,7201)
4 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testThrottlingForSingleNode (6593,6615,6664,6682)
3 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/teardown} (6766,6953,6956)
3 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Restore a non existing snapshot} (6782,6952,7309)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerInToBeDecommissionedZone (6606,6709,6895)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary (6894,7091,7144)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testInvariantsAndLogsOnDecommissionedNodes (6738,6792,6825)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm (6685,7406)
2 org.opensearch.gateway.QuorumGatewayIT.testQuorumRecovery (6562,7201)
2 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (6723,6979)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexToN (6685,7406)
2 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (6723,6979)
2 org.opensearch.action.bulk.BulkIntegrationIT.testExternallySetAutoGeneratedTimestamp (6723,6979)
2 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing bucket} (6766,7076)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndex (6685,7406)
2 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase (7167,7463)
2 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation (6893,7166)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexFails (6685,7406)
1 org.opensearch.action.admin.indices.create.CreateIndexIT.classMethod (7464)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testDeleteBlobs (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testList (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testMultipleSnapshotAndRollback (6589)
1 org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (6606)
1 org.opensearch.action.admin.cluster.tasks.PendingTasksBlocksIT.testPendingTasksWithClusterNotRecoveredBlock (6653)
1 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites (6667)
1 org.opensearch.action.bulk.BulkIntegrationIT.testBulkIndexCreatesMapping (6723)
1 org.opensearch.cluster.decommission.DecommissionControllerTests.testTimesOut (6747)
1 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot} (6758)
1 org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart (6764)
1 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (6781)
1 org.opensearch.client.PitIT.testCreateAndDeletePit (6781)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesGenIncrease (6824)
1 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Get a non existing snapshot} (6953)
1 org.opensearch.client.ReindexIT.testReindexTask (6962)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (6970)
1 org.opensearch.cluster.routing.allocation.decider.ConcurrentRecoveriesAllocationDeciderTests.testClusterConcurrentRecoveries (7022)
1 org.opensearch.search.aggregations.metrics.TDigestPercentilesIT.testMultiValuedFieldWithValueScriptReverse (7208)
1 org.opensearch.cluster.ClusterHealthIT.testHealthOnClusterManagerFailover (7272)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringFetchPhaseUsingRequestParameter (7318)
1 org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards (7345)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndex (7415)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndexToN (7415)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (7422)
1 org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently (7464)
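
Lines in the report above can be post-processed further, for example to see how many distinct builds a test failed in. This is a minimal sketch, assuming every line has the shape `<count> <test name> (<build>,<build>,...)` shown above; `parse_report_line` is a hypothetical helper and is not taken from the linked gist.

```python
import re

# Assumed line shape, taken from the report above:
# "<count> <test name> (<build>,<build>,...)"
LINE_RE = re.compile(r"^(\d+) (.+) \(([\d,]+)\)$")


def parse_report_line(line):
    """Split a report line into (failure count, test name, build numbers)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # not a report line (e.g. a separator or heading)
    count, name, builds = match.groups()
    return int(count), name, [int(b) for b in builds.split(",")]


line = (
    "8 org.opensearch.cluster.service.MasterServiceTests.classMethod "
    "(6894,6894,6894,6894,7074,7074,7177,7177)"
)
count, name, builds = parse_report_line(line)
distinct = sorted(set(builds))
print(f"{name}: {count} failures across {len(distinct)} builds {distinct}")
```

Some build numbers appear more than once for the same test; the thread does not say why, so the distinct-build count is a useful second view alongside the raw failure count.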

dblock commented 1 year ago

@andrross I swear I wrote very similar code to produce https://github.com/opensearch-project/OpenSearch/issues/1715#issuecomment-1310928007, but where did I put it? :) thank you!

dblock commented 1 year ago

Found it! https://github.com/dblock/gradle-checks

Rishikesh1159 commented 1 year ago

Thanks @andrross for the script. I ran it to get all flaky tests from the past 2 months (Sep 30, 2022 to Dec 5, 2022). Here is the list of 104 flaky tests found:

Will crawl builds from 3600 to 7680
------------------
130 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} (3619,3619,3695,3719,3719,3720,3720,3743,3744,3744,3902,3902,4173,4279,4382,4602,4602,4602,4751,4751,4752,4752,4793,4793,4793,4946,4946,4946,5122,5123,5123,5298,5298,5341,5341,5354,5354,5396,5396,5396,5399,5489,5533,5533,5533,5556,5557,5557,5557,5572,5954,5955,5955,6060,6061,6061,6061,6132,6132,6133,6151,6155,6156,6172,6188,6218,6218,6221,6221,6233,6234,6234,6254,6254,6389,6389,6391,6436,6469,6469,6470,6470,6475,6476,6476,6476,6547,6547,6548,6561,6561,6561,6577,6587,6591,6591,6598,6645,6709,6711,6711,6717,6750,6751,6766,6778,6778,6779,6779,6779,6782,6879,6879,6880,6880,6952,6953,6953,7074,7074,7074,7080,7082,7082,7177,7200,7201,7224,7277,7310)
38 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness (3666,3679,4180,4207,4679,4691,4866,4953,5343,5395,5396,5437,5577,5733,5897,5923,6096,6175,6205,6562,6601,6627,6717,6741,6908,6921,6925,7036,7047,7112,7149,7422,7447,7495,7517,7555,7563,7612)
38 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (3858,3914,3961,4292,4293,4332,4382,4514,4539,4603,4858,4897,5426,5467,5489,5525,5530,5552,5788,5973,6081,6130,6132,6199,6234,6343,6376,6546,6790,6828,6965,7220,7256,7315,7361,7447,7543,7644)
37 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testTimeoutWhileThrottling (6028,6199,6350,6350,6351,6359,6359,6365,6365,6365,6371,6399,6399,6411,6413,6413,6415,6436,6436,6436,6458,6458,6468,6547,6547,6554,6556,6593,6594,6594,6598,6599,6601,6602,6602,6602,6742)
35 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client} (3619,3719,4382,4638,4792,5122,5294,5354,5395,5531,5556,5878,6060,6128,6133,6151,6152,6152,6156,6156,6218,6254,6390,6436,6436,6475,6548,6589,6709,6952,6952,6953,6953,7200,7277)
29 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections (6474,6481,6607,6616,6628,6700,6700,6720,6759,6759,6762,6828,6887,6971,6971,6975,7027,7112,7115,7168,7168,7202,7315,7315,7596,7596,7611,7617,7617)
25 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections (6585,6681,6962,7046,7090,7095,7149,7149,7149,7158,7188,7206,7206,7253,7253,7253,7274,7274,7274,7327,7463,7483,7492,7651,7651)
17 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermark (3635,3641,3798,3920,3928,4137,4189,4240,4279,4447,4511,4536,4787,4793,4818,4818,5134)
14 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (3658,3695,4453,4599,5142,5347,5740,5858,5894,6183,7185,7212,7231,7342)
14 org.opensearch.indices.stats.IndexStatsIT.testFilterCacheStats (4100,4514,5829,6238,6332,6336,6337,6585,7154,7183,7255,7292,7300,7551)
12 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLast (3964,4234,4268,4272,4446,4826,4879,4891,4975,4975,5114,5121)
12 org.opensearch.cluster.service.MasterServiceTests.classMethod (6894,6894,6894,6894,7074,7074,7177,7177,7634,7634,7634,7634)
9 org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing (3607,3757,3789,3839,4952,6624,6635,6723,6979)
8 org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently (3608,3957,4100,4200,5853,6126,6220,7464)
8 org.opensearch.action.admin.indices.create.CreateIndexIT.classMethod (3608,3957,4100,4200,5853,6126,6220,7464)
8 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing bucket} (4638,5556,6151,6156,6952,6953,7077,7320)
8 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex (6172,6769,7062,7077,7207,7453,7464,7517)
7 org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart (3616,4279,4700,4802,5396,6554,6764)
7 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client} (4450,6156,6390,6711,6711,6711,6952)
7 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing client} (5341,5341,6233,6591,6591,6952,7201)
6 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary (3700,3852,6371,6894,7091,7144)
6 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (3715,4173,4293,5557,6259,6781)
6 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLastReverse (4271,4329,4533,5011,5114,5114)
6 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testDecommissionStatusUpdatePublishedToAllNodes (5165,5379,5530,5612,5642,5677)
6 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone (6356,6359,6599,6602,6731,6771)
6 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes (6894,6894,7074,7177,7634,7634)
5 org.opensearch.upgrades.RecoveryIT.testRelocationWithConcurrentIndexing (4124,4131,4131,4142,4142)
5 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing bucket} (4450,4792,6389,6766,7076)
5 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testThrottlingForSingleNode (6463,6593,6615,6664,6682)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkIndexCreatesMapping (3607,3789,4952,6723)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot} (3619,4042,4281,6758)
4 org.opensearch.cluster.decommission.DecommissionControllerTests.testTimesOut (3651,3805,6468,6747)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Get a non existing snapshot} (3695,6390,6476,6953)
4 org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse (3755,4539,5576,6073)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (3789,4952,6723,6979)
4 org.opensearch.action.bulk.BulkIntegrationIT.testExternallySetAutoGeneratedTimestamp (3789,4952,6723,6979)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (3789,4952,6723,6979)
4 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites (3932,4946,6391,6667)
4 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingFirstReverse (4279,4294,4420,4714)
4 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation (4320,5358,6893,7166)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Restore a non existing snapshot} (4751,6782,6952,7309)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/teardown} (5363,6766,6953,6956)
4 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testInvariantsAndLogsOnDecommissionedNodes (5908,6738,6792,6825)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testSegmentReplication_Index_Update_Delete (3739,4867,6401)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaRestarts (4420,4889,6401)
3 org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards (4894,6393,7345)
3 org.opensearch.index.shard.IndexShardIT.testIndexCanChangeCustomDataPath (4953,4953,4953)
3 org.opensearch.gateway.QuorumGatewayIT.testQuorumRecovery (5165,6562,7201)
3 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndex (6241,6685,7406)
3 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexToN (6241,6685,7406)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerInToBeDecommissionedZone (6606,6709,6895)
2 org.opensearch.http.nio.NioHttpServerTransportTests.testLargeCompressedResponse (3618,7628)
2 org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (3648,6606)
2 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithBackoff (3802,3821)
2 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.testRandomDirectoryIOExceptions (3814,5399)
2 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.classMethod (3814,5399)
2 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndex (4178,7415)
2 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingFirst (4925,4975)
2 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (5302,6970)
2 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteBlobWithRetries (5361,5600)
2 org.opensearch.client.ReindexIT.testReindexTask (6007,6962)
2 org.opensearch.action.admin.cluster.tasks.PendingTasksBlocksIT.testPendingTasksWithClusterNotRecoveredBlock (6170,6653)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexFails (6685,7406)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm (6685,7406)
2 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase (7167,7463)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadRangeBlobWithRetries (3778)
1 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithoutBackoff (3821)
1 org.opensearch.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery (3837)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testSplitFromOneToN (4178)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=indices.split/30_copy_settings/Copy settings during split index} (4236)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=indices.shrink/30_copy_settings/Copy settings during shrink index} (4236)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.classMethod (4420)
1 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimits (4758)
1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringQueryPhase (4926)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesLowerGeneration (5234)
1 org.opensearch.cluster.routing.allocation.RemoteShardsMoveShardsTests.testIndexLevelExclusions (5484)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testIndicesDeletedFromRepository (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testDeleteBlobs (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testWriteRead (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testRequestStats (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotAndRestore (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testMultipleSnapshotAndRollback (5620)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringQueryPhaseUsingRequestParameter (5760)
1 org.opensearch.discovery.StableClusterManagerDisruptionIT.testStaleClusterManagerNotHijackingMajority (5915)
1 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkCommitsMergeOnIdle (6241)
1 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode (6241)
1 org.opensearch.gradle.BuildPluginIT.testInsecureMavenRepository (6406)
1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringQueryPhase (6430)
1 org.opensearch.search.aggregations.bucket.terms.StringTermsIT.classMethod (6465)
1 org.opensearch.upgrade.DetectEsInstallationTaskTests.testTaskExecution (6537)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testDeleteBlobs (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testList (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testMultipleSnapshotAndRollback (6589)
1 org.opensearch.client.PitIT.testCreateAndDeletePit (6781)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesGenIncrease (6824)
1 org.opensearch.cluster.routing.allocation.decider.ConcurrentRecoveriesAllocationDeciderTests.testClusterConcurrentRecoveries (7022)
1 org.opensearch.search.aggregations.metrics.TDigestPercentilesIT.testMultiValuedFieldWithValueScriptReverse (7208)
1 org.opensearch.cluster.ClusterHealthIT.testHealthOnClusterManagerFailover (7272)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringFetchPhaseUsingRequestParameter (7318)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndexToN (7415)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (7422)
1 org.opensearch.test.rest.ClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (7668)
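
For triage, it can help to roll the per-test rows above up to the suite level. The following is a minimal sketch, assuming each result row has the shape `<count> <fully.qualified.Class>.<test name> (<build>,<build>,...)` as in the list above; the function names `parse_line` and `summarize_by_suite` are hypothetical and are not part of the original script.

```python
import re
from collections import defaultdict

# Assumed row shape: "<count> <package.Class>.<test name> (<build>,<build>,...)".
# The class is split from the test name at the last dot of the first
# whitespace-free token, so YAML test names containing spaces or dots still parse.
LINE_RE = re.compile(r"^(\d+) (\S+)\.(.+) \(([\d,]+)\)$")


def parse_line(line):
    """Return (count, suite, test, builds), or None for non-result lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    count, suite, test, builds = match.groups()
    return int(count), suite, test, [int(b) for b in builds.split(",")]


def summarize_by_suite(lines):
    """Map suite -> (total failures, number of distinct builds affected)."""
    failures = defaultdict(int)
    builds_seen = defaultdict(set)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip headers such as "Will crawl builds from ..."
        count, suite, _test, builds = parsed
        failures[suite] += count
        builds_seen[suite].update(builds)
    return {s: (failures[s], len(builds_seen[s])) for s in failures}


if __name__ == "__main__":
    sample = [
        "Will crawl builds from 3600 to 7680",
        "------------------",
        "2 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithBackoff (3802,3821)",
        "1 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithoutBackoff (3821)",
    ]
    for suite, (total, distinct) in summarize_by_suite(sample).items():
        print(f"{suite}: {total} failures across {distinct} builds")
```

Comparing total failures with distinct builds gives a rough hint as to whether a suite fails as a unit (many tests failing in the same build) or whether individual tests are flaky on their own.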

dbwiddis commented 1 year ago

How flaky is acceptable? I closed #6739 after calculating the expected failure rate of a random-alpha-of-length-5 collision at 1 in 19,164. It failed once on run 12,467. It'll probably fail again in a few years. Is that OK?
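
To put those numbers in perspective, here is a rough sketch of the arithmetic, assuming every run is independent and fails with the same fixed probability. The 1-in-19,164 rate is taken as given from the figure above rather than re-derived, and the helper names are hypothetical.

```python
import math


def prob_at_least_one_failure(per_run_failure_rate, runs):
    """P(at least one failure) over `runs` independent runs."""
    return 1.0 - (1.0 - per_run_failure_rate) ** runs


def runs_for_probability(per_run_failure_rate, target):
    """Smallest number of independent runs with P(>= 1 failure) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_run_failure_rate))


if __name__ == "__main__":
    rate = 1 / 19164   # per-run failure rate quoted above (assumed constant)
    observed_run = 12467  # run on which the failure was observed

    p = prob_at_least_one_failure(rate, observed_run)
    print(f"P(at least one failure within {observed_run} runs) = {p:.3f}")
    print(f"Runs needed for a 50% chance of a failure: {runs_for_probability(rate, 0.5)}")
```

Under these assumptions, the chance of at least one failure by run 12,467 comes out somewhat under one half, so a single failure at that point is consistent with the estimated rate and is not evidence of a deeper problem.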

anasalkouz commented 10 months ago

Closing this campaign.