Closed anasalkouz closed 10 months ago
I did a quick experiment overnight on my dev machine where I ran the internalClusterTest
all night in a loop:
for i in $(seq 0 1000) ; do echo "Iteration: $i" && ./gradlew ':server:internalClusterTest' >> test-output.txt 2>&1 ; done
Results:
$ egrep 'BUILD (SUCCESSFUL|FAILED)' test-output.txt | wc -l
152
$ egrep 'BUILD FAILED' test-output.txt | wc -l
3
$ egrep '^REPRODUCE' test-output.txt | less -S | uniq
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate" -Dtests.seed=7B8B067879F3C91F -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en -Dtests.timezone=Brazil/West -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=9F8306D99E2C2EF1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=id -Dtests.timezone=Asia/Aqtau -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate" -Dtests.seed=6D39D8439C254FF0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-VE -Dtests.timezone=Pacific/Honolulu -Druntime.java=17
All 3 failures were caused by "Suite timeout exceeded (>= 1200000 msec)."
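The two counts above can be turned into a failure rate in one step. This is a minimal sketch, assuming the log has the same shape as test-output.txt (one "BUILD SUCCESSFUL" or "BUILD FAILED" line per iteration); the function name build_failure_rate is made up for illustration.

```shell
# Summarize a log produced by the loop above: total builds, failed builds,
# and the failure rate as a percentage.
build_failure_rate() {
  local log="$1"
  local total failed
  # grep -c prints the number of matching lines; it exits non-zero when that
  # number is 0, hence the "|| true".
  total=$(grep -cE 'BUILD (SUCCESSFUL|FAILED)' "$log" || true)
  failed=$(grep -c 'BUILD FAILED' "$log" || true)
  if [ "${total:-0}" -eq 0 ]; then
    echo "no completed builds found in $log" >&2
    return 1
  fi
  # awk does the floating-point division that shell arithmetic lacks.
  awk -v f="$failed" -v t="$total" \
    'BEGIN { printf "%d/%d failed (%.1f%%)\n", f, t, 100 * f / t }'
}
```

Calling build_failure_rate test-output.txt on the log from this experiment should report 3 failures out of 152 builds, i.e. roughly 2%.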
From this I'll make a couple hypotheses:
While failures of ./gradlew check often manifest as a failure somewhere in :server:internalClusterTest, they are not the result of buggy logic within the tests themselves, but instead are the result of interference between gradle tasks running concurrently, or some other problem with the CI environment. (I make this claim because the ~2% failure rate observed in my experiment seems much lower than the failure rate we're observing in the PR checks.)
I'm going to repeat my experiment but run the full check task instead of just :server:internalClusterTest. If hypothesis 2 is correct then I should see a higher failure rate than the 3 out of 152 observed in this first experiment.
Dev environment:
Another flaky test: Coming from: https://github.com/opensearch-project/OpenSearch/pull/1725
* What went wrong:
Execution failed for task ':qa:rolling-upgrade:v1.3.0#oldClusterTest'.
> `node{:qa:rolling-upgrade:v1.3.0-0}` failed to wait for ports files after 120000 MILLISECONDS
Looking into it.
A simple plan to begin with can involve the steps below:
1. Analyze. Analyze the last X failed Jenkins builds (X=20), identify the failed tests, and count how often each one fails. This will help in prioritizing the right failures.
2. Reproduce. Failures identified above may need a deeper dive to find root causes, as well as the ability to reproduce them locally. The expectation from this step is a dev setup where failures can be replicated. Begin with the targeted test (fast); if that does not help, run the entire test suite (slow). Failures may not happen every time, so the tests need to be repeated multiple times, as done by @andrross above. Replication may need a setup similar to the one used in Jenkins (worst case: an actual Jenkins setup). Add logging wherever necessary to dig into the issue. Replication may uncover new bugs in the tests; these failures should be properly documented and fixed as well, to increase overall test stability.
3. Fix. Fixing tests depends on the type of failure, which can broadly be classified into the categories below. This step may run in sequence after step 2 or in parallel, depending on the failures identified in step 1.
a. True transient failures.
Failures which happen randomly and are out of our control, e.g. node connection timeouts caused by a bad node, networking issues, etc. The only fix in this case is to either increase the corresponding parameters (timeouts) or skip the test until a proper fix is identified.
b. Setup related.
There may be a class of failures related to misconfiguration (backward compatibility tests etc.); these are the easiest to identify. These tests may need minor configuration changes.
c. Bug fix.
The remaining class of failures are corner cases whose root causes are trickier and may need specific areas of expertise. Based on the area of failure, the relevant engineer needs to be involved to debug the issue further.
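Step 1 of the plan above (count how often each test fails) can be sketched with standard text tools. This assumes the console logs of the failed builds have been saved locally as text files and that each failure prints a "REPRODUCE WITH" line like the ones earlier in this thread; the function name failing_test_counts is hypothetical.

```shell
# Tally how often each test shows up in "REPRODUCE WITH" lines.
# Arguments: one or more saved log files.
failing_test_counts() {
  # Pull out the quoted argument of --tests, count duplicates, and print
  # "<count> <test>" with the most frequent failures first.
  grep -h '^REPRODUCE WITH' "$@" \
    | sed -n 's/.*--tests "\([^"]*\)".*/\1/p' \
    | sort | uniq -c \
    | awk '{ print $1, $2 }' \
    | sort -k1,1nr -k2,2
}
```

The sort before uniq -c matters: uniq only collapses adjacent duplicate lines, so the same test failing in non-consecutive runs would otherwise be counted separately.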
- Analyze last X failed Jenkins builds (X=20)
I think it is a good idea to collect this data. It might be a bit hard to separate out the failures that were caused by the change in the PR that triggered the build. Setting up a test machine to run checks continually should be able to get similar data, and will have the benefit of running against a static code base.
- Reproduce
We've probably seen enough of these to know they aren't reproducible when re-run in isolation. We have open issues with quite a few errors and none of them can be reproduced even when re-running the individual test many, many times. I think running the entire test suite is the way to go, but we probably don't need to worry about the Jenkins stuff and can just trigger the ./gradlew check command directly.
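For repeated local runs, a small wrapper that stops at the first failure may be handy, since it leaves the failing run's build output in place for inspection. This is only a sketch: repeat_until_failure and the REPEAT_LOG variable are made-up names, not part of the OpenSearch build.

```shell
# Run a command up to N times and stop at the first failure.
# Usage: repeat_until_failure <max-iterations> <command> [args...]
# Output of the command is appended to $REPEAT_LOG (default: discarded).
repeat_until_failure() {
  local max="$1"
  shift
  local i=1
  while [ "$i" -le "$max" ]; do
    if ! "$@" >> "${REPEAT_LOG:-/dev/null}" 2>&1; then
      echo "failed on iteration $i"
      return 1
    fi
    i=$((i + 1))
  done
  echo "passed $max iterations"
}
```

For example, REPEAT_LOG=check-output.txt repeat_until_failure 100 ./gradlew check would run the full check task until it first fails. Unlike the earlier loop, this does not keep going after a failure, so it is better suited to capturing one failure than to measuring a failure rate.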
Another one, coming from: https://github.com/opensearch-project/OpenSearch/pull/1766
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority" -Dtests.seed=28AD28E1A3FF50C7 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-PH -Dtests.timezone=Etc/GMT+8 -Druntime.java=15
org.opensearch.discovery.StableMasterDisruptionIT > testStaleMasterNotHijackingMajority FAILED
java.lang.AssertionError: node_t1: [Tuple [v1=node_t2, v2=null]]
at __randomizedtesting.SeedInfo.seed([28AD28E1A3FF50C7:77AB65EE82248FCB]:0)
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.opensearch.discovery.StableMasterDisruptionIT.lambda$testStaleMasterNotHijackingMajority$5(StableMasterDisruptionIT.java:253)
at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1048)
at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1021)
at org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority(StableMasterDisruptionIT.java:250)
I ran another experiment over the weekend, the theory being that maybe :qa:mixed-cluster:v1.2.2#mixedClusterTest was interfering with :server:internalClusterTest:
for i in $(seq 0 1000) ; do echo "Iteration: $i" && ./gradlew clean > /dev/null 2>&1 && ./gradlew :server:internalClusterTest :qa:mixed-cluster:v1.2.2#mixedClusterTest >> ../build-failure-tests/test-output-2021-12-17_2.txt 2>&1 ; done
but the results were 7 failures out of 330, which is in line with the ~2% failure rate of the integ tests in isolation. The failures were:
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=60436199814D8A58 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sr-CS -Dtests.timezone=Etc/GMT+5 -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=8EC37C710AA42BCE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=no-NO -Dtests.timezone=EET -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=B4175006736B7460 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-US -Dtests.timezone=Africa/Casablanca -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites" -Dtests.seed=6AF32DFBEB864CEE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hant-TW -Dtests.timezone=PRC -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites" -Dtests.seed=D921821394B6DBAA -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-GB -Dtests.timezone=America/Nipigon -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=FA529FAA49915455 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-SY -Dtests.timezone=AET -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=FC550CFC70BBB318 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hans-CN -Dtests.timezone=America/Knox_IN -Druntime.java=17
There are likely bugs within ClusterHealthIT, ShardIndexingPressureIT, and ShardIndexingPressureSettingsIT that cause rare failures. But it remains a mystery what is causing ./gradlew check to fail at a much higher rate in the CI workflow than in these experiments.
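One rough way to check that 7 out of 330 is "in line with" 3 out of 152 is to put an uncertainty range around each rate. The sketch below computes a 95% Wilson score interval with awk; it assumes the runs are independent, which may not hold on a shared machine, and the function name wilson_interval is made up for illustration.

```shell
# Print a failure rate with a 95% Wilson score interval.
# Usage: wilson_interval <failures> <runs>
wilson_interval() {
  [ "$2" -gt 0 ] || { echo "runs must be positive" >&2; return 1; }
  awk -v k="$1" -v n="$2" 'BEGIN {
    z = 1.96                      # approximate 95% two-sided normal quantile
    p = k / n                     # observed failure rate
    denom  = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half   = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    printf "%d/%d = %.1f%% (95%% Wilson interval %.1f%% to %.1f%%)\n",
           k, n, 100 * p, 100 * (center - half), 100 * (center + half)
  }'
}
```

Running wilson_interval 3 152 and wilson_interval 7 330 gives ranges that overlap heavily, which fits the reading that both experiments show the same underlying failure rate of about 2%.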
1725
I opened https://github.com/opensearch-project/OpenSearch/issues/1793 for this one specifically.
/cc @getsaurabh02
ShardIndexingPressureSettingsIT is a problem child. Can y'all investigate the recurring "Suite timeout exceeded (>= 1200000 msec)." and see if this is either a real issue with the Indexing Pressure implementation or simply a test cluster resourcing issue when run in the context of the entire check suite?
Suraj @dreamer-89 has been digging into the ShardIndexingPressureSettingsIT failures, tracked in #1843
:+1: Also note open PR #1592
I copied some links into the body of this issue... it's quite a list.
another one #2176.
Between gradle check 6786 and 6688 (100 builds) the following tests failed more than once:
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials}: 12
org.opensearch.test.rest.ClientYamlTestSuiteIT/test {p0=search/30_limits/Regexp length limit}: 6
org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT/test {yaml=search/30_limits/Regexp length limit}: 6
org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests/testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections: 5
org.opensearch.action.support.AutoCreateIndexTests/testParseFailed: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfReplicasIsNonNegative: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfShardsIsNotZero: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfShardsIsNotNegative: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfRoutingShards: 2
org.opensearch.cluster.routing.allocation.DiskThresholdSettingsTests/testInvalidHighDiskThreshold: 2
org.opensearch.cluster.allocation.AwarenessAllocationIT/testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness: 2
org.opensearch.common.settings.ScopedSettingsTests/testLoggingUpdates: 2
org.opensearch.cluster.coordination.NoClusterManagerBlockServiceTests/testRejectsInvalidSetting: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=search/320_disallow_queries/Test disallow expensive queries}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=cluster.put_settings/10_basic/Test put and reset persistent settings}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=search.aggregation/240_max_buckets/Max bucket}: 2
org.opensearch.action.support.AutoCreateIndexTests/testParseFailedMissingIndex: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=cluster.put_settings/10_basic/Test put and reset transient settings}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Add transient remote cluster in sniff mode with invalid proxy settings}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Switch connection mode for configured cluster}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Add transient remote cluster in proxy mode with invalid sniff settings}: 2
org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT/testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=scroll/20_keep_alive/Max keep alive}: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client}: 2
org.opensearch.cluster.coordination.ElectionSchedulerFactoryTests/testSettingsValidation: 2
org.opensearch.common.settings.ScopedSettingsTests/testValidate: 2
org.opensearch.repositories.gcs.GoogleCloudStorageBlobStoreRepositoryTests/testChunkSize: 2
org.opensearch.action.admin.cluster.settings.SettingsUpdaterTests/testUpdateOfValidationDependentSettings: 2
org.opensearch.cluster.routing.OperationRoutingTests/testWeightedOperationRoutingWeightUndefinedForOneZone: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client}: 2
org.opensearch.action.admin.cluster.settings.SettingsUpdaterTests/testAllOrNothing: 2
org.opensearch.cluster.metadata.AutoExpandReplicasTests/testInvalidValues: 2
Another ~100 failed once.
I am aiming to get these flaky tests down to zero by Dec 30, 2022. If anyone wants to help in this effort, please feel free to pick one of the flaky test issues in this list.
I have added the following 2 issues as proactive mechanisms to detect flaky test failures and prevent newly introduced flaky tests: https://github.com/opensearch-project/OpenSearch/issues/5226 https://github.com/opensearch-project/OpenSearch/issues/5227
I wrote a script to crawl the Jenkins output for unstable builds: https://gist.github.com/andrross/ee07a8a05beb63f1173bcb98523918b9
Below are the results for the last 1000 builds. There is a long tail of tests with a few failures, but the top 4 failures have issues already (#5219, #4212, #5157, #3603).
41 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} (6561,6561,6561,6577,6587,6591,6591,6598,6645,6709,6711,6711,6717,6750,6751,6766,6778,6778,6779,6779,6779,6782,6879,6879,6880,6880,6952,6953,6953,7074,7074,7074,7080,7082,7082,7177,7200,7201,7224,7277,7310)
23 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections (6585,6681,6962,7046,7090,7095,7149,7149,7149,7158,7188,7206,7206,7253,7253,7253,7274,7274,7274,7327,7463,7483,7492)
22 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections (6607,6616,6628,6700,6700,6720,6759,6759,6762,6828,6887,6971,6971,6975,7027,7112,7115,7168,7168,7202,7315,7315)
17 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness (6562,6601,6627,6717,6741,6908,6921,6925,7036,7047,7112,7149,7422,7447,7495,7517,7555)
11 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testTimeoutWhileThrottling (6556,6593,6594,6594,6598,6599,6601,6602,6602,6602,6742)
9 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (6790,6828,6965,7220,7256,7315,7361,7447,7543)
8 org.opensearch.cluster.service.MasterServiceTests.classMethod (6894,6894,6894,6894,7074,7074,7177,7177)
8 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client} (6589,6709,6952,6952,6953,6953,7200,7277)
7 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex (6769,7062,7077,7207,7453,7464,7517)
7 org.opensearch.indices.stats.IndexStatsIT.testFilterCacheStats (6585,7154,7183,7255,7292,7300,7551)
4 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone (6599,6602,6731,6771)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing bucket} (6952,6953,7077,7320)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client} (6711,6711,6711,6952)
4 org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing (6624,6635,6723,6979)
4 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (7185,7212,7231,7342)
4 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes (6894,6894,7074,7177)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing client} (6591,6591,6952,7201)
4 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testThrottlingForSingleNode (6593,6615,6664,6682)
3 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/teardown} (6766,6953,6956)
3 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Restore a non existing snapshot} (6782,6952,7309)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerInToBeDecommissionedZone (6606,6709,6895)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary (6894,7091,7144)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testInvariantsAndLogsOnDecommissionedNodes (6738,6792,6825)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm (6685,7406)
2 org.opensearch.gateway.QuorumGatewayIT.testQuorumRecovery (6562,7201)
2 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (6723,6979)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexToN (6685,7406)
2 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (6723,6979)
2 org.opensearch.action.bulk.BulkIntegrationIT.testExternallySetAutoGeneratedTimestamp (6723,6979)
2 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing bucket} (6766,7076)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndex (6685,7406)
2 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase (7167,7463)
2 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation (6893,7166)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexFails (6685,7406)
1 org.opensearch.action.admin.indices.create.CreateIndexIT.classMethod (7464)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testDeleteBlobs (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testList (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testMultipleSnapshotAndRollback (6589)
1 org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (6606)
1 org.opensearch.action.admin.cluster.tasks.PendingTasksBlocksIT.testPendingTasksWithClusterNotRecoveredBlock (6653)
1 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites (6667)
1 org.opensearch.action.bulk.BulkIntegrationIT.testBulkIndexCreatesMapping (6723)
1 org.opensearch.cluster.decommission.DecommissionControllerTests.testTimesOut (6747)
1 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot} (6758)
1 org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart (6764)
1 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (6781)
1 org.opensearch.client.PitIT.testCreateAndDeletePit (6781)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesGenIncrease (6824)
1 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Get a non existing snapshot} (6953)
1 org.opensearch.client.ReindexIT.testReindexTask (6962)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (6970)
1 org.opensearch.cluster.routing.allocation.decider.ConcurrentRecoveriesAllocationDeciderTests.testClusterConcurrentRecoveries (7022)
1 org.opensearch.search.aggregations.metrics.TDigestPercentilesIT.testMultiValuedFieldWithValueScriptReverse (7208)
1 org.opensearch.cluster.ClusterHealthIT.testHealthOnClusterManagerFailover (7272)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringFetchPhaseUsingRequestParameter (7318)
1 org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards (7345)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndex (7415)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndexToN (7415)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (7422)
1 org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently (7464)
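Because many of the entries above belong to the same test class, it can help to roll the tally up per class to see where the long tail clusters. A minimal sketch, assuming input lines of the form "<count> <fully.qualified.Class>.<method> ... (<builds>)" as printed above; failures_per_class is a hypothetical helper, not part of the crawl script.

```shell
# Sum failure counts per test class from tally lines.
# Reads the named files, or standard input when no file is given.
failures_per_class() {
  awk '
    # Only lines that start with a count followed by a dotted test name.
    $1 ~ /^[0-9]+$/ && $2 ~ /\./ {
      cls = $2
      sub(/\.[^.]*$/, "", cls)    # drop the trailing ".method"
      total[cls] += $1
    }
    END { for (c in total) print total[c], c }
  ' "$@" | sort -k1,1nr -k2,2
}
```

Note that this sums test failures, not distinct builds: a build in which several tests of the same class failed is counted once per test.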
@andrross I swear I wrote very similar code to produce https://github.com/opensearch-project/OpenSearch/issues/1715#issuecomment-1310928007, but where did I put it? :) thank you!
Found it! https://github.com/dblock/gradle-checks
Thanks @andrross for the script. I ran it to get all flaky tests from the past 2 months (Sep 30, 2022 to Dec 5, 2022). Here is the list of 104 flaky tests found:
Will crawl builds from 3600 to 7680
------------------
130 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} (3619,3619,3695,3719,3719,3720,3720,3743,3744,3744,3902,3902,4173,4279,4382,4602,4602,4602,4751,4751,4752,4752,4793,4793,4793,4946,4946,4946,5122,5123,5123,5298,5298,5341,5341,5354,5354,5396,5396,5396,5399,5489,5533,5533,5533,5556,5557,5557,5557,5572,5954,5955,5955,6060,6061,6061,6061,6132,6132,6133,6151,6155,6156,6172,6188,6218,6218,6221,6221,6233,6234,6234,6254,6254,6389,6389,6391,6436,6469,6469,6470,6470,6475,6476,6476,6476,6547,6547,6548,6561,6561,6561,6577,6587,6591,6591,6598,6645,6709,6711,6711,6717,6750,6751,6766,6778,6778,6779,6779,6779,6782,6879,6879,6880,6880,6952,6953,6953,7074,7074,7074,7080,7082,7082,7177,7200,7201,7224,7277,7310)
38 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness (3666,3679,4180,4207,4679,4691,4866,4953,5343,5395,5396,5437,5577,5733,5897,5923,6096,6175,6205,6562,6601,6627,6717,6741,6908,6921,6925,7036,7047,7112,7149,7422,7447,7495,7517,7555,7563,7612)
38 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (3858,3914,3961,4292,4293,4332,4382,4514,4539,4603,4858,4897,5426,5467,5489,5525,5530,5552,5788,5973,6081,6130,6132,6199,6234,6343,6376,6546,6790,6828,6965,7220,7256,7315,7361,7447,7543,7644)
37 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testTimeoutWhileThrottling (6028,6199,6350,6350,6351,6359,6359,6365,6365,6365,6371,6399,6399,6411,6413,6413,6415,6436,6436,6436,6458,6458,6468,6547,6547,6554,6556,6593,6594,6594,6598,6599,6601,6602,6602,6602,6742)
35 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client} (3619,3719,4382,4638,4792,5122,5294,5354,5395,5531,5556,5878,6060,6128,6133,6151,6152,6152,6156,6156,6218,6254,6390,6436,6436,6475,6548,6589,6709,6952,6952,6953,6953,7200,7277)
29 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections (6474,6481,6607,6616,6628,6700,6700,6720,6759,6759,6762,6828,6887,6971,6971,6975,7027,7112,7115,7168,7168,7202,7315,7315,7596,7596,7611,7617,7617)
25 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections (6585,6681,6962,7046,7090,7095,7149,7149,7149,7158,7188,7206,7206,7253,7253,7253,7274,7274,7274,7327,7463,7483,7492,7651,7651)
17 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermark (3635,3641,3798,3920,3928,4137,4189,4240,4279,4447,4511,4536,4787,4793,4818,4818,5134)
14 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (3658,3695,4453,4599,5142,5347,5740,5858,5894,6183,7185,7212,7231,7342)
14 org.opensearch.indices.stats.IndexStatsIT.testFilterCacheStats (4100,4514,5829,6238,6332,6336,6337,6585,7154,7183,7255,7292,7300,7551)
12 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLast (3964,4234,4268,4272,4446,4826,4879,4891,4975,4975,5114,5121)
12 org.opensearch.cluster.service.MasterServiceTests.classMethod (6894,6894,6894,6894,7074,7074,7177,7177,7634,7634,7634,7634)
9 org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing (3607,3757,3789,3839,4952,6624,6635,6723,6979)
8 org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently (3608,3957,4100,4200,5853,6126,6220,7464)
8 org.opensearch.action.admin.indices.create.CreateIndexIT.classMethod (3608,3957,4100,4200,5853,6126,6220,7464)
8 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing bucket} (4638,5556,6151,6156,6952,6953,7077,7320)
8 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex (6172,6769,7062,7077,7207,7453,7464,7517)
7 org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart (3616,4279,4700,4802,5396,6554,6764)
7 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client} (4450,6156,6390,6711,6711,6711,6952)
7 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing client} (5341,5341,6233,6591,6591,6952,7201)
6 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary (3700,3852,6371,6894,7091,7144)
6 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (3715,4173,4293,5557,6259,6781)
6 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLastReverse (4271,4329,4533,5011,5114,5114)
6 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testDecommissionStatusUpdatePublishedToAllNodes (5165,5379,5530,5612,5642,5677)
6 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone (6356,6359,6599,6602,6731,6771)
6 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes (6894,6894,7074,7177,7634,7634)
5 org.opensearch.upgrades.RecoveryIT.testRelocationWithConcurrentIndexing (4124,4131,4131,4142,4142)
5 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing bucket} (4450,4792,6389,6766,7076)
5 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testThrottlingForSingleNode (6463,6593,6615,6664,6682)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkIndexCreatesMapping (3607,3789,4952,6723)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot} (3619,4042,4281,6758)
4 org.opensearch.cluster.decommission.DecommissionControllerTests.testTimesOut (3651,3805,6468,6747)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Get a non existing snapshot} (3695,6390,6476,6953)
4 org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse (3755,4539,5576,6073)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (3789,4952,6723,6979)
4 org.opensearch.action.bulk.BulkIntegrationIT.testExternallySetAutoGeneratedTimestamp (3789,4952,6723,6979)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (3789,4952,6723,6979)
4 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites (3932,4946,6391,6667)
4 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingFirstReverse (4279,4294,4420,4714)
4 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation (4320,5358,6893,7166)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Restore a non existing snapshot} (4751,6782,6952,7309)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/teardown} (5363,6766,6953,6956)
4 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testInvariantsAndLogsOnDecommissionedNodes (5908,6738,6792,6825)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testSegmentReplication_Index_Update_Delete (3739,4867,6401)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaRestarts (4420,4889,6401)
3 org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards (4894,6393,7345)
3 org.opensearch.index.shard.IndexShardIT.testIndexCanChangeCustomDataPath (4953,4953,4953)
3 org.opensearch.gateway.QuorumGatewayIT.testQuorumRecovery (5165,6562,7201)
3 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndex (6241,6685,7406)
3 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexToN (6241,6685,7406)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerInToBeDecommissionedZone (6606,6709,6895)
2 org.opensearch.http.nio.NioHttpServerTransportTests.testLargeCompressedResponse (3618,7628)
2 org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (3648,6606)
2 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithBackoff (3802,3821)
2 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.testRandomDirectoryIOExceptions (3814,5399)
2 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.classMethod (3814,5399)
2 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndex (4178,7415)
2 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingFirst (4925,4975)
2 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (5302,6970)
2 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteBlobWithRetries (5361,5600)
2 org.opensearch.client.ReindexIT.testReindexTask (6007,6962)
2 org.opensearch.action.admin.cluster.tasks.PendingTasksBlocksIT.testPendingTasksWithClusterNotRecoveredBlock (6170,6653)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexFails (6685,7406)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm (6685,7406)
2 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase (7167,7463)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadRangeBlobWithRetries (3778)
1 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithoutBackoff (3821)
1 org.opensearch.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery (3837)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testSplitFromOneToN (4178)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=indices.split/30_copy_settings/Copy settings during split index} (4236)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=indices.shrink/30_copy_settings/Copy settings during shrink index} (4236)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.classMethod (4420)
1 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimits (4758)
1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringQueryPhase (4926)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesLowerGeneration (5234)
1 org.opensearch.cluster.routing.allocation.RemoteShardsMoveShardsTests.testIndexLevelExclusions (5484)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testIndicesDeletedFromRepository (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testDeleteBlobs (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testWriteRead (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testRequestStats (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotAndRestore (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testMultipleSnapshotAndRollback (5620)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringQueryPhaseUsingRequestParameter (5760)
1 org.opensearch.discovery.StableClusterManagerDisruptionIT.testStaleClusterManagerNotHijackingMajority (5915)
1 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkCommitsMergeOnIdle (6241)
1 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode (6241)
1 org.opensearch.gradle.BuildPluginIT.testInsecureMavenRepository (6406)
1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringQueryPhase (6430)
1 org.opensearch.search.aggregations.bucket.terms.StringTermsIT.classMethod (6465)
1 org.opensearch.upgrade.DetectEsInstallationTaskTests.testTaskExecution (6537)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testDeleteBlobs (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testList (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testMultipleSnapshotAndRollback (6589)
1 org.opensearch.client.PitIT.testCreateAndDeletePit (6781)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesGenIncrease (6824)
1 org.opensearch.cluster.routing.allocation.decider.ConcurrentRecoveriesAllocationDeciderTests.testClusterConcurrentRecoveries (7022)
1 org.opensearch.search.aggregations.metrics.TDigestPercentilesIT.testMultiValuedFieldWithValueScriptReverse (7208)
1 org.opensearch.cluster.ClusterHealthIT.testHealthOnClusterManagerFailover (7272)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringFetchPhaseUsingRequestParameter (7318)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndexToN (7415)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (7422)
1 org.opensearch.test.rest.ClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (7668)
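Many entries in the tally above share a suite and the same build numbers, so grouping by suite can show how many distinct builds each suite actually broke. A minimal sketch, assuming every line follows the `count test-name (build,build,...)` shape shown above; the helper names (`parse_line`, `suite_of`, `builds_per_suite`) are hypothetical and not part of any OpenSearch tooling:

```python
import re
from collections import defaultdict

# Assumed line shape: "<count> <fully.qualified.Test.name> (<build>,<build>,...)"
LINE_RE = re.compile(r"^(\d+)\s+(.+?)\s+\(([\d,]+)\)$")


def parse_line(line):
    """Return (count, test_name, [build numbers]) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    count, name, builds = m.groups()
    return int(count), name, [int(b) for b in builds.split(",")]


def suite_of(test_name):
    """Drop any ' {yaml=...}' or ' {p0=...}' suffix, then strip the method name."""
    head = test_name.split(" {", 1)[0]
    return head.rsplit(".", 1)[0]


def builds_per_suite(lines):
    """Map each suite to the set of distinct builds in which any of its tests failed."""
    suites = defaultdict(set)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip anything that is not a tally line
        _, name, builds = parsed
        suites[suite_of(name)].update(builds)
    return suites


if __name__ == "__main__":
    sample = [
        "4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (3789,4952,6723,6979)",
        "4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (3789,4952,6723,6979)",
        "3 org.opensearch.index.shard.IndexShardIT.testIndexCanChangeCustomDataPath (4953,4953,4953)",
    ]
    for suite, builds in sorted(builds_per_suite(sample).items()):
        print(len(builds), suite, sorted(builds))
```

Using a set per suite collapses repeats, so a test listed three times against the same build (as with `IndexShardIT` above) counts as one affected build rather than three.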
How flaky is acceptable? I closed #6739 after calculating the expected failure rate of a random-alpha-of-length-5 collision at 1 in 19,164. It failed once on run 12,467. It'll probably fail again in a few years. Is that OK?
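To put the 1-in-19,164 figure in context, the chance of seeing at least one such failure within a given number of runs can be worked out directly. This is a rough sketch that assumes each run is independent and fails with the same probability; the function names are illustrative only:

```python
import math

P_FAIL = 1 / 19164      # per-run failure rate quoted above
RUNS_SO_FAR = 12467     # run on which the failure was observed


def prob_at_least_one_failure(p, n):
    """Probability of one or more failures in n independent runs."""
    return 1.0 - (1.0 - p) ** n


def runs_for_probability(p, target):
    """Smallest n for which the chance of at least one failure reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    print(f"P(failure within {RUNS_SO_FAR} runs) = "
          f"{prob_at_least_one_failure(P_FAIL, RUNS_SO_FAR):.3f}")
    print(f"Runs for a 50% chance of a failure: {runs_for_probability(P_FAIL, 0.5)}")
```

Under these assumptions, a failure by run 12,467 is close to a coin flip, so the single observed failure is consistent with the calculated rate rather than evidence of a worse one.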
Closing this campaign.
PRs were blocked by transient gradle check errors multiple times. Provide a plan to stabilize the tests.