Closed: VachaShah closed this issue 2 years ago.
Hey @jainankitk, can you provide more details on this issue? It's from https://github.com/opensearch-project/OpenSearch/pull/1445. Thanks.
I will prioritize this before EOW and update this thread with my findings!
Hey @jainankitk! I spent some time on the bug, and it was mostly related to adding a timeout to the ensureGreen function so the nodes have time to become available. I raised a PR for the same and it's merged now.
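For illustration, a minimal sketch (assuming a test class extending OpenSearchIntegTestCase) of what passing an explicit timeout to ensureGreen looks like; the index name, the 60-second value, and the TimeValue import path are placeholders and may not match the merged PR exactly:

```java
import org.opensearch.common.unit.TimeValue;

// Inside a test method of a class extending OpenSearchIntegTestCase:
// wait up to 60 seconds for the "foo" index to reach green health before
// the assertion fails, rather than relying on the default wait.
ensureGreen(TimeValue.timeValueSeconds(60), "foo");
```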
Thank you, appreciate that!
Another occurrence : https://fork-jenkins.searchservices.aws.dev/job/OpenSearch_CI/job/PR_Checks/job/Gradle_Check/2122/artifact/gradle_check_2122.log
REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation" -Dtests.seed=63FE4BB64BECDB9A -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sl-SI -Dtests.timezone=America/Blanc-Sablon -Druntime.java=17
org.opensearch.cluster.routing.MovePrimaryFirstTests > testClusterGreenAfterPartialRelocation FAILED
java.lang.AssertionError: timed out waiting for green state
at __randomizedtesting.SeedInfo.seed([63FE4BB64BECDB9A:30342FD360A8A48E]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.opensearch.test.OpenSearchIntegTestCase.ensureColor(OpenSearchIntegTestCase.java:985)
at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:924)
at org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation(MovePrimaryFirstTests.java:116)
Looks like the above-mentioned commit wasn't rebased onto the latest main.
Another failure after main is rebased: https://github.com/opensearch-project/OpenSearch/pull/2033
@jainankitk what do you think about increasing the timeout to 120 seconds here? Looks like the test is still failing at 60.
Another failure here: https://github.com/opensearch-project/OpenSearch/pull/1917#issuecomment-1028229696
Taking a look
The issue is caused by one of the primary shards still initializing while some replicas start in the meantime. Hence, the latch is counted down because half of the shards are already initialized. I am making the check more robust by ensuring that no primaries are initializing and that no more than 20% of replicas have started on the new nodes.
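For context, here is a hedged sketch of that kind of check; the listener body, the latch, the newNodeIds set, and the replicaCount value are illustrative names for this example rather than the actual test code:

```java
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import org.opensearch.cluster.ClusterStateListener;
import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRouting;

// Hypothetical values for illustration only.
final Set<String> newNodeIds = Set.of("node-3", "node-4"); // nodes added for relocation
final int replicaCount = 100;                              // total replica shard copies
final CountDownLatch greenLatch = new CountDownLatch(1);   // awaited before ensureGreen

ClusterStateListener listener = event -> {
    int initializingPrimaries = 0;
    int startedReplicasOnNewNodes = 0;
    for (RoutingNode node : event.state().getRoutingNodes()) {
        for (ShardRouting shard : node) {
            if (shard.primary() && shard.initializing()) {
                initializingPrimaries++;
            } else if (shard.primary() == false && shard.started() && newNodeIds.contains(node.nodeId())) {
                startedReplicasOnNewNodes++;
            }
        }
    }
    // Count down only once no primary is still initializing and at most
    // 20% of the replicas have started on the newly added nodes.
    if (initializingPrimaries == 0 && startedReplicasOnNewNodes <= replicaCount / 5) {
        greenLatch.countDown();
    }
};
```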
Another failure here: #2026 comment
Okay, I can see that none of the shards were unassigned; just 1 replica was remaining that would have started given a few more seconds. @owaiskazi19 - I will increase the timeout to 60 seconds! :)
1> --------[foo][98], node[gK0yMdlgQcq7XDvBVcHqHA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=AHHzR21gRwyH9m8Z2tDMoA], unassigned_info[[reason=NODE_LEFT], at[2022-02-09T01:03:06.551Z], delayed=false, details[node_left [dDqzlwdpR5yo4Eu75qv02Q]], allocation_status[no_attempt]]
1> --------[foo][15], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=YA_tO2W5SlGHclGkPmNptQ]
1> --------[foo][40], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6B01KebASMCgo8EzO8YKZg]
1> --------[foo][6], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=cnF4_Bs4QqCuxxj_HFAqgw]
1> --------[foo][95], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pAGuKsjjR-G3RbvFpvYwQQ]
1> --------[foo][20], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pEms1UhDSxW0LUngYrIsPQ]
1> --------[foo][14], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=aqRw-O9sQ3i3odC_TtwSwg]
1> --------[foo][76], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9pd7AEihRFO-_Cwoa_UAGA]
1> --------[foo][7], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OUN6zdcYTFK5F3MfFQOdrg]
1> --------[foo][81], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=H5BXr0llSS-pR3ai8VrySg]
1> --------[foo][89], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-_-RKj8SyCfKIFnvmQL4w]
1> --------[foo][24], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pq4U5n3tQlOGFmrxcNiJ3Q]
1> --------[foo][19], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qs8kmVy1RAKaRKmCdWJsXQ]
1> --------[foo][59], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=AllB9KA8ST-R6hiADkbp8Q]
1> --------[foo][58], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=Cc110Ve-QAyxnMs8gwN5jQ]
1> --------[foo][26], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6bx8XZ49QZCVLwMAHzFO9A]
1> --------[foo][66], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=m-0y5bnxREajEeNHSDVlPQ]
1> --------[foo][86], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=p0Q47DxfTleki8s42UIqhg]
1> --------[foo][9], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-F_n75mR3KpWbHkMXqBVg]
1> --------[foo][17], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qjWibVYFSFildJbykxn3rg]
1> --------[foo][44], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bA-J1mDcR_u7CdCgneZxzQ]
1> --------[foo][94], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qr73XHAKRKebttLJrfbVPw]
1> --------[foo][11], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=CTR9cjS3QfSFG0qt7VRrbQ]
1> --------[foo][28], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=izZjpNhoS-2AFvj4Y4Y2JQ]
1> --------[foo][30], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fzERcAZ8TemBOeaUKWsMxw]
1> --------[foo][53], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fiMnSwICTN--mEf8Fn3oqQ]
1> --------[foo][71], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=7xbaRZI2QPKdvBrE5iT0JQ]
1> --------[foo][32], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bFplE9pGTY-uTVHdl5RByQ]
1> --------[foo][43], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wsdW3kXSRZakgjBEUoLNOw]
1> --------[foo][52], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=zxDyzd0gTISJfR05fNkZSw]
1> --------[foo][3], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=j_ePRsVJRm2DPaDHu-q-tg]
1> --------[foo][92], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OBJ-9NKNTU6kZQ9R0a0s6A]
1> --------[foo][82], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=5L3LOKOiR5mGoFwWoCxkHg]
1> --------[foo][41], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wTks5SaaQ2a7i74Cy7WwXA]
1> --------[foo][50], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=D2n8YmDQSbab6YaDEmNk-w]
1> --------[foo][33], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=i0GmrglcSyWLeM7jBmtU7Q]
1> --------[foo][57], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LngaeINzSQCPywze9NU2Rw]
1> --------[foo][18], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=VxHWMVCUThy08tOoJ6VAww]
1> --------[foo][61], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9EPpu_ThT8u1HXDIX-DVzQ]
1> --------[foo][48], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LE_hapDHTdyGqdUm1xNk_Q]
1> --------[foo][51], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=UWCrnlekRJ6kTKufo_cmUg]
1> --------[foo][85], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=0OIL6unHQMyhG0MpuEXrdQ]
1> --------[foo][38], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=ir4h6zK0SDGCzYEEu5QBUw]
1> ---- unassigned
1> pending tasks:
1> tasks: (1):
1> 933/URGENT/shard-started StartedShardEntry{shardId [[foo][98]], allocationId [AHHzR21gRwyH9m8Z2tDMoA], primary term [1], message [after peer recovery]}/36ms
Is the issue here that this test uses 100 primary shards, which is more than most other tests use, and therefore takes longer and requires more resources from the test machine to get everything started? My concern is that even if we get the timeouts set to work well for the hosts used by the CI infrastructure, it might still be flaky when run on developer machines. Is there anything we can do to make this test more deterministic?
I don't think 100 shards is the issue here. I have been able to run the test on my machine several times without any issue. Though, thinking more about it, the test might be able to run with an even smaller number like 25 or 50 primary shards. I will wait to see if the issue is reported by anyone else.
Okay, I can see that some of the shards were still initializing. I am considering reducing the number of shards instead of increasing the timeout, so as not to increase the overall test suite time.
The test was failing because some replica shards were initializing and completing before the last primary shard could finish initializing. I discussed the issue with @dblock this morning to make the test more predictable. The suggestion was to add a shards-per-node constraint so that exactly 50% of the shards are allowed to relocate onto the new nodes. Including the constraint made the test really lightweight (it completes in less than 5 seconds), and it ran locally more than 500 times without any failure.
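As an illustration of that constraint, a hedged sketch (the shard counts, index name, and createIndex helper usage are assumptions for this example, not necessarily the values used in the fix): cap index.routing.allocation.total_shards_per_node so each node can hold only its even share of shard copies, which forces exactly half of them onto the newly added nodes.

```java
import org.opensearch.common.settings.Settings;

// Illustrative numbers: 50 primaries with 1 replica each = 100 shard copies.
// With 4 data nodes and a per-node cap of 25, the copies must spread evenly,
// so exactly half of them end up on the two newly added nodes.
Settings indexSettings = Settings.builder()
    .put("index.number_of_shards", 50)
    .put("index.number_of_replicas", 1)
    .put("index.routing.allocation.total_shards_per_node", 25)
    .build();
createIndex("foo", indexSettings);
```

The idea is that the allocation cap, rather than a timeout, determines how many shards can end up on each node, which is what makes the test more predictable.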
All of the failures reported above occurred for this test before the latest fix.
No recent failures, this issue can be resolved for good - @VachaShah
This issue can be closed. @VachaShah @dblock @andrross
Describe the bug: Caught on PR #1952. The test timed out while waiting for the cluster to become green. Related PR for the test: #1445.
To Reproduce: Run the REPRODUCE WITH gradlew command shown above.
Expected behavior: The cluster becomes green and the test does not time out.
Plugins: Core OpenSearch.
Host/Environment (please complete the following information):