
[BUG] Timeout on org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation #1957

Closed: VachaShah closed this issue 2 years ago

VachaShah commented 2 years ago

Describe the bug
Caught on PR #1952. The test timed out while waiting for the cluster to become green. Related PR for the test: #1445.

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation" -Dtests.seed=AF1232B890DC88C7 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=he-IL -Dtests.timezone=Japan -Druntime.java=17

To Reproduce
Steps to reproduce the behavior:

  1. Run the above command in the repo. It's a flaky test.

Expected behavior
The cluster becomes green and the test does not time out.

Plugins
Core OpenSearch.

Host/Environment (please complete the following information):

owaiskazi19 commented 2 years ago

Hey @jainankitk, can you provide more details on this issue? It's from https://github.com/opensearch-project/OpenSearch/pull/1445. Thanks.

jainankitk commented 2 years ago

> Hey @jainankitk, can you provide more details on this issue? It's from #1445.

I will prioritize this before EOW and update this thread with my findings!

owaiskazi19 commented 2 years ago

> Hey @jainankitk, can you provide more details on this issue? It's from #1445.
>
> I will prioritize this before EOW and update this thread with my findings!

Hey @jainankitk! I spent some time on the bug; it was mostly related to adding a timeout to the ensureGreen function so the nodes get time to become available. I raised a PR for it, and it's merged now.
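
For reference, a minimal sketch of what such a bounded wait can look like in a test extending OpenSearchIntegTestCase, assuming the ensureGreen(TimeValue, String...) overload from the test framework; the class name, index name, and 60-second value are illustrative rather than the exact code from the merged PR:

```java
import org.opensearch.common.unit.TimeValue;
import org.opensearch.test.OpenSearchIntegTestCase;

public class PrimaryRelocationTimeoutSketch extends OpenSearchIntegTestCase {

    public void testClusterGreenAfterPartialRelocation() throws Exception {
        // ... create the "foo" index and start the additional nodes ...

        // Wait for green health with an explicit bound instead of the default,
        // so slower nodes get time to become available before the assertion.
        ensureGreen(TimeValue.timeValueSeconds(60), "foo");
    }
}
```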

jainankitk commented 2 years ago

> Hey @jainankitk! I spent some time on the bug; it was mostly related to adding a timeout to the ensureGreen function so the nodes get time to become available. I raised a PR for it, and it's merged now.

Thank you, appreciate that!

dreamer-89 commented 2 years ago

Another occurrence: https://fork-jenkins.searchservices.aws.dev/job/OpenSearch_CI/job/PR_Checks/job/Gradle_Check/2122/artifact/gradle_check_2122.log

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation" -Dtests.seed=63FE4BB64BECDB9A -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sl-SI -Dtests.timezone=America/Blanc-Sablon -Druntime.java=17

org.opensearch.cluster.routing.MovePrimaryFirstTests > testClusterGreenAfterPartialRelocation FAILED
    java.lang.AssertionError: timed out waiting for green state
        at __randomizedtesting.SeedInfo.seed([63FE4BB64BECDB9A:30342FD360A8A48E]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.test.OpenSearchIntegTestCase.ensureColor(OpenSearchIntegTestCase.java:985)
        at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:924)
        at org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation(MovePrimaryFirstTests.java:116)
dreamer-89 commented 2 years ago

Another occurrence: https://fork-jenkins.searchservices.aws.dev/job/OpenSearch_CI/job/PR_Checks/job/Gradle_Check/2124/artifact/gradle_check_2124.log

owaiskazi19 commented 2 years ago

Looks like the above-mentioned commit wasn't rebased onto the latest main.

saratvemulapalli commented 2 years ago

Another failure after main is rebased: https://github.com/opensearch-project/OpenSearch/pull/2033

owaiskazi19 commented 2 years ago

@jainankitk what do you think about increasing the timeout to 120 seconds here? Looks like the test is still failing with 60.

andrross commented 2 years ago

Another failure here: https://github.com/opensearch-project/OpenSearch/pull/1917#issuecomment-1028229696

jainankitk commented 2 years ago

Taking a look

jainankitk commented 2 years ago

The issue is caused by one of the primary shards still initializing while some replicas start in the meantime. Hence, the latch is counted down because half of the shards are already initialized. I am making the check more robust by ensuring that no primaries are initializing and that no more than 20% of replicas have started on the new nodes.
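
For illustration, a rough sketch of that kind of check as a ClusterStateListener gating a CountDownLatch; the helper class, the newNodeIds set, and the latch are assumptions for the sketch, not the exact code in the fix:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

import org.opensearch.cluster.ClusterStateListener;
import org.opensearch.cluster.routing.ShardRouting;

final class RelocationProgressCheck {

    /**
     * Returns a listener that releases the latch only when no primary shard is
     * still initializing and fewer than 20% of the replicas have started on
     * the newly added nodes (identified by newNodeIds).
     */
    static ClusterStateListener listener(Set<String> newNodeIds, CountDownLatch latch) {
        return event -> {
            List<ShardRouting> shards = event.state().routingTable().allShards();
            boolean primaryInitializing = shards.stream()
                .anyMatch(s -> s.primary() && s.initializing());
            long totalReplicas = shards.stream()
                .filter(s -> s.primary() == false)
                .count();
            long replicasStartedOnNewNodes = shards.stream()
                .filter(s -> s.primary() == false && s.started())
                .filter(s -> newNodeIds.contains(s.currentNodeId()))
                .count();
            // Count down only when no primary is initializing and fewer than
            // 20% of the replicas have started on the newly added nodes.
            if (primaryInitializing == false && replicasStartedOnNewNodes < 0.2 * totalReplicas) {
                latch.countDown();
            }
        };
    }
}
```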

dreamer-89 commented 2 years ago

Another failure here: #2026 (comment)

jainankitk commented 2 years ago

Okay, I can see that none of the shards was unassigned; just one replica was remaining, and it would have started given a few more seconds. @owaiskazi19 - I will increase the timeout to 60 seconds! :)

  1> --------[foo][98], node[gK0yMdlgQcq7XDvBVcHqHA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=AHHzR21gRwyH9m8Z2tDMoA], unassigned_info[[reason=NODE_LEFT], at[2022-02-09T01:03:06.551Z], delayed=false, details[node_left [dDqzlwdpR5yo4Eu75qv02Q]], allocation_status[no_attempt]]
  1> --------[foo][15], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=YA_tO2W5SlGHclGkPmNptQ]
  1> --------[foo][40], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6B01KebASMCgo8EzO8YKZg]
  1> --------[foo][6], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=cnF4_Bs4QqCuxxj_HFAqgw]
  1> --------[foo][95], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pAGuKsjjR-G3RbvFpvYwQQ]
  1> --------[foo][20], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pEms1UhDSxW0LUngYrIsPQ]
  1> --------[foo][14], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=aqRw-O9sQ3i3odC_TtwSwg]
  1> --------[foo][76], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9pd7AEihRFO-_Cwoa_UAGA]
  1> --------[foo][7], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OUN6zdcYTFK5F3MfFQOdrg]
  1> --------[foo][81], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=H5BXr0llSS-pR3ai8VrySg]
  1> --------[foo][89], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-_-RKj8SyCfKIFnvmQL4w]
  1> --------[foo][24], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pq4U5n3tQlOGFmrxcNiJ3Q]
  1> --------[foo][19], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qs8kmVy1RAKaRKmCdWJsXQ]
  1> --------[foo][59], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=AllB9KA8ST-R6hiADkbp8Q]
  1> --------[foo][58], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=Cc110Ve-QAyxnMs8gwN5jQ]
  1> --------[foo][26], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6bx8XZ49QZCVLwMAHzFO9A]
  1> --------[foo][66], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=m-0y5bnxREajEeNHSDVlPQ]
  1> --------[foo][86], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=p0Q47DxfTleki8s42UIqhg]
  1> --------[foo][9], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-F_n75mR3KpWbHkMXqBVg]
  1> --------[foo][17], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qjWibVYFSFildJbykxn3rg]
  1> --------[foo][44], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bA-J1mDcR_u7CdCgneZxzQ]
  1> --------[foo][94], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qr73XHAKRKebttLJrfbVPw]
  1> --------[foo][11], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=CTR9cjS3QfSFG0qt7VRrbQ]
  1> --------[foo][28], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=izZjpNhoS-2AFvj4Y4Y2JQ]
  1> --------[foo][30], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fzERcAZ8TemBOeaUKWsMxw]
  1> --------[foo][53], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fiMnSwICTN--mEf8Fn3oqQ]
  1> --------[foo][71], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=7xbaRZI2QPKdvBrE5iT0JQ]
  1> --------[foo][32], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bFplE9pGTY-uTVHdl5RByQ]
  1> --------[foo][43], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wsdW3kXSRZakgjBEUoLNOw]
  1> --------[foo][52], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=zxDyzd0gTISJfR05fNkZSw]
  1> --------[foo][3], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=j_ePRsVJRm2DPaDHu-q-tg]
  1> --------[foo][92], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OBJ-9NKNTU6kZQ9R0a0s6A]
  1> --------[foo][82], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=5L3LOKOiR5mGoFwWoCxkHg]
  1> --------[foo][41], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wTks5SaaQ2a7i74Cy7WwXA]
  1> --------[foo][50], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=D2n8YmDQSbab6YaDEmNk-w]
  1> --------[foo][33], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=i0GmrglcSyWLeM7jBmtU7Q]
  1> --------[foo][57], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LngaeINzSQCPywze9NU2Rw]
  1> --------[foo][18], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=VxHWMVCUThy08tOoJ6VAww]
  1> --------[foo][61], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9EPpu_ThT8u1HXDIX-DVzQ]
  1> --------[foo][48], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LE_hapDHTdyGqdUm1xNk_Q]
  1> --------[foo][51], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=UWCrnlekRJ6kTKufo_cmUg]
  1> --------[foo][85], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=0OIL6unHQMyhG0MpuEXrdQ]
  1> --------[foo][38], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=ir4h6zK0SDGCzYEEu5QBUw]
  1> ---- unassigned

  1> pending tasks:
  1> tasks: (1):
  1> 933/URGENT/shard-started StartedShardEntry{shardId [[foo][98]], allocationId [AHHzR21gRwyH9m8Z2tDMoA], primary term [1], message [after peer recovery]}/36ms
andrross commented 2 years ago

Is the issue here that this test uses 100 primary shards, which is more than most other tests use, and therefore it takes longer and requires more resources from the test machine to get everything started? My concern is that even if we tune the timeouts to work well for the hosts used by the CI infrastructure, the test might still be flaky when run on developer machines. Is there anything we can do to make this test more deterministic?

jainankitk commented 2 years ago

> Is the issue here that this test uses 100 primary shards, which is more than most other tests use, and therefore it takes longer and requires more resources from the test machine to get everything started? My concern is that even if we tune the timeouts to work well for the hosts used by the CI infrastructure, the test might still be flaky when run on developer machines. Is there anything we can do to make this test more deterministic?

I don't think 100 shards is the issue here. I have been able to run the test on my machine several times without any issue. Though, thinking more on it, the test might be able to run with an even smaller number of primary shards, like 25 or 50. I will wait to see if the issue is reported by anyone else.

dblock commented 2 years ago

https://github.com/opensearch-project/OpenSearch/pull/2069#issuecomment-1039522591

jainankitk commented 2 years ago

> #2069 (comment)

Okay, I can see that some of the shards were initializing. I am considering reducing the number of shards instead of increasing the timeout, so as not to increase the overall test suite time.

dblock commented 2 years ago

https://github.com/opensearch-project/OpenSearch/pull/2069#issuecomment-1039615096

kartg commented 2 years ago

https://github.com/opensearch-project/OpenSearch/pull/2096#issuecomment-1039647986

jainankitk commented 2 years ago

The test was failing because some replica shards were initializing and completing before the last primary shard could finish initializing. I discussed the issue with @dblock this morning on how to make the test more predictable. The suggestion was to add a shards-per-node constraint so that exactly 50% of the shards relocate to the new nodes. Including the constraint made the test really lightweight (it completes in less than 5 seconds), and it ran locally more than 500 times without any failure.
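
As an aside, a hedged sketch of applying such a constraint through the index-level index.routing.allocation.total_shards_per_node setting once the new nodes have joined; the class, method, and node math here are illustrative assumptions, not the values from the actual change:

```java
import org.opensearch.common.settings.Settings;
import org.opensearch.test.OpenSearchIntegTestCase;

public class ShardsPerNodeConstraintSketch extends OpenSearchIntegTestCase {

    /**
     * Called after the new nodes have joined: capping shards per node at the
     * even share forces the original nodes to shed exactly half of the shard
     * copies, so 50% relocate to the new nodes and no more.
     */
    void constrainShardsPerNode(int primaries, int replicas, int totalNodes) {
        int shardsPerNode = (primaries * (1 + replicas)) / totalNodes; // e.g. 100 * 2 / 4 = 50
        client().admin().indices().prepareUpdateSettings("foo")
            .setSettings(Settings.builder()
                .put("index.routing.allocation.total_shards_per_node", shardsPerNode)
                .build())
            .get();
    }
}
```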

jainankitk commented 2 years ago

All of the failures reported above were for this test before the latest fix.

There have been no recent failures, so this issue can be resolved for good - @VachaShah

jainankitk commented 2 years ago

This issue can be closed. @VachaShah @dblock @andrross