dblock closed this issue 1 year ago
These test failures happen when upgrading from the latest minor of the previous major version, i.e. 1.3.12. From the limited logs, it appears that something catastrophic in the cluster brought down the primary (possibly all) shards, making them unavailable for any requests.
REPRODUCE WITH: ./gradlew ':qa:mixed-cluster:v1.3.12#mixedClusterTest' --tests "org.opensearch.backwards.IndexingIT.testIndexVersionPropagation" -Dtests.seed=EDE78F238815CE08 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-OM -Dtests.timezone=Asia/Katmandu -Druntime.java=17
2> org.opensearch.client.ResponseException: method [PUT], host [http://[::1]:39691], URI [indexversionprop/_doc/1], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"unavailable_shards_exception","reason":"[indexversionprop][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[indexversionprop][0]] containing [index {[indexversionprop][_doc][1], source[{\"test\": \"test_BC\"}]}]]"}],"type":"unavailable_shards_exception","reason":"[indexversionprop][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[indexversionprop][0]] containing [index {[indexversionprop][_doc][1], source[{\"test\": \"test_BC\"}]}]]"},"status":503}
at __randomizedtesting.SeedInfo.seed([EDE78F238815CE08:6C74DB032DE30857]:0)
at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:335)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
at app//org.opensearch.backwards.IndexingIT.indexDocs(IndexingIT.java:70)
at app//org.opensearch.backwards.IndexingIT.indexDocWithConcurrentUpdates(IndexingIT.java:82)
org.opensearch.backwards.IndexingIT.testUpdateSnapshotStatus
org.opensearch.client.ResponseException: method [PUT], host [http://[::1]:42717], URI [test-snapshot-index/_doc/0], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"unavailable_shards_exception","reason":"[test-snapshot-index][7] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[test-snapshot-index][7]] containing [index {[test-snapshot-index][_doc][0], source[{\"test\": \"test_Io\"}]}]]"}],"type":"unavailable_shards_exception","reason":"[test-snapshot-index][7] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[test-snapshot-index][7]] containing [index {[test-snapshot-index][_doc][0], source[{\"test\": \"test_Io\"}]}]]"},"status":503}
org.opensearch.backwards.IndexingIT.testSeqNoCheckpoints
org.opensearch.backwards.IndexingIT.classMethod
java.lang.Exception: Test abandoned because suite timeout was reached.
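Both 503 responses above carry the same JSON error shape, so the failing index and shard can be pulled out mechanically when triaging several runs. The following is a minimal sketch, not part of the test suite: `summarize_failure` is a hypothetical helper name, and the regular expression assumes the `reason` string always begins with `[index][shard]`, as it does in the two bodies shown here. `ERROR_BODY` is the testIndexVersionPropagation response copied from above.

```python
import json
import re

# Response body copied from the testIndexVersionPropagation failure above.
ERROR_BODY = (
    r'{"error":{"root_cause":[{"type":"unavailable_shards_exception",'
    r'"reason":"[indexversionprop][0] primary shard is not active Timeout: [1m], '
    r'request: [BulkShardRequest [[indexversionprop][0]] containing '
    r'[index {[indexversionprop][_doc][1], source[{\"test\": \"test_BC\"}]}]]"}],'
    r'"type":"unavailable_shards_exception",'
    r'"reason":"[indexversionprop][0] primary shard is not active Timeout: [1m], '
    r'request: [BulkShardRequest [[indexversionprop][0]] containing '
    r'[index {[indexversionprop][_doc][1], source[{\"test\": \"test_BC\"}]}]]"},'
    r'"status":503}'
)


def summarize_failure(body: str) -> dict:
    """Extract status, exception type, index and shard from an error body.

    Hypothetical helper: assumes the reason starts with "[index][shard]".
    """
    doc = json.loads(body)
    root = doc["error"]["root_cause"][0]
    match = re.match(r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]", root["reason"])
    return {
        "status": doc["status"],
        "type": root["type"],
        "index": match.group("index") if match else None,
        "shard": int(match.group("shard")) if match else None,
    }


if __name__ == "__main__":
    print(summarize_failure(ERROR_BODY))
```

Running the same helper over the testUpdateSnapshotStatus body should report `test-snapshot-index` and shard 7, which makes it easy to see that both failures share one exception type.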
Failures 1 and 2 (testIndexVersionPropagation and testUpdateSnapshotStatus) are both caused by unavailable_shards_exception, possibly the result of a catastrophic failure that left the shards unavailable.
Failures 3 and 4 (testSeqNoCheckpoints and classMethod) are timeouts. The timeout is reported while the test is waiting for the result of an indexing operation. I suspect this is related to failures 1 and 2 above.
2> "TEST-IndexingIT.testSeqNoCheckpoints-seed#[EDE78F238815CE08]" ID=232 WAITING on org.apache.http.concurrent.BasicFuture@798608c5
2> at java.****@17.0.7/java.lang.Object.wait(Native Method)
2> - waiting on org.apache.http.concurrent.BasicFuture@798608c5
2> at java.****@17.0.7/java.lang.Object.wait(Object.java:338)
2> at app//org.apache.http.concurrent.BasicFuture.get(BasicFuture.java:82)
2> at app//org.apache.http.impl.nio.client.FutureWrapper.get(FutureWrapper.java:70)
2> at app//org.opensearch.client.RestClient.performRequest(RestClient.java:328)
2> at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
2> at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
2> at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
2> at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
2> at app//org.opensearch.backwards.IndexingIT.indexDocs(IndexingIT.java:70)
2> at app//org.opensearch.backwards.IndexingIT.testSeqNoCheckpoints(IndexingIT.java:204)
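The thread dump shows the client blocked on an indexing request while the primary shard was not active. One way to separate "cluster never recovered" from "indexing itself is slow" when debugging is to ask the cluster health API to wait for at least yellow status (all primaries assigned) before indexing. The sketch below only builds the request URL and interprets a response; it is not what IndexingIT does, the helper names are hypothetical, and the 50s timeout is an arbitrary choice.

```python
import json
from urllib.parse import urlencode


def health_url(host: str, index: str, timeout: str = "50s") -> str:
    """Build a cluster health URL that waits until the index is at least yellow."""
    query = urlencode({"wait_for_status": "yellow", "timeout": timeout})
    return f"http://{host}/_cluster/health/{index}?{query}"


def primaries_active(health_body: str) -> bool:
    """Return True if the health response says all primaries are assigned.

    Yellow or green status means every primary shard is active; a response
    with timed_out=true means the wait expired before that happened.
    """
    health = json.loads(health_body)
    return not health["timed_out"] and health["status"] in ("yellow", "green")


if __name__ == "__main__":
    # Host and index taken from the first failure in this issue.
    print(health_url("[::1]:39691", "indexversionprop"))
    # Hypothetical response body, trimmed to the two fields that are read.
    print(primaries_active('{"status": "red", "timed_out": true}'))
```

If the health call times out with red status, the problem is shard allocation in the mixed cluster rather than the indexing request, which matches the suspicion that the timeouts and the 503 responses share a cause.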
This error is not reproducible on the latest 2.9 (https://github.com/opensearch-project/OpenSearch/commit/3a7c95a9112d79321afe158486025936f6d79282), with or without the given seed.
➜ OpenSearch git:(2.9) ./gradlew ':qa:mixed-cluster:v1.3.12#mixedClusterTest' --tests "org.opensearch.backwards.IndexingIT.testIndexVersionPropagation" -Dtests.seed=EDE78F238815CE08 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-OM -Dtests.timezone=Asia/Katmandu -Druntime.java=17
> Configure project :
Invalid Java installation found at '/Library/Java/JavaVirtualMachines/jdk-14.jdk/Contents/Home' (Java home). It will be re-checked in the next build. This might have performance impact if it keeps failing. Run the 'javaToolchains' task for more details.
=======================================
OpenSearch Build Hamster says Hello!
Gradle Version : 8.1.1
OS Info : Mac OS X 13.4 (x86_64)
Runtime JDK Version : 17 (Eclipse Temurin JDK)
Runtime java.home : /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home
Gradle JDK Version : 17 (Eclipse Temurin JDK)
Gradle java.home : /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home
Random Testing Seed : EDE78F238815CE08
In FIPS 140 mode : false
=======================================
> Task :distribution:bwc:maintenance:checkoutBwcBranch
Performing checkout of opensearch-project/1.3...
Checkout hash for :distribution:bwc:maintenance is d6c06d2a93614174c76487a0e7b3280d2311cf67
> Task :qa:mixed-cluster:v1.3.12#mixedClusterTest
Test cluster endpoints are: [::1]:53999,127.0.0.1:54000,[::1]:54006,127.0.0.1:54007,[::1]:54014,127.0.0.1:54015,[::1]:54019,127.0.0.1:54020
Upgrading one node to create a mixed cluster
Upgrade complete, endpoints are: [::1]:54246,127.0.0.1:54247,[::1]:54006,127.0.0.1:54007,[::1]:54014,127.0.0.1:54015,[::1]:54019,127.0.0.1:54020
Upgrading another node to create a mixed cluster
Upgrading complete, endpoints are: [::1]:54246,127.0.0.1:54247,[::1]:54356,127.0.0.1:54357,[::1]:54014,127.0.0.1:54015,[::1]:54019,127.0.0.1:54020
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/Users/singhnjb/OpenSearch/test/framework/build/distributions/framework-2.9.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/Users/singhnjb/.gradle/wrapper/dists/gradle-8.1.1-all/bs1rrjki8hh9bujwbsqnxtuzr/gradle-8.1.1/lib/plugins/gradle-testing-base-8.1.1.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release
BUILD SUCCESSFUL in 1m 29s
189 actionable tasks: 7 executed, 182 up-to-date
The original issue where these failures were reported was against commit https://github.com/opensearch-project/OpenSearch/commit/48905487f6859c7844105cd831ab1a0fc810a92e, dated July 12. A few commits after that (reverts?) might have fixed them.
Closing the issue as it is not reproducible. CC @dblock
I was also unable to reproduce this on the 2.x branch.
./gradlew ':qa:mixed-cluster:v1.3.12#mixedClusterTest'
BUILD SUCCESSFUL in 12m 45s
Describe the bug
https://github.com/opensearch-project/OpenSearch/issues/8662 https://build.ci.opensearch.org/job/gradle-check/20468