opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] org.opensearch.backwards.IndexingIT.* fails #8779

Closed dblock closed 1 year ago

dblock commented 1 year ago

Describe the bug

https://github.com/opensearch-project/OpenSearch/issues/8662 https://build.ci.opensearch.org/job/gradle-check/20468

- org.opensearch.backwards.IndexingIT.testIndexVersionPropagation
- org.opensearch.backwards.IndexingIT.testUpdateSnapshotStatus
- org.opensearch.backwards.IndexingIT.testSeqNoCheckpoints
- org.opensearch.backwards.IndexingIT.classMethod

dreamer-89 commented 1 year ago

These test failures happen when upgrading from the latest minor of the previous major version, i.e. 1.3.12. From the limited logs, it appears that something catastrophic happened in the cluster, bringing down the primary (possibly all) shards and making them unavailable for any requests.

  1. org.opensearch.backwards.IndexingIT.testIndexVersionPropagation
REPRODUCE WITH: ./gradlew ':qa:mixed-cluster:v1.3.12#mixedClusterTest' --tests "org.opensearch.backwards.IndexingIT.testIndexVersionPropagation" -Dtests.seed=EDE78F238815CE08 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-OM -Dtests.timezone=Asia/Katmandu -Druntime.java=17
  2> org.opensearch.client.ResponseException: method [PUT], host [http://[::1]:39691], URI [indexversionprop/_doc/1], status line [HTTP/1.1 503 Service Unavailable]
    {"error":{"root_cause":[{"type":"unavailable_shards_exception","reason":"[indexversionprop][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[indexversionprop][0]] containing [index {[indexversionprop][_doc][1], source[{\"test\": \"test_BC\"}]}]]"}],"type":"unavailable_shards_exception","reason":"[indexversionprop][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[indexversionprop][0]] containing [index {[indexversionprop][_doc][1], source[{\"test\": \"test_BC\"}]}]]"},"status":503}
        at __randomizedtesting.SeedInfo.seed([EDE78F238815CE08:6C74DB032DE30857]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:335)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at app//org.opensearch.backwards.IndexingIT.indexDocs(IndexingIT.java:70)
        at app//org.opensearch.backwards.IndexingIT.indexDocWithConcurrentUpdates(IndexingIT.java:82)
  2. org.opensearch.backwards.IndexingIT.testUpdateSnapshotStatus

    org.opensearch.client.ResponseException: method [PUT], host [http://[::1]:42717], URI [test-snapshot-index/_doc/0], status line [HTTP/1.1 503 Service Unavailable]
    {"error":{"root_cause":[{"type":"unavailable_shards_exception","reason":"[test-snapshot-index][7] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[test-snapshot-index][7]] containing [index {[test-snapshot-index][_doc][0], source[{\"test\": \"test_Io\"}]}]]"}],"type":"unavailable_shards_exception","reason":"[test-snapshot-index][7] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[test-snapshot-index][7]] containing [index {[test-snapshot-index][_doc][0], source[{\"test\": \"test_Io\"}]}]]"},"status":503}
  3. org.opensearch.backwards.IndexingIT.testSeqNoCheckpoints

  4. org.opensearch.backwards.IndexingIT.classMethod

    java.lang.Exception: Test abandoned because suite timeout was reached.

Tests 1 and 2 fail with unavailable_shards_exception, possibly because of a catastrophic failure that left the shards unavailable.

Tests 3 and 4 failed due to timeouts. The timeout is reported while waiting for the result of an indexing operation. I suspect this is related to 1 and 2 above.

  2> "TEST-IndexingIT.testSeqNoCheckpoints-seed#[EDE78F238815CE08]" ID=232 WAITING on org.apache.http.concurrent.BasicFuture@798608c5
  2>    at java.base@17.0.7/java.lang.Object.wait(Native Method)
  2>    - waiting on org.apache.http.concurrent.BasicFuture@798608c5
  2>    at java.base@17.0.7/java.lang.Object.wait(Object.java:338)
  2>    at app//org.apache.http.concurrent.BasicFuture.get(BasicFuture.java:82)
  2>    at app//org.apache.http.impl.nio.client.FutureWrapper.get(FutureWrapper.java:70)
  2>    at app//org.opensearch.client.RestClient.performRequest(RestClient.java:328)
  2>    at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
  2>    at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
  2>    at app//org.opensearch.client.RestClient.performRequest(RestClient.java:351)
  2>    at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
  2>    at app//org.opensearch.backwards.IndexingIT.indexDocs(IndexingIT.java:70)
  2>    at app//org.opensearch.backwards.IndexingIT.testSeqNoCheckpoints(IndexingIT.java:204)
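
If the failure could be caught while the test cluster is still running, the "primary shard is not active" errors above could be narrowed down by asking a node about shard state directly. The following is a minimal sketch, not something run in this thread: it uses the standard cluster health, cat shards, and allocation explain REST endpoints, while the function name `diagnose_shard`, the `CURL` override, and the host in the usage line are assumptions made for illustration.

```shell
# Sketch: ask a test-cluster node why a primary shard is not active.
# CURL can be overridden (for example CURL=echo) to print the requests
# instead of sending them; this override is a hypothetical convenience.
CURL="${CURL:-curl}"

# Usage: diagnose_shard <host-url> <index> <shard-number>
diagnose_shard() {
  # Cluster status, with per-shard detail for the given index.
  $CURL -s "$1/_cluster/health/$2?level=shards&pretty"
  # One row per shard copy: its state and, if unassigned, the reason.
  $CURL -s "$1/_cat/shards/$2?v&h=index,shard,prirep,state,unassigned.reason"
  # Allocation explanation for the primary copy of the given shard.
  $CURL -s -H 'Content-Type: application/json' \
    "$1/_cluster/allocation/explain?pretty" \
    -d "{\"index\":\"$2\",\"shard\":$3,\"primary\":true}"
}
```

For the first failure above this might be called as `diagnose_shard 'http://localhost:9200' indexversionprop 0`, with the host replaced by one of the test cluster endpoints, which use ephemeral ports.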

dreamer-89 commented 1 year ago

This error is not reproducible on the latest 2.9 (https://github.com/opensearch-project/OpenSearch/commit/3a7c95a9112d79321afe158486025936f6d79282), with or without the given seed.

➜  OpenSearch git:(2.9) ./gradlew ':qa:mixed-cluster:v1.3.12#mixedClusterTest' --tests "org.opensearch.backwards.IndexingIT.testIndexVersionPropagation" -Dtests.seed=EDE78F238815CE08 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-OM -Dtests.timezone=Asia/Katmandu -Druntime.java=17

> Configure project :
Invalid Java installation found at '/Library/Java/JavaVirtualMachines/jdk-14.jdk/Contents/Home' (Java home). It will be re-checked in the next build. This might have performance impact if it keeps failing. Run the 'javaToolchains' task for more details.
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 8.1.1
  OS Info               : Mac OS X 13.4 (x86_64)
  Runtime JDK Version   : 17 (Eclipse Temurin JDK)
  Runtime java.home     : /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home
  Gradle JDK Version    : 17 (Eclipse Temurin JDK)
  Gradle java.home      : /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home
  Random Testing Seed   : EDE78F238815CE08
  In FIPS 140 mode      : false
=======================================

> Task :distribution:bwc:maintenance:checkoutBwcBranch
Performing checkout of opensearch-project/1.3...
Checkout hash for :distribution:bwc:maintenance is d6c06d2a93614174c76487a0e7b3280d2311cf67

> Task :qa:mixed-cluster:v1.3.12#mixedClusterTest
Test cluster endpoints are: [::1]:53999,127.0.0.1:54000,[::1]:54006,127.0.0.1:54007,[::1]:54014,127.0.0.1:54015,[::1]:54019,127.0.0.1:54020
Upgrading one node to create a mixed cluster
Upgrade complete, endpoints are: [::1]:54246,127.0.0.1:54247,[::1]:54006,127.0.0.1:54007,[::1]:54014,127.0.0.1:54015,[::1]:54019,127.0.0.1:54020
Upgrading another node to create a mixed cluster
Upgrading complete, endpoints are: [::1]:54246,127.0.0.1:54247,[::1]:54356,127.0.0.1:54357,[::1]:54014,127.0.0.1:54015,[::1]:54019,127.0.0.1:54020
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/Users/singhnjb/OpenSearch/test/framework/build/distributions/framework-2.9.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/Users/singhnjb/.gradle/wrapper/dists/gradle-8.1.1-all/bs1rrjki8hh9bujwbsqnxtuzr/gradle-8.1.1/lib/plugins/gradle-testing-base-8.1.1.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

BUILD SUCCESSFUL in 1m 29s
189 actionable tasks: 7 executed, 182 up-to-date

dreamer-89 commented 1 year ago

I see that the original issue where these failures were reported happened on commit https://github.com/opensearch-project/OpenSearch/commit/48905487f6859c7844105cd831ab1a0fc810a92e dated July 12. There were a few commits after that (reverts?) that might have fixed these.

dreamer-89 commented 1 year ago

Closing the issue as it is not reproducible. CC @dblock

noCharger commented 1 year ago

Also unable to reproduce this on the 2.x branch.

./gradlew ':qa:mixed-cluster:v1.3.12#mixedClusterTest'
BUILD SUCCESSFUL in 12m 45s
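
A single passing run does not rule out an intermittent failure, so one option is to repeat the reproduction command several times and stop at the first failure. The sketch below is an assumption about how that could be scripted, not something the commenters report doing; `repeat_until_fail` is a hypothetical helper name, and whether Gradle re-executes the test task on every iteration depends on the build's up-to-date checks, which this thread does not confirm.

```shell
# Run a command up to N times and stop at the first failure.
# Usage: repeat_until_fail <max_runs> <command> [args...]
repeat_until_fail() {
  max_runs="$1"; shift
  run=1
  while [ "$run" -le "$max_runs" ]; do
    echo "run ${run}/${max_runs}: $*"
    # Execute the command exactly as given; a non-zero exit ends the loop.
    if ! "$@"; then
      echo "failed on run ${run}" >&2
      return 1
    fi
    run=$((run + 1))
  done
  echo "all ${max_runs} runs passed"
}

# Example (not executed here): repeat the mixed-cluster suite five times
# without -Dtests.seed, so each run picks a fresh random seed.
# repeat_until_fail 5 ./gradlew ':qa:mixed-cluster:v1.3.12#mixedClusterTest' \
#   --tests "org.opensearch.backwards.IndexingIT.*"
```

The function returns a non-zero status as soon as one run fails, so the failing run's output is the last thing in the terminal and its seed can be copied from the REPRODUCE WITH line.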