opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock flaky #10006

Open sohami opened 8 months ago

sohami commented 8 months ago

Describe the bug Test org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock is flaky

To Reproduce

Sep 11, 2023 1:44:23 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#339,opensearch[node_t2][clusterApplierService#updateTask][T#1],5,TGRP-MinimumClusterManagerNodesIT]
java.lang.AssertionError: a started primary with non-pending operation term must be in primary mode [test][2], node[IADuWGkCTpuWEnWUFcbkSQ], [P], s[STARTED], a[id=oar4Dv6STMWSzO-FDH4bMA]
    at __randomizedtesting.SeedInfo.seed([7E7C985F304948B0]:0)
    at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:752)
    at org.opensearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:710)
    at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:650)
    at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:293)
    at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606)
    at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593)
    at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561)
    at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484)
    at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1623)

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.seed=7E7C985F304948B0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-SD -Dtests.timezone=Europe/Lisbon -Druntime.java=20
NOTE: leaving temporary files on disk at: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.cluster.MinimumClusterManagerNodesIT_7E7C985F304948B0-001
NOTE: test params are: codec=Asserting(Lucene95), sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=ar-SD, timezone=Europe/Lisbon
NOTE: Linux 5.15.0-1039-aws amd64/Eclipse Adoptium 20.0.2 (64-bit)/cpus=32,threads=1,free=204825744,total=536870912
NOTE: All tests run in this JVM: [PendingTasksBlocksIT, GetIndexIT, ActiveShardsObserverIT, MinimumClusterManagerNodesIT]
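For readers unfamiliar with the assertion message in the trace above, here is a deliberately simplified, hypothetical model of the invariant it enforces. None of these names are OpenSearch's actual classes; the real check lives in `IndexShard.updateShardState`:

```java
// Hypothetical, simplified sketch (not OpenSearch's real IndexShard) of the
// invariant that trips above: once the routing table says a shard is a
// STARTED primary and there is no pending primary-term bump, the shard's
// replication tracker must already be operating in primary mode.
class ShardInvariantSketch {
    enum ShardState { INITIALIZING, STARTED }

    static boolean invariantHolds(boolean routingIsPrimary,
                                  ShardState state,
                                  long pendingPrimaryTerm,
                                  long operationPrimaryTerm,
                                  boolean trackerInPrimaryMode) {
        boolean pendingTermBump = pendingPrimaryTerm != operationPrimaryTerm;
        if (routingIsPrimary && state == ShardState.STARTED && !pendingTermBump) {
            // The flaky run hits exactly this case: a started primary whose
            // tracker is NOT in primary mode, so the assertion fires.
            return trackerInPrimaryMode;
        }
        return true; // the invariant only constrains started primaries
    }
}
```

In the failing run, `[test][2]` is reported as `[P], s[STARTED]` while its tracker is not in primary mode, which is the combination the sketch rejects.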

Expected behavior The test should always pass.

Plugins Standard


Host/Environment (please complete the following information): https://build.ci.opensearch.org/job/gradle-check/25287/testReport/junit/org.opensearch.cluster/MinimumClusterManagerNodesIT/testThreeNodesNoClusterManagerBlock/

Additional context https://build.ci.opensearch.org/job/gradle-check/25287/


I (@andrross) am adding the content from this comment to the description here because it has now been buried in the comment stream:

I believe I have traced this back to the commit that introduced the flakiness: 9119b6dc20ea11d95a399c68505f1d858b78e30e (#9105)

The following command will reliably reproduce the failure for me:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100

If I select the commit immediately preceding 9119b6dc20e then it does not reproduce.

This is a bit concerning because the commit in question is related to the remote store feature but MinimumClusterManagerNodesIT does not do anything related to remote store, so it is possible there is a significant regression here.

andrross commented 7 months ago

https://github.com/opensearch-project/OpenSearch/pull/10519#issuecomment-1754410824

shwetathareja commented 7 months ago

https://github.com/opensearch-project/OpenSearch/pull/10670#issuecomment-1776629574

amkhar commented 7 months ago

https://github.com/opensearch-project/OpenSearch/pull/10964#issuecomment-1782966285

ashking94 commented 7 months ago

https://github.com/opensearch-project/OpenSearch/pull/10986#issuecomment-1789416684

andrross commented 6 months ago

I believe I have traced this back to the commit that introduced the flakiness: 9119b6dc20ea11d95a399c68505f1d858b78e30e (#9105)

The following command will reliably reproduce the failure for me:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100

If I select the commit immediately preceding 9119b6dc20e then it does not reproduce.

This is a bit concerning because the commit in question is related to the remote store feature, but MinimumClusterManagerNodesIT does not do anything related to remote store, so it is possible there is a significant regression here. @psychbot @gbbafna @Bukhtawar @ashking94 Sorry for the spam, folks, but you were all involved in the review of that PR, so I want to make sure you're aware.

peternied commented 6 months ago

andrross commented 3 months ago

Adding Storage:Remote label as it appears PR #9105 introduced this flakiness

peternied commented 3 months ago

@amkhar @gauravruhela @ramaran Over the past 30 days, this test has adversely affected 17 pull requests (PRs), including [#12459, #12382, #12376, #12375, #12374, #12267 (repeated), #12180, #12163 (repeated), #12151, #12133, #12117, #12111 (repeated)].

Please prioritize fixing this test or disabling the test case until it can be fixed.

reta commented 2 weeks ago

java.lang.AssertionError: Missing cluster-manager, expected nodes: [{node_t4}{4EedRXkRQVKI0fmGCb6Y1Q}{rsoAXMTPQNW7slygLMSvQQ}{127.0.0.1}{127.0.0.1:35601}{dimr}{shard_indexing_pressure_enabled=true}, {node_t3}{NDGm--CAR-6KZLASPentjg}{AYdxiLmJTPKxFQ6pCIBxsA}{127.0.0.1}{127.0.0.1:44999}{dimr}{shard_indexing_pressure_enabled=true}, {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}] and actual cluster states [cluster uuid: y97dDapYSby5Tqr8dZbPZA [committed: true]
version: 10
state uuid: R5lx6pcBSomVZDCx7jW65Q
from_diff: false
meta data version: 7
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,4EedRXkRQVKI0fmGCb6Y1Q,NDGm--CAR-6KZLASPentjg}
      last_accepted_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,4EedRXkRQVKI0fmGCb6Y1Q,NDGm--CAR-6KZLASPentjg}
      voting tombstones: []
   [test/5EwrhsdDT2ShsBqHn77r-A]: v[7], mv[2], sv[1], av[1]
      0: p_term [1], isa_ids [wYcVyMvcQdOEOfu8mNcWfg, J1Y7tiTKQuCNpyQDskxqMQ]
      1: p_term [1], isa_ids [wMFwg3TiQV6Bq7tS1dmSVQ, zs4uh3wmRqiQs8HO7VEzSQ]
      2: p_term [1], isa_ids [dUkUHeD9QGuCY8do7NSPJg, q-h4jojQSkyi2z-2EroIlQ]
metadata customs:
   index-graveyard: IndexGraveyard[[]]
blocks: 
   _global_:
      2,no cluster-manager, blocks WRITE,METADATA_WRITE
nodes: 
   {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}, local
   {node_t1}{NDGm--CAR-6KZLASPentjg}{XHgAw-wyToy7ZtQ8rsTQ9g}{127.0.0.1}{127.0.0.1:41977}{dimr}{shard_indexing_pressure_enabled=true}
   {node_t0}{4EedRXkRQVKI0fmGCb6Y1Q}{tiXuHCRERKC7OsVyg0BuTg}{127.0.0.1}{127.0.0.1:37015}{dimr}{shard_indexing_pressure_enabled=true}
routing_table (version 7):
-- index [[test/5EwrhsdDT2ShsBqHn77r-A]]
----shard_id [test][0]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
----shard_id [test][1]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][1], node[NDGm--CAR-6KZLASPentjg], [R], s[STARTED], a[id=wMFwg3TiQV6Bq7tS1dmSVQ]
----shard_id [test][2]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [R], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
--------[test][2], node[NDGm--CAR-6KZLASPentjg], [P], s[STARTED], a[id=dUkUHeD9QGuCY8do7NSPJg]

routing_nodes:
-----node_id[ymARqza7Q0eocUFwC_3sbQ][V]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [R], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
-----node_id[NDGm--CAR-6KZLASPentjg][V]
--------[test][2], node[NDGm--CAR-6KZLASPentjg], [P], s[STARTED], a[id=dUkUHeD9QGuCY8do7NSPJg]
--------[test][1], node[NDGm--CAR-6KZLASPentjg], [R], s[STARTED], a[id=wMFwg3TiQV6Bq7tS1dmSVQ]
---- unassigned
, cluster uuid: y97dDapYSby5Tqr8dZbPZA [committed: true]
version: 11
state uuid: KWBexCcJSmiAcjbR6BWO7w
from_diff: false
meta data version: 8
   coordination_metadata:
      term: 2
      last_committed_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,NDGm--CAR-6KZLASPentjg,4EedRXkRQVKI0fmGCb6Y1Q}
      last_accepted_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,NDGm--CAR-6KZLASPentjg,4EedRXkRQVKI0fmGCb6Y1Q}
      voting tombstones: []
   [test/5EwrhsdDT2ShsBqHn77r-A]: v[7], mv[2], sv[1], av[1]
      0: p_term [1], isa_ids [wYcVyMvcQdOEOfu8mNcWfg, J1Y7tiTKQuCNpyQDskxqMQ]
      1: p_term [1], isa_ids [wMFwg3TiQV6Bq7tS1dmSVQ, zs4uh3wmRqiQs8HO7VEzSQ]
      2: p_term [1], isa_ids [dUkUHeD9QGuCY8do7NSPJg, q-h4jojQSkyi2z-2EroIlQ]
metadata customs:
   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}, cluster-manager
   {node_t0}{4EedRXkRQVKI0fmGCb6Y1Q}{tiXuHCRERKC7OsVyg0BuTg}{127.0.0.1}{127.0.0.1:37015}{dimr}{shard_indexing_pressure_enabled=true}
   {node_t3}{NDGm--CAR-6KZLASPentjg}{AYdxiLmJTPKxFQ6pCIBxsA}{127.0.0.1}{127.0.0.1:44999}{dimr}{shard_indexing_pressure_enabled=true}, local
routing_table (version 8):
-- index [[test/5EwrhsdDT2ShsBqHn77r-A]]
----shard_id [test][0]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
----shard_id [test][1]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
----shard_id [test][2]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
--------[test][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]

routing_nodes:
-----node_id[ymARqza7Q0eocUFwC_3sbQ][V]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
-----node_id[NDGm--CAR-6KZLASPentjg][V]
---- unassigned
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
--------[test][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
, cluster uuid: _na_ [committed: false]
version: 0
state uuid: ceEdRNQoSp2wgPupsMUYdA
from_diff: false
meta data version: 0
   coordination_metadata:
      term: 0
      last_committed_config: VotingConfiguration{}
      last_accepted_config: VotingConfiguration{}
      voting tombstones: []
metadata customs:
   index-graveyard: IndexGraveyard[[]]
blocks: 
   _global_:
      1,state not recovered / initialized, blocks READ,WRITE,METADATA_READ,METADATA_WRITE,CREATE_INDEX
      2,no cluster-manager, blocks WRITE,METADATA_WRITE
nodes: 
   {node_t4}{4EedRXkRQVKI0fmGCb6Y1Q}{rsoAXMTPQNW7slygLMSvQQ}{127.0.0.1}{127.0.0.1:35601}{dimr}{shard_indexing_pressure_enabled=true}, local
routing_table (version 0):
routing_nodes:
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
---- unassigned
]
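The failure above boils down to the three nodes disagreeing about who the cluster-manager is: two states show an elected `node_t2`, while the third node's local state still carries the `no cluster-manager` block. A minimal, hypothetical illustration of that consistency condition (not OpenSearch's actual test helper) might look like:

```java
import java.util.*;

// Hypothetical sketch: the test waits for every node to agree on a single
// elected cluster-manager. A node whose local state still has the
// "no cluster-manager" block is modeled here as Optional.empty().
class ClusterManagerCheckSketch {
    static boolean allNodesAgree(Map<String, Optional<String>> managerSeenByNode) {
        Set<Optional<String>> distinct = new HashSet<>(managerSeenByNode.values());
        // Agreement means exactly one distinct answer, and it must name a node.
        return distinct.size() == 1 && distinct.iterator().next().isPresent();
    }
}
```

In the dump above, the state with version 0 (`cluster uuid: _na_`) corresponds to a node that would report `Optional.empty()`, which is why the assertion times out waiting for agreement.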