sohami opened 8 months ago
I believe I have traced this back to the commit that introduced the flakiness: 9119b6dc20ea11d95a399c68505f1d858b78e30e (#9105)
The following command will reliably reproduce the failure for me:
./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock" -Dtests.iters=100
If I check out the commit immediately preceding 9119b6dc20e, the failure does not reproduce.
This is a bit concerning because the commit in question relates to the remote store feature, yet MinimumClusterManagerNodesIT does nothing with remote store, so there may be a significant regression here. @psychbot @gbbafna @Bukhtawar @ashking94 Sorry for the spam, folks, but you were all involved in reviewing the PR, so I want to make sure you're aware.
Adding the Storage:Remote label, as it appears PR #9105 introduced this flakiness.
@amkhar @gauravruhela @ramaran Over the past 30 days, this test has adversely affected 17 pull requests (PRs), including [#12459, #12382, #12376, #12375, #12374, #12267 (repeated), #12180, #12163 (repeated), #12151, #12133, #12117, #12111 (repeated)].
Please prioritize fixing this test, or disable the test case until it can be fixed.
java.lang.AssertionError: Missing cluster-manager, expected nodes: [{node_t4}{4EedRXkRQVKI0fmGCb6Y1Q}{rsoAXMTPQNW7slygLMSvQQ}{127.0.0.1}{127.0.0.1:35601}{dimr}{shard_indexing_pressure_enabled=true}, {node_t3}{NDGm--CAR-6KZLASPentjg}{AYdxiLmJTPKxFQ6pCIBxsA}{127.0.0.1}{127.0.0.1:44999}{dimr}{shard_indexing_pressure_enabled=true}, {node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}] and actual cluster states [cluster uuid: y97dDapYSby5Tqr8dZbPZA [committed: true]
version: 10
state uuid: R5lx6pcBSomVZDCx7jW65Q
from_diff: false
meta data version: 7
coordination_metadata:
term: 1
last_committed_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,4EedRXkRQVKI0fmGCb6Y1Q,NDGm--CAR-6KZLASPentjg}
last_accepted_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,4EedRXkRQVKI0fmGCb6Y1Q,NDGm--CAR-6KZLASPentjg}
voting tombstones: []
[test/5EwrhsdDT2ShsBqHn77r-A]: v[7], mv[2], sv[1], av[1]
0: p_term [1], isa_ids [wYcVyMvcQdOEOfu8mNcWfg, J1Y7tiTKQuCNpyQDskxqMQ]
1: p_term [1], isa_ids [wMFwg3TiQV6Bq7tS1dmSVQ, zs4uh3wmRqiQs8HO7VEzSQ]
2: p_term [1], isa_ids [dUkUHeD9QGuCY8do7NSPJg, q-h4jojQSkyi2z-2EroIlQ]
metadata customs:
index-graveyard: IndexGraveyard[[]]
blocks:
_global_:
2,no cluster-manager, blocks WRITE,METADATA_WRITE
nodes:
{node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}, local
{node_t1}{NDGm--CAR-6KZLASPentjg}{XHgAw-wyToy7ZtQ8rsTQ9g}{127.0.0.1}{127.0.0.1:41977}{dimr}{shard_indexing_pressure_enabled=true}
{node_t0}{4EedRXkRQVKI0fmGCb6Y1Q}{tiXuHCRERKC7OsVyg0BuTg}{127.0.0.1}{127.0.0.1:37015}{dimr}{shard_indexing_pressure_enabled=true}
routing_table (version 7):
-- index [[test/5EwrhsdDT2ShsBqHn77r-A]]
----shard_id [test][0]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
----shard_id [test][1]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][1], node[NDGm--CAR-6KZLASPentjg], [R], s[STARTED], a[id=wMFwg3TiQV6Bq7tS1dmSVQ]
----shard_id [test][2]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [R], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
--------[test][2], node[NDGm--CAR-6KZLASPentjg], [P], s[STARTED], a[id=dUkUHeD9QGuCY8do7NSPJg]
routing_nodes:
-----node_id[ymARqza7Q0eocUFwC_3sbQ][V]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [R], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
-----node_id[NDGm--CAR-6KZLASPentjg][V]
--------[test][2], node[NDGm--CAR-6KZLASPentjg], [P], s[STARTED], a[id=dUkUHeD9QGuCY8do7NSPJg]
--------[test][1], node[NDGm--CAR-6KZLASPentjg], [R], s[STARTED], a[id=wMFwg3TiQV6Bq7tS1dmSVQ]
---- unassigned
, cluster uuid: y97dDapYSby5Tqr8dZbPZA [committed: true]
version: 11
state uuid: KWBexCcJSmiAcjbR6BWO7w
from_diff: false
meta data version: 8
coordination_metadata:
term: 2
last_committed_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,NDGm--CAR-6KZLASPentjg,4EedRXkRQVKI0fmGCb6Y1Q}
last_accepted_config: VotingConfiguration{ymARqza7Q0eocUFwC_3sbQ,NDGm--CAR-6KZLASPentjg,4EedRXkRQVKI0fmGCb6Y1Q}
voting tombstones: []
[test/5EwrhsdDT2ShsBqHn77r-A]: v[7], mv[2], sv[1], av[1]
0: p_term [1], isa_ids [wYcVyMvcQdOEOfu8mNcWfg, J1Y7tiTKQuCNpyQDskxqMQ]
1: p_term [1], isa_ids [wMFwg3TiQV6Bq7tS1dmSVQ, zs4uh3wmRqiQs8HO7VEzSQ]
2: p_term [1], isa_ids [dUkUHeD9QGuCY8do7NSPJg, q-h4jojQSkyi2z-2EroIlQ]
metadata customs:
index-graveyard: IndexGraveyard[[]]
nodes:
{node_t2}{ymARqza7Q0eocUFwC_3sbQ}{WhCTBd4tRa2tgW43N9mBnQ}{127.0.0.1}{127.0.0.1:42273}{dimr}{shard_indexing_pressure_enabled=true}, cluster-manager
{node_t0}{4EedRXkRQVKI0fmGCb6Y1Q}{tiXuHCRERKC7OsVyg0BuTg}{127.0.0.1}{127.0.0.1:37015}{dimr}{shard_indexing_pressure_enabled=true}
{node_t3}{NDGm--CAR-6KZLASPentjg}{AYdxiLmJTPKxFQ6pCIBxsA}{127.0.0.1}{127.0.0.1:44999}{dimr}{shard_indexing_pressure_enabled=true}, local
routing_table (version 8):
-- index [[test/5EwrhsdDT2ShsBqHn77r-A]]
----shard_id [test][0]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
----shard_id [test][1]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
----shard_id [test][2]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
--------[test][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
routing_nodes:
-----node_id[ymARqza7Q0eocUFwC_3sbQ][V]
--------[test][0], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=J1Y7tiTKQuCNpyQDskxqMQ]
--------[test][2], node[ymARqza7Q0eocUFwC_3sbQ], [P], s[STARTED], a[id=q-h4jojQSkyi2z-2EroIlQ]
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
--------[test][1], node[4EedRXkRQVKI0fmGCb6Y1Q], [P], s[STARTED], a[id=zs4uh3wmRqiQs8HO7VEzSQ]
--------[test][0], node[4EedRXkRQVKI0fmGCb6Y1Q], [R], s[STARTED], a[id=wYcVyMvcQdOEOfu8mNcWfg]
-----node_id[NDGm--CAR-6KZLASPentjg][V]
---- unassigned
--------[test][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
--------[test][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2024-05-17T14:05:51.567Z], delayed=false, details[node_left [NDGm--CAR-6KZLASPentjg]], allocation_status[no_attempt]]
, cluster uuid: _na_ [committed: false]
version: 0
state uuid: ceEdRNQoSp2wgPupsMUYdA
from_diff: false
meta data version: 0
coordination_metadata:
term: 0
last_committed_config: VotingConfiguration{}
last_accepted_config: VotingConfiguration{}
voting tombstones: []
metadata customs:
index-graveyard: IndexGraveyard[[]]
blocks:
_global_:
1,state not recovered / initialized, blocks READ,WRITE,METADATA_READ,METADATA_WRITE,CREATE_INDEX
2,no cluster-manager, blocks WRITE,METADATA_WRITE
nodes:
{node_t4}{4EedRXkRQVKI0fmGCb6Y1Q}{rsoAXMTPQNW7slygLMSvQQ}{127.0.0.1}{127.0.0.1:35601}{dimr}{shard_indexing_pressure_enabled=true}, local
routing_table (version 0):
routing_nodes:
-----node_id[4EedRXkRQVKI0fmGCb6Y1Q][V]
---- unassigned
]
Describe the bug
Test org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock is flaky.
Expected behavior
The test should always pass.
Plugins
Standard
Host/Environment
https://build.ci.opensearch.org/job/gradle-check/25287/testReport/junit/org.opensearch.cluster/MinimumClusterManagerNodesIT/testThreeNodesNoClusterManagerBlock/
Additional context
https://build.ci.opensearch.org/job/gradle-check/25287/
I (@andrross) have copied the bisection findings above into the description because the comment they came from had become buried in the comment stream.