Pretty sure I've tracked this down to commit bc74731a537c461cd1d7666bc3be82fbe18e143d. I can't reproduce the error on the commit before that, but after bc74731a537c461cd1d7666bc3be82fbe18e143d I can reproduce within 10 or 20 attempts. It is still reproducible after the latest commit on main. It seems to fail for both concurrent and non-concurrent cases. @neetikasinghal can you take a look?
Was able to confirm on the test seed above, D490856B1361B346.
Tests pass when I comment out the following:
https://github.com/opensearch-project/OpenSearch/blob/5b505ec04c2684867901d1cabfed8431d7d34fe3/test/framework/src/main/java/org/opensearch/test/OpenSearchIntegTestCase.java#L1685-L1687
However, with those lines left in, the test is flaky. This is true for both the concurrent search enabled and disabled cases.
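For context, my reading of those lines is that the framework indexes some throwaway documents alongside the real ones and then deletes them, so the Lucene segments end up carrying deleted docs. The sketch below only illustrates that pattern; the ids, field, and count are made up and it is not the actual helper code:

```java
// Illustrative only -- not the actual framework code. The idea: extra docs are
// indexed and then deleted, so the segments keep deleted docs around (and they
// still count towards maxDoc) until a merge expunges them.
int bogusCount = randomIntBetween(1, 5); // hypothetical count
for (int i = 0; i < bogusCount; i++) {
    client().prepareIndex("test")
        .setId("bogus_" + i)                       // hypothetical ids
        .setSource("field1", "bogus filler text")  // hypothetical content
        .get();
}
for (int i = 0; i < bogusCount; i++) {
    client().prepareDelete("test", "bogus_" + i).get();
}
client().admin().indices().prepareRefresh("test").get();
```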
Query:
```json
{
  "common" : {
    "field1" : {
      "query" : "the huge fox",
      "high_freq_operator" : "OR",
      "low_freq_operator" : "OR",
      "cutoff_frequency" : 0.01,
      "minimum_should_match" : {
        "low_freq" : "2"
      },
      "boost" : 1.0
    }
  }
}
```
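For reference, this is roughly the same query expressed through the Java query builders (assuming the deprecated `CommonTermsQueryBuilder` fluent API; the field and text come from the test):

```java
import org.opensearch.index.query.CommonTermsQueryBuilder;
import org.opensearch.index.query.Operator;
import org.opensearch.index.query.QueryBuilders;

// Same query as the JSON above, built via the (deprecated) common terms query builder.
CommonTermsQueryBuilder query = QueryBuilders.commonTermsQuery("field1", "the huge fox")
    .highFreqOperator(Operator.OR)
    .lowFreqOperator(Operator.OR)
    .cutoffFrequency(0.01f)
    .lowFreqMinimumShouldMatch("2");
```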
Flaky Result:
```json
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.3157668,
    "hits": [{
      "_index": "test",
      "_id": "2",
      "_score": 1.3157668,
      "_source": {
        "field1": "the quick lazy huge brown fox jumps over the tree"
      }
    }, {
      "_index": "test",
      "_id": "1",
      "_score": 1.1068254,
      "_source": {
        "field1": "the quick brown fox"
      }
    }]
  }
}
```
Correct Result:
```json
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.3157668,
    "hits": [{
      "_index": "test",
      "_id": "2",
      "_score": 1.3157668,
      "_source": {
        "field1": "the quick lazy huge brown fox jumps over the tree"
      }
    }]
  }
}
```
A Match All query in the flaky scenario does show the expected 3 docs in the index and no bogus docs, but it seems like the score for doc _id 1 is somehow high enough to get returned in the flaky case sometimes.
Additionally, I did notice that this test is using a 1 shard / 1 replica configuration. After setting the index to 0 replicas I ran the test on the same seed D490856B1361B346 500 times and saw no failures; before that, around 20% of the runs would fail in this flaky manner.
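In case it helps anyone reproduce, this is roughly how a run can be pinned (the randomized-testing `@Seed`/`@Repeat` annotations are standard; the class name and the settings call are placeholders sketching the 0-replica setup, not the actual test):

```java
import com.carrotsearch.randomizedtesting.annotations.Repeat;
import com.carrotsearch.randomizedtesting.annotations.Seed;
import org.opensearch.common.settings.Settings;
import org.opensearch.test.OpenSearchIntegTestCase;

@Seed("D490856B1361B346") // pin the failing master seed
public class CommonTermsQueryReproIT extends OpenSearchIntegTestCase { // placeholder class name

    @Repeat(iterations = 500) // rerun the same test many times on that seed
    public void testCommonTermsQuery() throws Exception {
        // Create the index with no replicas; with this setting the 500 runs showed no failures.
        prepareCreate("test")
            .setSettings(Settings.builder().put("index.number_of_replicas", 0))
            .get();
        // ... index the docs and run the common terms query as in the original test ...
    }
}
```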
Current hypothesis is that the replica shard is not being refreshed even when the primary shard is, so search requests routed to the replica will encounter this issue. However, the `_refresh` API is supposed to be synchronous. I wonder if there is an issue with `_refresh` not waiting on the replicas then -- I'm not sure how else to explain why removing the replicas could fix the flakiness. @andrross @reta do you know of any issues like this related to `_refresh` off the top of your head?
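One cheap way to probe that hypothesis (a sketch only, not what the test currently does) would be to issue an explicit refresh and assert that it was acknowledged by every shard copy before searching:

```java
import org.opensearch.action.admin.indices.refresh.RefreshResponse;

// Sketch: explicit refresh, then check it actually reached every shard copy
// (primary and replica) before running the query.
RefreshResponse refresh = client().admin().indices().prepareRefresh("test").get();
assertEquals(0, refresh.getFailedShards());
assertEquals(refresh.getTotalShards(), refresh.getSuccessfulShards());
```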
From TRACE logs I can see that there is test flakiness even when the search request is hitting the primary. Attaching some sample logs:
testCommonTermsQuery_1.txt testCommonTermsQuery_2.txt
Both are flaky test examples; in the first we see the search request hit the replica, and in the second we see the search request hit the primary. Quick excerpt:
[2023-11-15T19:32:50,238][TRACE][o.o.a.a.i.r.TransportShardRefreshAction] [node_s5] [test][0] refresh request executed on primary
[2023-11-15T19:32:50,318][TRACE][o.o.a.a.i.r.TransportShardRefreshAction] [node_s3] [test][0] refresh request executed on replica
[2023-11-15T19:32:50,325][DEBUG][o.o.s.q.QueryPhaseSearcherWrapper] [node_s3] Using non-concurrent aggregation processor over segments for request with context id [-ftBpaDxQ6-s9uvgqbm-1A][1]
[2023-11-15T19:32:50,325][DEBUG][o.o.s.q.QueryPhaseSearcherWrapper] [node_s3] Using non-concurrent search over segments for request with context id [-ftBpaDxQ6-s9uvgqbm-1A][1]
[2023-11-15T19:32:50,325][TRACE][o.o.s.f.FetchPhase ] [node_s3] [test][0] source[{"timeout":"1d","query":{"match_all":{"boost":1.0}}}], id[],
cluster uuid: 0G3YDLHTTpmWJZrGFEBIwA [committed: true]
version: 25
state uuid: MmpCNYRNRHGpyFVnnr2LBw
from_diff: false
meta data version: 20
coordination_metadata:
term: 1
last_committed_config: VotingConfiguration{qyM6N6JnTzC20m5eT6YiNQ,VewvEbaXQMeeqnL9EVM1cw,xeiWmyoIStSZl02qvB3WsA}
last_accepted_config: VotingConfiguration{qyM6N6JnTzC20m5eT6YiNQ,VewvEbaXQMeeqnL9EVM1cw,xeiWmyoIStSZl02qvB3WsA}
voting tombstones: []
[test/VYTqcztNQ8mZP5spA2-HmQ]: v[5], mv[2], sv[1], av[1]
0: p_term [1], isa_ids [oIyDaBU5RJCXL98jM8pOHg, xRjt2CCHQuiHlKD45x631A]
metadata customs:
index-graveyard: IndexGraveyard[[[index=[test/RxQIvuKTRjSPkdTQFWDwqA], deleteDate=2023-11-16T01:32:48.222Z]]]
nodes:
{node_s4}{OhN8Eeh3SuiVD7m3P0ly7Q}{_UyNQsmGSXehWfb1VP1XxA}{127.0.0.1}{127.0.0.1:61514}{dir}{shard_indexing_pressure_enabled=true}
{node_s5}{pWGsJ26ZTSmfrCNxbu-KoQ}{ykC9RMfESaGahq9O7sFFyw}{127.0.0.1}{127.0.0.1:61513}{dir}{shard_indexing_pressure_enabled=true}
{node_s1}{xeiWmyoIStSZl02qvB3WsA}{yI5sjc5dRxmWyVaz3t3pxA}{127.0.0.1}{127.0.0.1:61517}{imr}{shard_indexing_pressure_enabled=true}
{node_s0}{qyM6N6JnTzC20m5eT6YiNQ}{9x3UO5FBTu6JcDHHjxzGAQ}{127.0.0.1}{127.0.0.1:61515}{imr}{shard_indexing_pressure_enabled=true}, local, cluster-manager
{node_s2}{VewvEbaXQMeeqnL9EVM1cw}{f9oUv8jDQpux45DKb3cF2A}{127.0.0.1}{127.0.0.1:61512}{imr}{shard_indexing_pressure_enabled=true}
{node_s3}{9SWyj4IxQoKQhSS_1-SQNQ}{YIS1ZCFbQROMwtYM1dq4xg}{127.0.0.1}{127.0.0.1:61516}{dir}{shard_indexing_pressure_enabled=true}
routing_table (version 11):
-- index [[test/VYTqcztNQ8mZP5spA2-HmQ]]
----shard_id [test][0]
--------[test][0], node[pWGsJ26ZTSmfrCNxbu-KoQ], [P], s[STARTED], a[id=oIyDaBU5RJCXL98jM8pOHg]
--------[test][0], node[9SWyj4IxQoKQhSS_1-SQNQ], [R], s[STARTED], a[id=xRjt2CCHQuiHlKD45x631A]
routing_nodes:
-----node_id[pWGsJ26ZTSmfrCNxbu-KoQ][V]
--------[test][0], node[pWGsJ26ZTSmfrCNxbu-KoQ], [P], s[STARTED], a[id=oIyDaBU5RJCXL98jM8pOHg]
-----node_id[9SWyj4IxQoKQhSS_1-SQNQ][V]
--------[test][0], node[9SWyj4IxQoKQhSS_1-SQNQ], [R], s[STARTED], a[id=xRjt2CCHQuiHlKD45x631A]
-----node_id[OhN8Eeh3SuiVD7m3P0ly7Q][V]
---- unassigned
[2023-11-15T19:32:53,886][TRACE][o.o.a.a.i.r.TransportShardRefreshAction] [node_s5] [test][0] refresh request executed on primary
[2023-11-15T19:32:53,980][TRACE][o.o.a.a.i.r.TransportShardRefreshAction] [node_s3] [test][0] refresh request executed on replica
[2023-11-15T19:32:53,988][DEBUG][o.o.s.q.QueryPhaseSearcherWrapper] [node_s5] Using non-concurrent aggregation processor over segments for request with context id [m2pJBLRpTdKCot2Vjykfzg][5]
[2023-11-15T19:32:53,988][DEBUG][o.o.s.q.QueryPhaseSearcherWrapper] [node_s5] Using non-concurrent search over segments for request with context id [m2pJBLRpTdKCot2Vjykfzg][5]
[2023-11-15T19:32:53,989][TRACE][o.o.s.f.FetchPhase ] [node_s5] [test][0] source[{"query":{"common":{"field1":{"query":"the huge fox","high_freq_operator":"OR","low_freq_operator":"OR","cutoff_frequency":0.01,"minimum_should_match":{"low_freq":"2"},"boost":1.0}}}}], id[],
cluster uuid: 0G3YDLHTTpmWJZrGFEBIwA [committed: true]
version: 49
state uuid: xWMpjrUtT_Gyv5QeRAdHig
from_diff: false
meta data version: 42
coordination_metadata:
term: 1
last_committed_config: VotingConfiguration{qyM6N6JnTzC20m5eT6YiNQ,VewvEbaXQMeeqnL9EVM1cw,xeiWmyoIStSZl02qvB3WsA}
last_accepted_config: VotingConfiguration{qyM6N6JnTzC20m5eT6YiNQ,VewvEbaXQMeeqnL9EVM1cw,xeiWmyoIStSZl02qvB3WsA}
voting tombstones: []
[test/o5D6DEvlS5e2pYuTaBIphQ]: v[5], mv[2], sv[1], av[1]
0: p_term [1], isa_ids [4eP3v5KMTreJgsW3xhmU_Q, oFE2grSERpGUPOylAATlmQ]
metadata customs:
index-graveyard: IndexGraveyard[[[index=[test/RxQIvuKTRjSPkdTQFWDwqA], deleteDate=2023-11-16T01:32:48.222Z], [index=[test/VYTqcztNQ8mZP5spA2-HmQ], deleteDate=2023-11-16T01:32:50.437Z], [index=[test/LnMqX2UcRSew8jYZFqbOPA], deleteDate=2023-11-16T01:32:52.198Z]]]
nodes:
{node_s4}{OhN8Eeh3SuiVD7m3P0ly7Q}{_UyNQsmGSXehWfb1VP1XxA}{127.0.0.1}{127.0.0.1:61514}{dir}{shard_indexing_pressure_enabled=true}
{node_s5}{pWGsJ26ZTSmfrCNxbu-KoQ}{ykC9RMfESaGahq9O7sFFyw}{127.0.0.1}{127.0.0.1:61513}{dir}{shard_indexing_pressure_enabled=true}
{node_s1}{xeiWmyoIStSZl02qvB3WsA}{yI5sjc5dRxmWyVaz3t3pxA}{127.0.0.1}{127.0.0.1:61517}{imr}{shard_indexing_pressure_enabled=true}
{node_s0}{qyM6N6JnTzC20m5eT6YiNQ}{9x3UO5FBTu6JcDHHjxzGAQ}{127.0.0.1}{127.0.0.1:61515}{imr}{shard_indexing_pressure_enabled=true}, local, cluster-manager
{node_s2}{VewvEbaXQMeeqnL9EVM1cw}{f9oUv8jDQpux45DKb3cF2A}{127.0.0.1}{127.0.0.1:61512}{imr}{shard_indexing_pressure_enabled=true}
{node_s3}{9SWyj4IxQoKQhSS_1-SQNQ}{YIS1ZCFbQROMwtYM1dq4xg}{127.0.0.1}{127.0.0.1:61516}{dir}{shard_indexing_pressure_enabled=true}
routing_table (version 25):
-- index [[test/o5D6DEvlS5e2pYuTaBIphQ]]
----shard_id [test][0]
--------[test][0], node[pWGsJ26ZTSmfrCNxbu-KoQ], [P], s[STARTED], a[id=oFE2grSERpGUPOylAATlmQ]
--------[test][0], node[9SWyj4IxQoKQhSS_1-SQNQ], [R], s[STARTED], a[id=4eP3v5KMTreJgsW3xhmU_Q]
routing_nodes:
-----node_id[pWGsJ26ZTSmfrCNxbu-KoQ][V]
--------[test][0], node[pWGsJ26ZTSmfrCNxbu-KoQ], [P], s[STARTED], a[id=oFE2grSERpGUPOylAATlmQ]
-----node_id[9SWyj4IxQoKQhSS_1-SQNQ][V]
--------[test][0], node[9SWyj4IxQoKQhSS_1-SQNQ], [R], s[STARTED], a[id=4eP3v5KMTreJgsW3xhmU_Q]
-----node_id[OhN8Eeh3SuiVD7m3P0ly7Q][V]
---- unassigned
It seems like there's an underlying problem with this query type itself rather than anything related to the new bogus docs indexing. I added a match-all query with both primary and replica preference before the test query here, and even in the flaky test cases the match-all query shows no bogus docs on either the replica or the primary.
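For anyone following along, a rough sketch of what I mean by checking both copies (pinning the search to a node via the `_only_nodes` preference; the node id values are just examples pulled from the logs above, and in practice they would be resolved from the cluster state):

```java
import org.opensearch.action.search.SearchResponse;
import org.opensearch.index.query.QueryBuilders;

// Illustrative: run a match_all against each shard copy by pinning the search to a node.
String primaryNodeId = "pWGsJ26ZTSmfrCNxbu-KoQ"; // example values taken from the logs above
String replicaNodeId = "9SWyj4IxQoKQhSS_1-SQNQ";
for (String nodeId : new String[] { primaryNodeId, replicaNodeId }) {
    SearchResponse resp = client().prepareSearch("test")
        .setQuery(QueryBuilders.matchAllQuery())
        .setPreference("_only_nodes:" + nodeId)
        .get();
    // In the flaky runs both copies still report the expected 3 docs and no bogus docs.
    assertEquals(3L, resp.getHits().getTotalHits().value);
}
```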
Thanks @jed326. Today @msfroh and I spent some time on this. The addition of extra deleted docs does indeed change the behavior of one of the queries in a really non-obvious way that was difficult to track down. I'll post a PR tomorrow that should fix the test.
**Describe the bug**
These test cases are flaky -

**To Reproduce**

**Expected behavior**
The test must always pass for both cases.

**Additional context**
https://build.ci.opensearch.org/job/gradle-check/29721/