Open danithaca opened 1 year ago
By the description of the issue I assume this works reliably on a 1-node setup?
You're using ES 7.9, so first we should figure out whether this was fixed by ES 7.10, and/or whether it's still broken in OpenSearch 2.x. Want to try and narrow it down with newer versions of the software?
If you're on AWS, do open a ticket with support; they might know of another case like this.
This is definitely still applicable in recent releases of OpenSearch.
The issue is that doc frequency for terms is evaluated per shard at the Lucene level and suggestions are returned if the term's frequency (on the given shard) is below the threshold. I'm pretty sure it doesn't depend on the node count.
I personally like @noCharger's approach 2 in https://github.com/opensearch-project/OpenSearch/issues/8174 to address this. The suggestions can come back from the shards, along with the evaluated term's frequency. If the sum of a term's frequencies across all the shards exceeds the new threshold parameter, then we don't offer suggestions (for that term).
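To make the difference concrete, here is a minimal sketch comparing the per-shard check with the aggregated check described above. It is a simplified model, not the actual Lucene or OpenSearch code: the shard frequencies, the function names, and the "at or below the threshold" rule are all assumptions for illustration.

```python
from typing import List

# Hypothetical per-shard document frequencies for the term "actor":
# 5 documents spread over 8 shards, so no shard holds more than one match.
shard_doc_freqs: List[int] = [1, 0, 1, 0, 1, 0, 0, 0]


def shard_offers_suggestions(local_freq: int, threshold: int) -> bool:
    """Simplified per-shard rule: suggest unless the local frequency exceeds the threshold."""
    return local_freq <= threshold


def per_shard_decision(freqs: List[int], threshold: int) -> bool:
    """Simplified current behavior: suggestions appear if any shard offers them."""
    return any(shard_offers_suggestions(f, threshold) for f in freqs)


def aggregated_decision(freqs: List[int], threshold: int) -> bool:
    """Sketch of the proposal: compare the summed frequency with the threshold."""
    return sum(freqs) <= threshold


if __name__ == "__main__":
    threshold = 1  # assumed absolute threshold, as in the max_term_freq = 1 example
    print("per-shard check offers suggestions:", per_shard_decision(shard_doc_freqs, threshold))
    print("aggregated check offers suggestions:", aggregated_decision(shard_doc_freqs, threshold))
```

In this model the term occurs in three documents overall, but every shard sees a local frequency of at most one, so the per-shard check never treats it as a known word. Summing the frequencies before comparing gives the answer the reporter expected.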
Describe the bug
We have a multi-shard setup and want to support query spell correction using the "Did you mean" TermSuggester, as described at https://opensearch.org/docs/latest/search-plugins/searching-data/did-you-mean/#term-suggester. However, it sometimes does not work as expected. For example, when searching for "actor" (a term that appears multiple times in the corpus), it is incorrectly corrected to "after", "action", or "altos" (see screenshot below).
To Reproduce
Step 1: Create a simple index, with 8 shards
Step 2: Ingest 5 documents. Note that "actor" appears multiple times in the corpus.
Step 3: Make a TermSuggester call, and note that "actor" is incorrectly corrected to "after" etc. We tried different values for max_term_freq, such as 10, 0, 0.01, and 0.0001, and the same problem persists.
Expected behavior
According to the documentation (https://opensearch.org/docs/latest/search-plugins/searching-data/did-you-mean/#term-suggester), we expect max_term_freq to control the point at which a term is considered a valid word rather than a typo to be corrected. For example, given that "actor" appears multiple times, if max_term_freq is set to 1, then "actor" should be treated as a valid word because it appears more than once, and it should not get corrected. However, max_term_freq doesn't seem to affect the results, and the suggestion is obviously wrong.
Plugins
N/A
Screenshots
Host/Environment (please complete the following information):
7.9.1
Additional context
Related issue: #4529
CC: @macohen @msfroh @noCharger
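For anyone trying to reproduce this, the request bodies for Step 1 and Step 3 might look roughly like the sketch below. The original requests are not shown in this thread, so the field name and the suggester label are hypothetical; only the 8-shard setting, the text "actor", and the max_term_freq parameter come from the description above.

```python
import json

# Hypothetical field name; the issue does not state which field was used.
FIELD = "title"

# Step 1: index settings with 8 primary shards (body for a create-index request).
create_index_body = {"settings": {"index": {"number_of_shards": 8}}}

# Step 3: term suggester request for the text "actor".
# "spell-check" is an arbitrary label chosen for this sketch.
suggest_body = {
    "suggest": {
        "spell-check": {
            "text": "actor",
            "term": {"field": FIELD, "max_term_freq": 1},
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(create_index_body, indent=2))
    print(json.dumps(suggest_body, indent=2))
```

The first body would be sent when creating the index and the second to the index's _search endpoint, after ingesting a handful of documents in which "actor" occurs several times.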