Open fulmicoton opened 1 year ago
Lucene implements this idea via the common grams filter: https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/commongrams/package-summary.html. That said, it's not especially easy to integrate with Lucene's query parsers, and I don't recall seeing anyone use it.
We have a long-standing issue in Elasticsearch about integrating the index_phrases option with this common grams filter, which we never got around to: https://github.com/elastic/elasticsearch/issues/31427.
Motivated by https://github.com/quickwit-oss/search-benchmark-game/pull/51
I am writing down here an old idea that I have never implemented but that could really help. I think it is rather common for small indexes with high QPS to index n-grams to accelerate phrase queries. Unfortunately, this is very expensive, and therefore impractical for most use cases.
A great bang for the buck, however, would be to index only the two-grams that start with a stop word. Of course, this requires an asymmetric pair of tokenizers, one on the indexing side and one on the query side.
"the happy tree friends" then requires indexing ["the", "the happy", "happy", "tree", "friends"]. The query then becomes
PhraseQuery["the happy", "tree", "friends"]
This avoids dealing with the super costly "the" posting list.
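
To make the asymmetry concrete, here is a minimal sketch of the two tokenizers and of a naive positional phrase match on top of them. It is not tantivy or Lucene code: `STOP_WORDS`, `index_tokens`, `query_tokens`, `build_index` and `phrase_search` are all hypothetical names made up for illustration. It also assumes that every unigram (including "happy") is still indexed, so that phrases that do not start at a stop word keep working, and that a two-gram shares the position of its stop word.

```python
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = frozenset({"the", "a", "of", "to", "in"})


def index_tokens(text):
    """Index side: emit every unigram, plus a two-gram for each stop word
    that is followed by another word. Returns (term, position) pairs;
    a two-gram shares the position of its stop word."""
    words = text.lower().split()
    out = []
    for pos, word in enumerate(words):
        out.append((word, pos))
        if word in STOP_WORDS and pos + 1 < len(words):
            out.append((f"{word} {words[pos + 1]}", pos))
    return out


def query_tokens(phrase):
    """Query side: replace each "stop word + next word" pair by its two-gram,
    so the stop word's own posting list is never read."""
    words = phrase.lower().split()
    out = []
    pos = 0
    while pos < len(words):
        if words[pos] in STOP_WORDS and pos + 1 < len(words):
            out.append((f"{words[pos]} {words[pos + 1]}", pos))
            pos += 2  # the next word is already covered by the two-gram
        else:
            out.append((words[pos], pos))
            pos += 1
    return out


def build_index(docs):
    """Map each term to {doc_id: set of positions}."""
    postings = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for term, pos in index_tokens(text):
            postings[term][doc_id].add(pos)
    return postings


def phrase_search(postings, phrase):
    """Return the sorted ids of the documents that contain the phrase."""
    terms = query_tokens(phrase)
    if not terms:
        return []
    first_term, _ = terms[0]  # the first query term always has offset 0
    hits = []
    for doc_id, positions in postings.get(first_term, {}).items():
        for start in positions:
            # Every other term must appear at its offset from the start.
            if all(
                start + offset in postings.get(term, {}).get(doc_id, ())
                for term, offset in terms[1:]
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "the happy tree friends",
    2: "the sad tree friends",
    3: "happy tree friends of the forest",
}
postings = build_index(docs)
print(query_tokens("the happy tree friends"))
print(phrase_search(postings, "the happy tree friends"))
```

In this sketch, a stop word at the very end of a phrase still falls back to its unigram posting list, since there is no following word to pair it with.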