quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Old optimization idea for stop words #2194

Open fulmicoton opened 1 year ago

fulmicoton commented 1 year ago

Motivated by https://github.com/quickwit-oss/search-benchmark-game/pull/51

I am writing down here an old idea I never implemented but that could really help. I think it is rather common for small indexes with high QPS to index n-grams to accelerate phrase queries. Unfortunately, it is very expensive, and therefore impractical for most use cases.

A great bang for the buck, however, would be to only index two-grams that start with a stop word. Of course, this needs an asymmetric pair of tokenizers on the indexing and query sides.

Indexing "the happy tree friends" then requires emitting the tokens ["the", "the happy", "tree", "friends"]. The query then becomes PhraseQuery(["the happy", "tree", "friends"]).

This avoids dealing with the super costly "the" posting list.
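A minimal sketch of the asymmetric pair, as plain string processing rather than tantivy's actual Tokenizer API. The stop-word list is an arbitrary placeholder, and the choice to drop the unigram following a stop word (so "happy" is only reachable through "the happy") mirrors the token list in the example above:

```rust
/// Hypothetical stop-word list for illustration.
const STOP_WORDS: &[&str] = &["the", "a", "of", "to"];

fn is_stop(word: &str) -> bool {
    STOP_WORDS.contains(&word)
}

/// Index-side tokenizer: emits unigrams, plus a two-gram whenever a
/// stop word is followed by another token. The unigram right after a
/// stop word is skipped, matching the example token list above.
fn index_tokens(text: &str) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out = Vec::new();
    for (i, w) in words.iter().enumerate() {
        // Skip the unigram already covered by the preceding two-gram.
        let covered = i > 0 && is_stop(words[i - 1]);
        if !covered {
            out.push(w.to_string());
        }
        if is_stop(w) {
            if let Some(next) = words.get(i + 1) {
                out.push(format!("{w} {next}"));
            }
        }
    }
    out
}

/// Query-side tokenizer: collapses a stop word and its successor into
/// the two-gram, so the phrase query never touches the stop word's
/// posting list.
fn query_tokens(text: &str) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < words.len() {
        if is_stop(words[i]) && i + 1 < words.len() {
            out.push(format!("{} {}", words[i], words[i + 1]));
            i += 2; // the next word is covered by the two-gram
        } else {
            out.push(words[i].to_string());
            i += 1;
        }
    }
    out
}

fn main() {
    println!("{:?}", index_tokens("the happy tree friends"));
    println!("{:?}", query_tokens("the happy tree friends"));
}
```

Note that this is only a sketch of the token streams; a real implementation would also need to assign positions so that the two-gram occupies the stop word's position, keeping PhraseQuery's position arithmetic correct.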

jpountz commented 1 year ago

Lucene implements this idea via the common grams filter: https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/commongrams/package-summary.html. That said, it's not especially easy to integrate with Lucene's query parsers, and I don't recall seeing anyone use it.

We have a long-standing issue in Elasticsearch about integrating the index_phrases option with this common grams filter, but we never got to it: https://github.com/elastic/elasticsearch/issues/31427.