quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Old optimization idea for stop words #2194

Open fulmicoton opened 1 year ago

fulmicoton commented 1 year ago

Motivated by https://github.com/quickwit-oss/search-benchmark-game/pull/51

I am writing down here an old idea I never implemented but that could really help. I think it is rather common for small indexes with high QPS to index n-grams to accelerate phrase queries. Unfortunately, it is very expensive, and therefore impractical for most use cases.

A great bang for the buck, however, would be to only index two-grams that start with a stop word. Of course, this needs an asymmetric pair of tokenizers on the indexing and query sides.

Indexing "the happy tree friends" then requires emitting the tokens ["the", "the happy", "tree", "friends"]. The query then becomes PhraseQuery(["the happy", "tree", "friends"]).

This avoids dealing with the super costly "the" posting list.
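A minimal sketch of the asymmetric pair, as plain string processing rather than tantivy's actual Tokenizer API. The stop-word list is an arbitrary placeholder, and the choice to drop the unigram following a stop word (so "happy" is only reachable through "the happy") mirrors the token list in the example above:

```rust
/// Hypothetical stop-word list for illustration.
const STOP_WORDS: &[&str] = &["the", "a", "of", "to"];

fn is_stop(word: &str) -> bool {
    STOP_WORDS.contains(&word)
}

/// Index-side tokenizer: emits unigrams, plus a two-gram whenever a
/// stop word is followed by another token. The unigram right after a
/// stop word is skipped, matching the example token list above.
fn index_tokens(text: &str) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out = Vec::new();
    for (i, w) in words.iter().enumerate() {
        // Skip the unigram already covered by the preceding two-gram.
        let covered = i > 0 && is_stop(words[i - 1]);
        if !covered {
            out.push(w.to_string());
        }
        if is_stop(w) {
            if let Some(next) = words.get(i + 1) {
                out.push(format!("{w} {next}"));
            }
        }
    }
    out
}

/// Query-side tokenizer: collapses a stop word and its successor into
/// the two-gram, so the phrase query never touches the stop word's
/// posting list.
fn query_tokens(text: &str) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < words.len() {
        if is_stop(words[i]) && i + 1 < words.len() {
            out.push(format!("{} {}", words[i], words[i + 1]));
            i += 2; // the next word is covered by the two-gram
        } else {
            out.push(words[i].to_string());
            i += 1;
        }
    }
    out
}

fn main() {
    println!("{:?}", index_tokens("the happy tree friends"));
    println!("{:?}", query_tokens("the happy tree friends"));
}
```

Note that this is only a sketch of the token streams; a real implementation would also need to assign positions so that the two-gram occupies the stop word's position, keeping PhraseQuery's position arithmetic correct.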

jpountz commented 1 year ago

Lucene implements this idea via the common grams filter: https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/commongrams/package-summary.html. That said, it's not especially easy to integrate with Lucene's query parsers, and I don't recall seeing anyone use it.

We have a long-standing issue in Elasticsearch about integrating the index_phrases option with this common grams filter, but we never got to it: https://github.com/elastic/elasticsearch/issues/31427.