quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
12.04k stars 671 forks source link

Elective n-grams #314

Open fulmicoton opened 6 years ago

fulmicoton commented 6 years ago

n-grams can considerably accelerate PhraseQuery.

Let's say one is trying to search The quest for the holy grail... A phrase query will compute the intersection of the posting list associated to each of these words, and then refine the documents by checking if this is indeed a phrase query match.

This works by taking each term in their docfreq order (tf is assumed to be very correlated with docfreq), and checking if each new terms is at the right position for the terms visited so far.

We might accelerate interseciton and matching by indexing ngrams. It can be seen as a way of caching part of the phrase query. For instance, assuming we had indexed the quest with its respective positions, our phrase query would apply on ["The quest", for, the, holy, grail].

This problem can be entirely addressed using our tokenizer pipeline...

At indexing time, a tokenfilter could add an extra token and emit.

- text:"The", position_increment: 0, token_length: 1
- text:"The quest", position_increment: 2, token_length: 2
- text:"for", position_increment: 1, token_length: 1
- text:"the", position_increment: 1, token_length: 1
- text:"holy", position_increment: 1, token_length: 1
- text:"grail", position_increment: 1, token_length: 1

At query time, the tokenfilter would have a slightly different behavior and emit.

- text:"The quest", position_increment: 2, token_length: 2
- text:"for", position_increment: 1, token_length: 1
- text:"the", position_increment: 1, token_length: 1
- text:"holy", position_increment: 1, token_length: 1
- text:"grail", position_increment: 1, token_length: 1

So the TokenFilter would have a different behavior for search and for indexing. Note that this is similar to a synonym token filter situation.

This therefore requires adding something to the tokenizer API to tell whether we are in search mode or indexing mode as we build the pipeline.

Not all n-grams ?

Indexing all grams would be overkill for this problems. Ideally the TokenFilter should be configurable, by the user :

fulmicoton commented 6 years ago

Related to #291