quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
12.22k stars 679 forks source link

Change const parameter of bm25 #2195

Open fulmicoton opened 1 year ago

fulmicoton commented 1 year ago

Following https://github.com/quickwit-oss/search-benchmark-game/issues/45 we might want to update the BM25 constants to k1=0.9 and b=0.4

I don't think we want to make it configurable yet. (We have yet to meet a user who had a good rationale to tweak these numbers)

Tantivy cheats in its blockmax wand computation: we do not store all partial minimima for (term_freq, fieldnorm) but instead store the single best (term_freq, fieldnorm) for a given set of constants (k1, B) and a doc freq measure at the scale of the given segment.

The (k1, B) therefore needs to be known at indexing. The doc_freq part actually makes the result possibly a tiny bit inaccurate. While it is difficult to craft such an example, for an index with several segments it might be possible to have the doc_freq == segment_docfreq approximation affect, pick an inaccurate argmax, and miss the block containing the best doc through block pruning.

cpg314 commented 2 weeks ago

Some notes expanding on @fulmicoton's comment for future reference, to the best of my understanding:

856 added BlockWAND, which significantly improves performance on some queries.

The index contains the precomputed (freq, fieldnorm) with maximal tf scores: https://github.com/quickwit-oss/tantivy/blob/c71ea7b2effe6494c1a2b9e234db71f89255c06a/src/postings/serializer.rs#L390-L410 The tf_factor function https://github.com/quickwit-oss/tantivy/blob/c71ea7b2effe6494c1a2b9e234db71f89255c06a/src/query/bm25.rs#L192 calls https://github.com/quickwit-oss/tantivy/blob/c71ea7b2effe6494c1a2b9e234db71f89255c06a/src/query/bm25.rs#L58-L61 so that, as mentioned, the index depends on the K1 and B constants in https://github.com/quickwit-oss/tantivy/blob/c71ea7b2effe6494c1a2b9e234db71f89255c06a/src/query/bm25.rs#L8-L9 and a naive attempt at changing these constants dynamically or statically without regenerating the index will yield incorrect results.


(We have yet to meet a user who had a good rationale to tweak these numbers)

I'm not sure I understand this. The constants in the BM25 formula allow adjusting the length penalty (B) and keyword saturation ( K1), which I would imagine can depend on the corpus. The elastic blog does write:

It’s worth noting that picking b and k1 are generally not the first thing to do when your users aren’t finding documents quickly. The default values of b = 0.75 and k1 = 1.2 work pretty well for most corpuses, so you’re likely fine with the defaults.

but below

It’s been fairly well studied that there are no “best” b and k1 values for all data/queries. Users that do change the b and k1 parameters typically do so incrementally by evaluating each increment. The Rank Eval API in Elasticsearch can help with the evaluation stage.

The use case that brought me here was toying with searching the English Wikipedia articles. With the B=0.75 parameter, long articles seem too strongly penalized, and searching for "Chopin" only returns the Frederic Chopin article (462 occurences in 15k words) after many less relevant results (such as Chopin Year ). Reducing B (and reindexing) seems to solve this issue.