meilisearch / milli

Search engine library for Meilisearch ⚡️
MIT License
464 stars 82 forks source link

Avoid a prefix-related worst-case scenario in the proximity criterion #733

Closed loiclec closed 1 year ago

loiclec commented 1 year ago

Pull Request

Related issue

Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118

What does this PR do?

When a query ends with a word and a prefix, such as:

word pr

Then we first determine whether pre could possibly be in the proximity prefix database before querying it. There are then three possibilities:

  1. pr is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of pre through the FST and query the regular proximity databases.

  2. pr is in the prefix cache but cannot be found in the proximity prefix databases. In this case, we partially disable the proximity ranking rule for the pair word pre. This is done as follows:

    1. Only find the documents where word is in proximity to pre exactly (no derivations)
    2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8
  3. pr is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases.

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:

  1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8
  2. For common prefixes of more than two letters: we no longer distinguish between any proximities
  3. For uncommon prefixes: nothing changes

Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query heard pr (IF pr is the prefix of more than 200 words in the dataset):

[
    { "text": "I heard there is a faster proximity criterion" },
    { "text": "I heard there is a faster but less relevant proximity criterion" }
]

Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro":

[
    { "text": "I heard there is a faster but less relevant proximity criterion" }
    { "text": "I heard there is a faster proximity criterion" },
]

But the following document would be considered more relevant than the two documents above:

{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " }

Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything.


Performance

I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the songs benchmark dataset.

1. 10x 'a': 
    - 640ms ⟹ 630ms                  = no significant difference
2. 10x 'b':
    - set-based: 4.47s ⟹ 7.42        = bad, ~2x regression
    - dynamic: 1s ⟹ 870 ms           = no significant difference
3. 'Someone I l':
    - set-based: 250ms ⟹ 12 ms       = very good, x20 speedup
    - dynamic: 21ms ⟹ 11 ms          = good, x2 speedup 
4. 'billie e':
    - set-based: 623ms ⟹ 2ms         = very good, x300 speedup 
    - dynamic: ~4ms ⟹ 4ms            = no difference
5. 'billie ei':
    - set-based: 57ms ⟹ 20ms         = good, ~2x speedup
    - dynamic: ~4ms ⟹ ~2ms.          = no significant difference
6. 'i am getting o' 
    - set-based: 300ms ⟹ 60ms        = very good, 5x speedup
    - dynamic: 30ms ⟹ 6ms            = very good, 5x speedup
7. 'prologue 1 a 1:
    - set-based: 3.36s ⟹ 120ms       = very good, 30x speedup
    - dynamic: 200ms ⟹ 30ms          = very good, 6x speedup
8. 'prologue 1 a 10':
    - set-based: 590ms ⟹ 18ms        = very good, 30x speedup 
    - dynamic: 82ms ⟹ 35ms           = good, ~2x speedup

Performance is often significantly better, but there is also one regression in the set-based implementation with the query b b b b b b b b b b.

loiclec commented 1 year ago

I am going to merge it now, because we want a milli release today, but I'll start my investigation of the performance regression today :) bors merge

bors[bot] commented 1 year ago

Build succeeded:

loiclec commented 1 year ago

The performance regression was due to a mistake in the previous implementation, which only considered the rightward proximity between the left and right word in some cases.