Reduce the size of the `word_pair_proximity` database #634

Closed · loiclec closed this issue 2 years ago

loiclec commented 2 years ago

Currently, given the following text:

The lazy dog

The keys of the word_pair_proximity_docids database are:

dog  the  3
dog  lazy 2
lazy dog  1
lazy the  2
the  dog  2
the  lazy 1

For each pair of words, we store both the "rightward" proximity and the "leftward" proximity, which is equal to the rightward proximity + 1.

I am wondering whether we could reduce the size of the database by only storing the rightward proximity. Given the same text, the database would contain only the keys:

lazy dog  1
the  dog  2
the  lazy 1
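
To make the extraction concrete, here is a minimal sketch in Rust of how the rightward-only keys could be produced from a tokenized field. It is a simplification (proximity is just the positional distance, and `MAX_PROXIMITY` is a hypothetical cap, not necessarily the value milli actually uses):

```rust
use std::collections::BTreeSet;

// Hypothetical cap on stored proximities; milli's real limit may differ.
const MAX_PROXIMITY: usize = 7;

/// Produce `(word1, word2, proximity)` keys where `word1` appears
/// before `word2`, i.e. only the rightward proximity is stored.
fn rightward_pairs(words: &[&str]) -> BTreeSet<(String, String, usize)> {
    let mut keys = BTreeSet::new();
    for i in 0..words.len() {
        for j in (i + 1)..words.len() {
            let prox = j - i;
            if prox > MAX_PROXIMITY {
                break;
            }
            keys.insert((words[i].to_string(), words[j].to_string(), prox));
        }
    }
    keys
}

fn main() {
    // "The lazy dog" yields exactly the three keys listed above:
    // lazy dog 1, the dog 2, the lazy 1
    for (w1, w2, prox) in rightward_pairs(&["the", "lazy", "dog"]) {
        println!("{w1} {w2} {prox}");
    }
}
```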

This would help in two ways:

  1. it would reduce the size of the database on disk, which could be important because it is a very large database
  2. it would also speed up indexing significantly

I don't think we would lose any information. In fact, it seems like we would have more information at our disposal to tweak the search results. For example, we could restrict a search to documents where word1 is followed by word2, or tweak the "proximity penalty" of backward word pairs, which is currently fixed at +1 (I mean that given the word pair word1 .. word2 with a proximity of N, we also record word2 word1 with a proximity of N+1).

Another example:

ID 0:
run forest run

ID 1:
run in forest

currently yields:

forest in  2  [1]
forest run 1  [0]
forest run 3  [1]
in forest  1  [1]
in run     2  [1]
run forest 1  [0]
run forest 2  [1]
run in     1  [1]
run run    2  [0]

and with the suggested change, would yield:

run forest 1 [0]
run run    2 [0]
forest run 1 [0]
run in     1 [1]
run forest 2 [1]
in forest  1 [1]

To find the document ids where run is 2-close to forest, we currently search for run forest 2 and get:

run forest 2 [1]

With the new scheme, we need to execute three queries:

  1. run forest 2
  2. forest run 1
  3. run forest 1

Then we compute 1 | (2 - 3):

1. run forest 2 [1]
2. forest run 1 [0]
3. run forest 1 [0]

  1 | (2 - 3) 
= [1] | ([0] - [0])
= [1]
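
As a sketch, the formula could look like this in Rust, using `RoaringBitmap` (the docid set type milli uses) and a hypothetical `get` closure standing in for a lookup in word_pair_proximity_docids:

```rust
use roaring::RoaringBitmap;

/// Docids where `word1` is `prox`-close to `word2` under the
/// rightward-only scheme, following the 1 | (2 - 3) formula.
fn pair_proximity_docids(
    get: &impl Fn(&str, &str, u32) -> RoaringBitmap,
    word1: &str,
    word2: &str,
    prox: u32,
) -> RoaringBitmap {
    if prox == 1 {
        // A proximity of 0 does not exist: only the direct pair applies.
        return get(word1, word2, 1);
    }
    let q1 = get(word1, word2, prox);     // 1. direct pair at prox
    let q2 = get(word2, word1, prox - 1); // 2. swapped pair at prox - 1
    let q3 = get(word1, word2, prox - 1); // 3. direct pair at prox - 1
    &q1 | &(&q2 - &q3)                    // 1 | (2 - 3)
}
```

With the run/forest example above, this computes [1] | ([0] - [0]) = [1], matching the result of the single lookup in the current scheme.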

The idea is that we want the documents where run is followed by forest at a proximity of 2 (query 1), plus those where forest is followed by run at a proximity of 1, since a leftward proximity of 2 corresponds to a rightward proximity of 1 (query 2), minus the documents that already matched run forest at the lower proximity of 1 (query 3).

This would be slower of course, but I don't know if the performance loss would be noticeable.

ManyTheFish commented 2 years ago

Hello @loiclec,

First, I think the performance loss would rarely be noticeable. The reason is that in a lot of cases, your equation 1 | (2 - 3) can be simplified to 1 | 2, because we can consider that run 1-close to forest has already been processed in a previous iteration, and so the involved documents have already been removed from the current iteration.

Moreover, I think this improvement could enhance relevancy. Making the distinction between a swap and a distance of 2 would reduce the probability of returning false positives when computing a PHRASE (in a way, this function may be refactored with this change).

To conclude, I think this is a good improvement and I don't see any drawbacks in terms of relevancy. 😄

loiclec commented 2 years ago

That's good to hear :-)

Have you also thought about the impact on the word_prefix_pair_proximity_docids database and its usage? Since this database is computed from the word_pair_proximity_docids, it will also be smaller.

I am not too familiar with its exact role in the execution of a search query. I wonder if we'll also want to use the 1 | (2 - 3) formula there (or the simplification you talked about) or if we want to limit ourselves to searching for non-swapped word pairs (just 1 instead of 1 | (2 - 3)).

ManyTheFish commented 2 years ago

Hey @loiclec, I think it's ok as long as both databases follow the same logic; the resulting documents coming from these databases are mainly computed in search/criteria/mod.rs#L450-L524.

loiclec commented 2 years ago

Below is an update on this issue. It doesn't require any answer, I am mostly writing it here to document the process :)

Correctness of the word prefix pair database

@ManyTheFish and I discussed this issue a bit more and found that there is a problem with the word_prefix_pair_proximity_docids database following the proposed refactor. For example, if someone types:

New York ci

and we have a document in the dataset that contains:

the city of New York

then we would expect that document to be returned among the results based on the prefix ci matching city and being in close proximity to new and york. However, if ci is in the word_prefix_pair_proximity_docids database, then it would only appear as:

the ci 1

and not:

new ci 3
york ci 4

which is what we want.

To solve this problem, we saw two potential solutions:

  1. drop the word_prefix_pair_proximity_docids database and always find the words corresponding to a prefix via the FST
  2. create another database, equivalent to word_prefix_pair_proximity_docids except that the prefixes are on the left, maybe called prefix_word_pair_proximity_docids.

I have started implementing (2) in #639.
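
As a sketch of how a search for a word followed by a typed prefix could combine the two databases under option (2), with hypothetical lookup closures standing in for the actual database reads:

```rust
use roaring::RoaringBitmap;

/// Docids where `word` is `prox`-close to a word starting with `prefix`.
/// `word_prefix` looks up (word, prefix, prox) keys in
/// word_prefix_pair_proximity_docids, and `prefix_word` looks up
/// (prefix, word, prox) keys in the proposed prefix_word_pair_proximity_docids.
fn word_prefix_proximity_docids(
    word_prefix: &impl Fn(&str, &str, u32) -> RoaringBitmap,
    prefix_word: &impl Fn(&str, &str, u32) -> RoaringBitmap,
    word: &str,
    prefix: &str,
    prox: u32,
) -> RoaringBitmap {
    // Direct case: `word` appears before a word matching `prefix`.
    let mut docids = word_prefix(word, prefix, prox);
    if prox > 1 {
        // Swapped case: the prefixed word appears first, e.g. the key
        // `ci york 3` covers "york 4-close to ci" in "the city of New York".
        docids |= prefix_word(prefix, word, prox - 1);
    }
    docids
}
```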

Correctness of the formula I originally proposed

Furthermore, I also realised that my formula 1 | (2 - 3) is pretty much completely wrong. For example:

TEXT: the man who sometimes plays the violin 

ARGS: man the 4

A: ("man", "the", 4) => true
B: ("the", "man", 3) => false
C: ("man", "the", 3) => false

Actual result: A | (B - C) => true

Correct result: false (because `the man` means `man` has a proximity of 2 with `the`)

The point is that subtracting 3 to remove the false positives is not enough. A better implementation is to:

  1. Fetch the docids for (word1, word2, prox)
    • Subtract from those all the docids for (word2, word1, p) where p < prox - 1
  2. Fetch the docids for (word2, word1, prox - 1)
    • Subtract from those all the docids for (word1, word2, p) where p < prox
  3. Union the results of 1 and 2

I am still not certain that this is 100% correct, or that it is the fastest way to compute it.
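
A sketch of that procedure, again with a hypothetical `get` lookup into the rightward-only database:

```rust
use roaring::RoaringBitmap;

/// Docids where `word1` is exactly `prox`-close to `word2`, removing
/// documents that already matched at a lower proximity.
fn exact_pair_proximity_docids(
    get: &impl Fn(&str, &str, u32) -> RoaringBitmap,
    word1: &str,
    word2: &str,
    prox: u32,
) -> RoaringBitmap {
    // 1. Direct pair at prox, minus (word2, word1, p) for p < prox - 1.
    let mut forward = get(word1, word2, prox);
    for p in 1..prox.saturating_sub(1) {
        forward -= get(word2, word1, p);
    }
    // 2. Swapped pair at prox - 1, minus (word1, word2, p) for p < prox.
    let mut backward = if prox > 1 {
        get(word2, word1, prox - 1)
    } else {
        RoaringBitmap::new()
    };
    for p in 1..prox {
        backward -= get(word1, word2, p);
    }
    // 3. Union of the two.
    forward | backward
}
```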

However, it doesn't really matter because, as @ManyTheFish said:

The reason is that in a lot of cases, your equation 1 | (2 - 3) can be simplified to 1 | 2, because we can consider that run 1-close to forest has already been processed in a previous iteration, and so the involved documents have already been removed from the current iteration.

So we'll just use 1 | 2 as originally proposed and see if any problems arise.
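
In code, the adopted simplification reduces to the union of the two remaining lookups (a sketch, with the same hypothetical `get` as above):

```rust
use roaring::RoaringBitmap;

/// The 1 | 2 simplification: direct pair at `prox` unioned with the
/// swapped pair at `prox - 1`, relying on earlier iterations having
/// already consumed the lower-proximity documents.
fn pair_proximity_docids_simplified(
    get: &impl Fn(&str, &str, u32) -> RoaringBitmap,
    word1: &str,
    word2: &str,
    prox: u32,
) -> RoaringBitmap {
    if prox == 1 {
        return get(word1, word2, 1);
    }
    &get(word1, word2, prox) | &get(word2, word1, prox - 1)
}
```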

Benchmarks

I launched some benchmarks where nothing changed except that we don't store leftward proximities anymore, to get an idea of the potential performance improvements made possible by this change. The results are:

[Benchmark figure: performance comparison between old milli and new milli; relative time to index is 1.5 with the old version versus 1.0 with the new one.]

I would be very happy if we can really reduce the time to index the wiki dataset from 14.5 minutes to 9.4 minutes :)