Hello @loiclec,
First, I think the performance loss would rarely be noticeable. The reason is that, in a lot of cases, your equation `1 | (2 - 3)` would be simplified to `1 | 2`, because we can consider that `run` being 1-close to `forest` has already been processed in a previous iteration, and so the involved documents have already been removed from the current iteration.
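For what it's worth, here is a minimal sketch of that simplification (the function and set names are invented for illustration; milli's actual code differs, though it does use the `roaring` crate for docid sets):

```rust
use roaring::RoaringBitmap;

/// Resolve "`run` is 2-close to `forest`" for one iteration of the proximity
/// criterion. `candidates` holds the docids still in play, i.e. documents
/// already returned at proximity 1 have been removed from it.
fn bucket_proximity_2(
    candidates: &RoaringBitmap,
    set1: &RoaringBitmap, // docids of ("run", "forest", 2)
    set2: &RoaringBitmap, // docids of ("forest", "run", 1)
    set3: &RoaringBitmap, // docids of ("run", "forest", 1)
) -> RoaringBitmap {
    // Full formula: 1 | (2 - 3), restricted to the current candidates.
    let full = (set1 | &(set2 - set3)) & candidates;
    // Simplified formula: 1 | 2. Equal to `full` whenever `candidates` no
    // longer contains any document of set 3, because those documents were
    // already handled in the proximity-1 iteration.
    let simplified = (set1 | set2) & candidates;
    debug_assert_eq!(full, simplified);
    simplified
}
```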
Moreover, I think this improvement could enhance the relevancy. Making the distinction between a swap and a distance of 2 would reduce the probability of returning false positives when computing a PHRASE (in a way, this function may be refactored with this change).
To conclude, I think this is a good improvement and I don't see any drawbacks in terms of relevancy. 😄
That's good to hear :-)
Have you also thought about the impact on the `word_prefix_pair_proximity_docids` database and its usage? Since this database is computed from `word_pair_proximity_docids`, it will also be smaller.
I am not too familiar with its exact role in the execution of a search query. I wonder if we'll also want to use the `1 | (2 - 3)` formula there (or the simplification you talked about), or if we want to limit ourselves to searching for non-swapped word pairs (just `1` instead of `1 | (2 - 3)`).
Hey @loiclec, I think if both databases have the same logic it's ok; the resulting documents coming from these databases are mainly computed in search/criteria/mod.rs#L450-L524
Below is an update on this issue. It doesn't require any answer; I am mostly writing it here to document the process :)
@ManyTheFish and I discussed this issue a bit more and found that there is a problem with the `word_prefix_pair_proximity_docids` database following the proposed refactor. For example, if someone types `New York ci` and we have a document in the dataset that contains `the city of New York`, then we would expect that document to be returned among the results, based on the prefix `ci` matching `city` and being in close proximity to `new` and `york`. However, if `ci` is in the `word_prefix_pair_proximity_docids` database, then it would only appear as:

```
the ci 1
```

and not:

```
new ci 3
york ci 4
```

which is what we want.
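As a sanity check, here is a small hypothetical helper (not milli's actual code) that derives the keys above from word positions, assuming the rightward proximity is the position difference, the leftward proximity is that difference plus one, and ignoring any maximum-proximity cutoff:

```rust
/// Enumerate the (word1, word2, proximity) keys for a text, under the
/// assumed scheme: rightward proximity = position difference, and the
/// swapped (leftward) pair gets that difference + 1.
fn proximity_keys(words: &[&str]) -> Vec<(String, String, u32)> {
    let mut keys = Vec::new();
    for i in 0..words.len() {
        for j in (i + 1)..words.len() {
            let rightward = (j - i) as u32;
            // words[i] appears before words[j]: rightward pair...
            keys.push((words[i].into(), words[j].into(), rightward));
            // ...and the corresponding leftward pair.
            keys.push((words[j].into(), words[i].into(), rightward + 1));
        }
    }
    keys
}

fn main() {
    // For "the city of new york", (new, city) gets leftward proximity 3 and
    // (york, city) gets 4 -- matching `new ci 3` and `york ci 4` once the
    // prefix `ci` stands in for `city`.
    for (w1, w2, prox) in proximity_keys(&["the", "city", "of", "new", "york"]) {
        println!("{w1} {w2} {prox}");
    }
}
```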
To solve this problem, we saw two potential solutions:

1. Remove the `word_prefix_pair_proximity_docids` database and always find the words corresponding to a prefix via the FST.
2. Create a new database identical to `word_prefix_pair_proximity_docids`, except that the prefixes come first and are on the left, maybe called `prefix_word_pair_proximity_docids`.

I have started implementing (2) in #639.
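If I understand solution (2) correctly, and assuming the new database also stores only rightward proximities with the prefix as the first key component, the document above would then additionally appear under keys like:

```
ci of 1
ci new 2
ci york 3
```

so the prefix `ci` could again be matched in close proximity to `new` and `york`.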
Furthermore, I also realised that my formula `1 | (2 - 3)` is pretty much completely wrong. For example:

```
TEXT: the man who sometimes plays the violin
ARGS: man the 4

A: ("man", "the", 4) => true
B: ("the", "man", 3) => false
C: ("man", "the", 3) => false

Actual result:  A | (B - C) => true
Correct result: false (because "the man" means "man" has a proximity of 2 with "the")
```
The point is that subtracting `3` to remove the false positives is not enough. A better implementation (sketched below) is to:

1. take the documents containing `(word1, word2, prox)`, minus the documents containing `(word2, word1, p)` where `p < prox - 1`;
2. take the documents containing `(word2, word1, prox - 1)`, minus the documents containing `(word1, word2, p)` where `p < prox`;
3. return the union of `1` and `2`.
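A sketch of this procedure, under the rightward-only scheme (`docids` is a hypothetical lookup closure standing in for a fetch from the database, not milli's actual API):

```rust
use roaring::RoaringBitmap;

/// Documents where `word1` is exactly `prox`-close to `word2` (for prox >= 2),
/// computed from a database that only stores rightward proximities.
fn resolve_exact_proximity(
    docids: &dyn Fn(&str, &str, u32) -> RoaringBitmap,
    word1: &str,
    word2: &str,
    prox: u32,
) -> RoaringBitmap {
    // 1. documents containing (word1, word2, prox), minus those containing
    //    (word2, word1, p) for any p < prox - 1
    let mut part1 = docids(word1, word2, prox);
    for p in 1..prox - 1 {
        part1 -= docids(word2, word1, p);
    }
    // 2. documents containing (word2, word1, prox - 1), minus those
    //    containing (word1, word2, p) for any p < prox
    let mut part2 = docids(word2, word1, prox - 1);
    for p in 1..prox {
        part2 -= docids(word1, word2, p);
    }
    // 3. the union of 1 and 2
    part1 | part2
}
```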
I am still not certain that this is 100% correct or that it is the fastest way to do it.
However, it doesn't really matter because, as @ManyTheFish said:

> The reason is that, in a lot of cases, your equation `1 | (2 - 3)` would be simplified to `1 | 2`, because we can consider that `run` being 1-close to `forest` has already been processed in a previous iteration, and so the involved documents have already been removed from the current iteration.
So we'll just use `1 | 2` as originally proposed and see if any problems arise.
I launched some benchmarks where nothing changed except that we don't store leftward proximities anymore, to get an idea of the potential performance improvements made possible by this change. Based on the results, I would be very happy if we can really reduce the time to index the wiki dataset from 14.5 minutes to 9.4 minutes :)
Currently, for each pair of words in a document, the `word_pair_proximity_docids` database stores keys for both the "rightward" proximity and the "leftward" proximity, which is equal to the rightward proximity + 1.
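As a purely hypothetical illustration (the text and numbers below are invented for this write-up, not taken from the original example), a document containing the text `the quick fox` would yield the keys:

```
the quick 1     (rightward)
quick the 2     (leftward = rightward + 1)
the fox 2       (rightward)
fox the 3       (leftward)
quick fox 1     (rightward)
fox quick 2     (leftward)
```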
I am wondering whether we could reduce the size of the database by only storing the rightward proximity. Given the same kind of text, the database would then contain only the rightward keys.
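For the invented `the quick fox` example above, that would be:

```
the quick 1
the fox 2
quick fox 1
```

i.e. half as many keys.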
This would help in two ways:
I don't think that we would lose any information. In fact, it seems like we would have more information at our disposal to tweak the search results. We can, for example, restrict a search to documents where `word1` is followed by `word2`, or we could tweak the "proximity penalty" of backward word pairs, which is currently set to `+ 1` (I mean that given the word pair `word1 .. word2` with a proximity of `N`, we also record `word2 word1 N+1`).

Another example:

currently yields:

and with the suggested change, would yield:
To find the document ids where `run` is `2`-close to `forest`, we currently search for the single key `run forest 2`. With the new scheme, we need to execute three queries:

1. `run forest 2`
2. `forest run 1`
3. `run forest 1`

Then we compute `1 | (2 - 3)`. The idea is that we want the documents where:

- `run forest` has a rightward proximity of `2`; or
- `run forest` has a leftward proximity of `1`, except those where `run forest` has a rightward proximity of `1`, since the overall minimal proximity between `run` and `forest` in those documents is `1` and not `2`.

This would be slower of course, but I don't know if the performance loss would be noticeable.
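To make the formula concrete, here is a minimal sketch using the `roaring` crate (which milli uses for docid sets); the docid values are invented:

```rust
use roaring::RoaringBitmap;

fn main() {
    // Hypothetical results of the three queries on the rightward-only database:
    let q1: RoaringBitmap = [1u32, 4, 7].into_iter().collect(); // run forest 2
    let q2: RoaringBitmap = [2u32, 4, 9].into_iter().collect(); // forest run 1
    let q3: RoaringBitmap = [2u32, 5].into_iter().collect();    // run forest 1

    // 1 | (2 - 3): the docids this proposal would return for `run forest 2`.
    let result = &q1 | &(&q2 - &q3);

    let expected: RoaringBitmap = [1u32, 4, 7, 9].into_iter().collect();
    assert_eq!(result, expected);
}
```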