Now, the value corresponding to the key prox word1 word2 in the word_pair_proximity_docids database contains the ids of the documents in which:
word1 is followed by word2
the minimum number of words between word1 and word2 is prox-1
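To make the new key semantics concrete, here is a minimal sketch that computes, for a single tokenized document, the smallest forward proximity of every ordered word pair. The function name `min_forward_proximities` and the `max_prox` parameter are assumptions made for illustration; they are not taken from the PR, and the sketch uses an in-memory map rather than the real database.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not part of the PR): for one tokenized document,
/// return the smallest forward proximity of every ordered word pair.
/// A proximity of `p` means `word1` comes first and `p - 1` words sit
/// between `word1` and `word2`.
fn min_forward_proximities(
    words: &[&str],
    max_prox: usize,
) -> BTreeMap<(String, String), usize> {
    let mut best = BTreeMap::new();
    for (i, w1) in words.iter().enumerate() {
        // Only look ahead: word1 must be followed by word2.
        for (offset, w2) in words[i + 1..].iter().take(max_prox).enumerate() {
            let prox = offset + 1;
            let entry = best
                .entry((w1.to_string(), w2.to_string()))
                .or_insert(prox);
            // Keep the minimum proximity seen for this ordered pair.
            if prox < *entry {
                *entry = prox;
            }
        }
    }
    best
}
```

Under these assumptions, each `((word1, word2), prox)` entry would correspond to adding the document id to the value stored under the key prox word1 word2. The reverse pair (word2, word1) only gets an entry if word2 also appears before word1 somewhere in the document.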
Before this PR, the keys of word_pair_proximity_docids had the format word1 word2 prox, and the value contained the ids of the documents in which either:
word1 is followed by word2 after a minimum of prox-1 words in between them
word2 is followed by word1 after a minimum of prox-2 words
As a consequence of this change, calls such as:
let docids = word_pair_proximity_docids.get(rtxn, (word1, word2, prox));
have to be replaced with:
let docids1 = word_pair_proximity_docids.get(rtxn, (prox, word1, word2));
let docids2 = word_pair_proximity_docids.get(rtxn, (prox-1, word2, word1));
let docids = docids1 | docids2;
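The replacement pattern above can be sketched against an in-memory stand-in for the database. Everything here is an assumption for illustration: `PairIndex`, `lookup`, and `symmetric_docids` are hypothetical names, and `BTreeSet<u32>` stands in for whatever bitmap type the real database returns.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical in-memory stand-ins for the database and its values.
type DocIds = BTreeSet<u32>;
type PairIndex = BTreeMap<(u8, String, String), DocIds>;

/// Look up one `prox word1 word2` key; a missing key yields an empty set.
fn lookup(index: &PairIndex, prox: u8, w1: &str, w2: &str) -> DocIds {
    index
        .get(&(prox, w1.to_string(), w2.to_string()))
        .cloned()
        .unwrap_or_default()
}

/// Rebuild the old symmetric answer from the new directional keys:
/// (prox, word1, word2) united with (prox - 1, word2, word1).
fn symmetric_docids(index: &PairIndex, w1: &str, w2: &str, prox: u8) -> DocIds {
    let mut docids = lookup(index, prox, w1, w2);
    // `checked_sub` avoids underflow if a caller passes prox == 0.
    if let Some(reverse_prox) = prox.checked_sub(1) {
        docids.extend(lookup(index, reverse_prox, w2, w1));
    }
    docids
}
```

The union is done with `extend` here; with a bitmap type the `|` operator from the snippet above would play the same role.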
Phrase search
The PR also fixes two bugs in the resolve_phrase function. The first bug is that a phrase containing the same word twice (e.g. "dog eats dog") would always return zero documents.
The second bug occurs with a phrase such as "fox is smarter than a dog" and a document containing the text:
fox or dog? a fox is smarter than a dog
In that case, the phrase search would not return the document because:
we only have the key fox dog 2 in word_pair_proximity_docids
but the implementation of resolve_phrase looks for fox dog 5, which returns 0 documents
New implementation of resolve_phrase
Given the phrase:
fox is smarter than a dog
We select the document ids corresponding to all of the following keys in word_pair_proximity_docids:
1 fox is
1 is smarter
1 smarter than
(etc.)
1 fox smarter OR 2 fox smarter
1 is than OR 2 is than
...
1 than dog OR 2 than dog
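The key selection above can be sketched as follows. This is only an illustration under stated assumptions, not the PR's actual code: `PairIndex`, `docids_within`, and `resolve_phrase_sketch` are hypothetical names, the database is replaced by an in-memory map, and `max_dist` is a parameter because the list above only shows pairs up to two positions apart.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical in-memory stand-ins for the database and its values.
type DocIds = BTreeSet<u32>;
type PairIndex = BTreeMap<(u8, String, String), DocIds>;

/// Union of the docids stored under `prox word1 word2` for prox in 1..=max.
/// This is the "1 fox smarter OR 2 fox smarter" step.
fn docids_within(index: &PairIndex, w1: &str, w2: &str, max: u8) -> DocIds {
    let mut out = DocIds::new();
    for prox in 1..=max {
        if let Some(ids) = index.get(&(prox, w1.to_string(), w2.to_string())) {
            out.extend(ids.iter().copied());
        }
    }
    out
}

/// Intersect the candidate docids of every ordered pair of phrase words
/// that are at most `max_dist` positions apart (assumed to fit in a u8).
fn resolve_phrase_sketch(index: &PairIndex, phrase: &[&str], max_dist: usize) -> DocIds {
    let mut result: Option<DocIds> = None;
    for (i, w1) in phrase.iter().enumerate() {
        for (offset, w2) in phrase[i + 1..].iter().take(max_dist).enumerate() {
            let dist = (offset + 1) as u8;
            let candidates = docids_within(index, w1, w2, dist);
            result = Some(match result {
                None => candidates,
                Some(acc) => acc.intersection(&candidates).copied().collect(),
            });
        }
    }
    // A phrase with fewer than two words produces no pairs, hence no docids.
    result.unwrap_or_default()
}
```

Because the database stores only the minimum proximity of a pair, a document that contains the phrase may store a pair under a smaller proximity than its distance in the phrase. Accepting every proximity from 1 up to that distance covers this case.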
Benchmark Results
Indexing:
group                                                                   indexing_main_d94339a8          indexing_word-pair-proximity-docids-refactor_2983dd8e
-----                                                                   ----------------------          -----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-               1.19  40.7±11.28ms   ? ?/sec    1.00  34.3±4.16ms    ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-         1.62  11.3±3.77ms    ? ?/sec    1.00  7.0±1.56ms     ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-  1.00  12.5±2.62ms    ? ?/sec    1.07  13.4±4.24ms    ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-          1.26  50.2±12.63ms   ? ?/sec    1.00  39.8±20.25ms   ? ?/sec
indexing/-wiki-delete-searchable-                                       1.83  269.1±16.11ms  ? ?/sec    1.00  146.8±6.12ms   ? ?/sec
indexing/Indexing geo_point                                             1.00  47.2±0.46s     ? ?/sec    1.00  47.3±0.56s     ? ?/sec
indexing/Indexing movies in three batches                               1.42  12.7±0.13s     ? ?/sec    1.00  9.0±0.07s      ? ?/sec
indexing/Indexing movies with default settings                          1.40  10.2±0.07s     ? ?/sec    1.00  7.3±0.06s      ? ?/sec
indexing/Indexing nested movies with default settings                   1.22  7.8±0.11s      ? ?/sec    1.00  6.4±0.13s      ? ?/sec
indexing/Indexing nested movies without any facets                      1.24  7.3±0.07s      ? ?/sec    1.00  5.9±0.06s      ? ?/sec
indexing/Indexing songs in three batches with default settings          1.14  47.6±0.67s     ? ?/sec    1.00  41.8±0.63s     ? ?/sec
indexing/Indexing songs with default settings                           1.13  44.1±0.74s     ? ?/sec    1.00  38.9±0.76s     ? ?/sec
indexing/Indexing songs without any facets                              1.19  42.0±0.66s     ? ?/sec    1.00  35.2±0.48s     ? ?/sec
indexing/Indexing songs without faceted numbers                         1.20  44.3±1.40s     ? ?/sec    1.00  37.0±0.48s     ? ?/sec
indexing/Indexing wiki                                                  1.39  862.9±9.95s    ? ?/sec    1.00  622.6±27.11s   ? ?/sec
indexing/Indexing wiki in three batches                                 1.40  934.4±5.97s    ? ?/sec    1.00  665.7±4.72s    ? ?/sec
indexing/Reindexing geo_point                                           1.01  15.9±0.39s     ? ?/sec    1.00  15.7±0.28s     ? ?/sec
indexing/Reindexing movies with default settings                        1.15  288.8±25.03ms  ? ?/sec    1.00  250.4±2.23ms   ? ?/sec
indexing/Reindexing songs with default settings                         1.01  4.1±0.06s      ? ?/sec    1.00  4.1±0.03s      ? ?/sec
indexing/Reindexing wiki                                                1.41  1484.7±20.59s  ? ?/sec    1.00  1052.0±19.89s  ? ?/sec
Pull Request
What does this PR do?
Fixes #634