Closed ManyTheFish closed 2 years ago
Hi @ManyTheFish, @curquiza, this sounds interesting, and would like to work on it if not been taken already. I have a few queries, it would be great if you could answer them. Thanks in advance.
I'd like to work on this issue
Hi @ManyTheFish, @curquiza, this sounds interesting, and would like to work on it if not been taken already. I have a few queries, it would be great if you could answer them. Thanks in advance.
Does function split_best_frequency, now needs to search for word pairs in word_pair_proximity_docids apart from individual word search in word_docids?
- What will be the value of proximity when we search for pairs in word_pair_proximity_docids? Proximity is of type u8, do we need to search for all proximity values from 0 to 255 for each word split combination? This will not be efficient though.
split_best_frequency
should only compute the probability of the pair of words side to side, for the implementation you will only need to check a word_pair_proximity_docids
with proximity of 1. Something like:
let key = (left_word, right_word, 1);
self.index.word_pair_proximity_docids.get(self.rtxn, &key)
you can get more or less copy-paste the method word_pair_proximity_docids in the criteria.
- For writing test cases for this implementation, do we need to add a test database that mimics word_pair_proximity_docids similar to postings used for word_docids in the TestContext?
No, it's not necessary
Hello @akki1306 and @fuushyn
Thanks for your interest in this project 🔥 You are definitely more than welcome to open a PR for this! Our contributing guidelines are available here and will guide you through your contribution. If you have any question you can also ask them here!
FYI, we prefer not assigning people to our issues because sometimes people ask to be assigned and never come back, which discourages the volunteer contributors from opening a PR to fix this issue. We will accept and merge the first PR that fixes correctly and well implements the issue following our contributing guidelines.
We are looking forward to reviewing your PR 😊
@ManyTheFish I am trying to run cargo test
and getting the following exception for some of the tests intermittently, the error points to IO error with invalid argument, its not able to create TempIndex.
thread 'update::word_prefix_pair_proximity_docids::tests::test_update' panicked at 'called
Result::unwrap()on an
Errvalue: IoError(Os { code: 22, kind: InvalidInput, message: "Invalid argument" })', milli/src/index.rs:1224:62
If I use cargo test -- --test-threads=1
, this error doesn't occur. Also, on running these tests individually they do not fail.
I am using MACOS Monterey 12.6.
Am I missing something while running the test?
Because the tests are running on several threads, you'll have a lot of opened files at the same time, I suggest increasing the limit of opened files.
ulimit -Sn 10000
should fix your issue 😄
Thanks for the reply @ManyTheFish, I could run the tests successfully after increasing the open file count :). I have raised the PR. Please take a look.
Today the word splitting strategy of the query tree is handled by the function
split_best_frequency
. This function split a word into two sub-words by looking at the frequency of the less frequent sub-word of each possible pair.drawback
However, this frequency computation doesn't represent faithfully the frequency of the pair because these two sub-words can be considered individually frequent without being frequently near together in documents.
possible enhancement
Inspired by the
index.rs#word_documents_count
method, a new methodword_pair_frequency
could be implemented in the traitsearch/query_tree.rs#Context
using theword_pair_proximity_docids
database instead of theword_docids
one.small warnings
Context
is derived 2 times, one in search/query_tree.rs#QueryTreeBuilder which is the real implementation, and one in the testsword_pair_proximity_docids
has a different CODEC thanword_docids
but a length decoder is already implemented in heed_codec/roaring_bitmap_length/Files expected to be modified