Closed ManyTheFish closed 1 year ago
Hi, I would like to give it a try. I am not a Rust beginner (intermediate), but I have also never contributed to an open-source project 😅.
I have created the new database in index.rs.
I have created the HashMap in extract_docid_word_positions.rs, like so: <Script, RoaringBitmapCodec>, getting the Script and Language provided by the Token.
I have also added the new variant in typed_chunk.rs and am trying to iterate over the HashMap to add each pair to the new database (merging too, in case of existing values?).
I am struggling with how to serialize the struct enum Script from charabia. This is related to creating a specialized codec.
Script and Language?
Hello @f3r10,
First, could you open a draft PR on this repo? This way I can directly make suggestions on your code in order to help you.
I am struggling with how to serialize the struct enum Script from charabia. This is related to creating a specialized codec.
Indeed, it is an issue. I've created a branch on Charabia, impl-serde-on-enums, where Script and Language both have a from_name and a name that allow you to create your codec.
If you don't know how to use my branch, replace the charabia dependency in Cargo.toml with:
charabia = { git = "https://github.com/meilisearch/charabia.git", branch = "impl-serde-on-enums", default-features = false }
Awesome!! Thanks, @ManyTheFish, for the feedback. I am going to try to implement the specialized codec.
Great, it is compiling now!! Tomorrow I will read about opening a draft PR and push the current changes. Btw, I still have to add the tests 😅
Hi @ManyTheFish, I think I got it. I have just added the corresponding tests. What would be the next steps?
Discussed this with @ManyTheFish: the work started with meilisearch/milli#660, but it has been merged into a temporary branch, meilisearch:enhance-language-detection. It still needs to be merged into main with the appropriate changes.
@ManyTheFish could define these changes 👀
Hey @curquiza, here is the issue defining the changes needed to have the whole feature: https://github.com/meilisearch/meilisearch/issues/3357
Closed in favor of https://github.com/meilisearch/meilisearch/issues/3357, which we will move into meilisearch soon 💪
Summary
Meilisearch automatically detects the Script and the Language during indexing and search. Because search queries only contain short texts, it is almost impossible to reliably detect the Language they use. During indexing, however, Meilisearch receives complete documents, on which it is easier to detect the Language. So, instead of knowing the Language used in the search query, we could know the Language used in the data we are searching in.
related to: https://github.com/meilisearch/product/discussions/532#discussioncomment-3709627
Technical approach
Create a new database
The first step is to create a new database named script_language_docids in the Index. It stores, as the key, the Script concatenated to the Language and, as the value, a RoaringBitmap containing all the concerned docids. Be aware that the key needs a specialized codec.
Related files:
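To illustrate the specialized codec mentioned above, here is a minimal sketch of one possible key layout: the Script name, a zero byte as separator, then the Language name. The layout itself is an assumption, not something this issue specifies. The Script and Language enums below are hypothetical stand-ins for the charabia ones, reduced to a few variants, and their name and from_name signatures are assumed for illustration. A real codec would also have to implement the encode/decode traits of the storage layer, which this sketch leaves out.

```rust
// Hypothetical stand-ins for charabia's `Script` and `Language` enums,
// reduced to a few variants so the sketch compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Script {
    Latin,
    Cyrillic,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    Eng,
    Fra,
    Rus,
}

impl Script {
    // Assumed signature: the real `name` may differ.
    pub fn name(&self) -> &'static str {
        match self {
            Script::Latin => "Latin",
            Script::Cyrillic => "Cyrillic",
        }
    }

    // Assumed signature: the real `from_name` may differ.
    pub fn from_name(name: &str) -> Option<Script> {
        match name {
            "Latin" => Some(Script::Latin),
            "Cyrillic" => Some(Script::Cyrillic),
            _ => None,
        }
    }
}

impl Language {
    pub fn name(&self) -> &'static str {
        match self {
            Language::Eng => "eng",
            Language::Fra => "fra",
            Language::Rus => "rus",
        }
    }

    pub fn from_name(name: &str) -> Option<Language> {
        match name {
            "eng" => Some(Language::Eng),
            "fra" => Some(Language::Fra),
            "rus" => Some(Language::Rus),
            _ => None,
        }
    }
}

/// Encodes the key as `script name | 0x00 | language name` (assumed layout).
pub fn encode_key(script: Script, language: Language) -> Vec<u8> {
    let (s, l) = (script.name(), language.name());
    let mut bytes = Vec::with_capacity(s.len() + 1 + l.len());
    bytes.extend_from_slice(s.as_bytes());
    bytes.push(0);
    bytes.extend_from_slice(l.as_bytes());
    bytes
}

/// Decodes a key produced by `encode_key`; returns `None` on malformed input.
pub fn decode_key(bytes: &[u8]) -> Option<(Script, Language)> {
    let sep = bytes.iter().position(|&b| b == 0)?;
    let script = std::str::from_utf8(&bytes[..sep]).ok()?;
    let language = std::str::from_utf8(&bytes[sep + 1..]).ok()?;
    Some((Script::from_name(script)?, Language::from_name(language)?))
}
```

Because the Script name comes first, all the keys of a given Script are contiguous under a lexicographic key order, which would make prefix iteration by Script possible.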
Extract and index data
During word position extraction, we should store the detected Languages in a hashmap linked with the docids, in order to send the hashmap to the main thread at the end of the extraction task. The main thread will then have to store this data in the script_language_docids database. Be aware that the same document can contain several Languages, and so should be indexed as the value of several Script/Language pairs.
Related files:
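The extraction and merge flow described above could look roughly like the following sketch. Everything here is a simplification made for illustration: a BTreeSet of docids stands in for a RoaringBitmap, a pair of string slices stands in for the Script and Language enums, a BTreeMap stands in for the script_language_docids database, and the function names record_document and merge_into_database are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

type DocumentId = u32;
// Stand-in for a RoaringBitmap of document ids.
type DocIds = BTreeSet<DocumentId>;
// Stand-in for a (Script, Language) pair.
type ScriptLanguage = (&'static str, &'static str);

/// Extraction side: records every pair detected in one document.
/// A document with several Languages ends up under several keys.
pub fn record_document(
    map: &mut HashMap<ScriptLanguage, DocIds>,
    docid: DocumentId,
    detected: &[ScriptLanguage],
) {
    for pair in detected {
        map.entry(*pair).or_default().insert(docid);
    }
}

/// Main-thread side: merges the extracted hashmap into the database,
/// taking the union with any docids already stored under the same key.
pub fn merge_into_database(
    db: &mut BTreeMap<ScriptLanguage, DocIds>,
    chunk: HashMap<ScriptLanguage, DocIds>,
) {
    for (pair, docids) in chunk {
        db.entry(pair).or_default().extend(docids);
    }
}
```

Merging by union rather than overwriting matters because a key may already hold docids from a previous indexing batch.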
Delete data
When removing documents, we should take care of removing the corresponding docids from the script_language_docids database. Likewise, when the database is cleared, the script_language_docids database should be cleared too.
Related files:
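As a rough sketch of the two deletion paths described above, under the same simplifications (a BTreeMap standing in for the script_language_docids database, a BTreeSet standing in for a RoaringBitmap, string pairs standing in for Script and Language, and hypothetical function names). Dropping entries whose set becomes empty is a design choice of this sketch, not something the issue requires.

```rust
use std::collections::{BTreeMap, BTreeSet};

type DocumentId = u32;
// Stand-in for a RoaringBitmap of document ids.
type DocIds = BTreeSet<DocumentId>;
// Stand-in for a (Script, Language) pair.
type ScriptLanguage = (&'static str, &'static str);
// Stand-in for the script_language_docids database.
type Database = BTreeMap<ScriptLanguage, DocIds>;

/// Removes the deleted docids from every entry and drops
/// the entries that no longer reference any document.
pub fn delete_documents(db: &mut Database, to_delete: &DocIds) {
    db.retain(|_pair, docids| {
        docids.retain(|id| !to_delete.contains(id));
        !docids.is_empty()
    });
}

/// Clearing all documents clears this database as well.
pub fn clear_documents(db: &mut Database) {
    db.clear();
}
```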
Todo