Store detected Language per document during indexing

ManyTheFish commented 1 year ago

⚠️ This issue is not an easy one: it requires some knowledge of Rust and more work than the other issues. I strongly encourage beginners to pick another issue.

Summary

Meilisearch automatically detects the Script and the Language during indexing and search. Because search queries contain only short texts, it is almost impossible to detect the Language used accurately. During indexing, however, Meilisearch receives complete documents, on which the Language is easier to detect. So, instead of knowing the Language used in the search query, we could know the Language used in the data we search in.

Technical approach

Create a new database

The first step is to create a new database in the Index, named script_language_docids, that stores as the key the Script concatenated with the Language, and as the value a RoaringBitmap containing all the concerned docids. Be aware that the key needs a specialized codec.
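To make the "specialized codec" idea concrete, here is a minimal, std-only sketch of one possible key layout: the script name, a zero byte as separator, then the language name. Everything in it is an assumption made for illustration: the `Script` and `Language` enums are tiny stand-ins for the charabia ones, the `name`/`from_name` helpers and their string labels are hypothetical, and the real codec would implement the traits of the key-value store used by milli rather than free functions.

```rust
// Hypothetical stand-ins for charabia's `Script` and `Language` enums.
// The real enums have many more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Latin,
    Cyrillic,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Eng,
    Rus,
}

impl Script {
    // Hypothetical label used only for this sketch.
    fn name(self) -> &'static str {
        match self {
            Script::Latin => "latin",
            Script::Cyrillic => "cyrillic",
        }
    }

    fn from_name(name: &str) -> Option<Self> {
        match name {
            "latin" => Some(Script::Latin),
            "cyrillic" => Some(Script::Cyrillic),
            _ => None,
        }
    }
}

impl Language {
    // Hypothetical label used only for this sketch.
    fn name(self) -> &'static str {
        match self {
            Language::Eng => "eng",
            Language::Rus => "rus",
        }
    }

    fn from_name(name: &str) -> Option<Self> {
        match name {
            "eng" => Some(Language::Eng),
            "rus" => Some(Language::Rus),
            _ => None,
        }
    }
}

/// Assumed key layout: `<script name> 0x00 <language name>`.
fn encode_key(script: Script, language: Language) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(script.name().as_bytes());
    key.push(0);
    key.extend_from_slice(language.name().as_bytes());
    key
}

/// Returns `None` when the separator is missing or a name is unknown.
fn decode_key(bytes: &[u8]) -> Option<(Script, Language)> {
    let sep = bytes.iter().position(|&b| b == 0)?;
    let script = std::str::from_utf8(&bytes[..sep]).ok()?;
    let language = std::str::from_utf8(&bytes[sep + 1..]).ok()?;
    Some((Script::from_name(script)?, Language::from_name(language)?))
}
```

Putting the script first means that all keys of one script share a prefix, which could be convenient for prefix iteration, but that ordering is a design choice of this sketch, not something the issue prescribes.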

related files:

Extract and index data

During word position extraction we should store the detected languages in a hashmap linked with the docids in order to send the hashmap to the main thread at the end of the extraction task. Then the main thread will have to store these data in the script_language_docids database. Be aware that the same document can contain several Languages, and so, should be indexed as the value of several Script/Language pairs.
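The flow described above could look roughly like the following std-only sketch. It is an illustration under assumptions, not the milli implementation: `BTreeSet<u32>` stands in for `RoaringBitmap`, the small `Script`/`Language` enums stand in for the charabia ones, a `HashMap` stands in for the script_language_docids database, and the function names `record_document` and `merge_into_database` are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

type DocId = u32;
// Stand-in for `RoaringBitmap`, to keep the sketch free of external crates.
type DocIds = BTreeSet<DocId>;

// Hypothetical stand-ins for charabia's enums.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Script {
    Latin,
    Cyrillic,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Language {
    Eng,
    Fra,
    Rus,
}

type ScriptLanguageDocids = HashMap<(Script, Language), DocIds>;

/// Extraction side: record every Script/Language pair detected in one document.
/// A document with several languages ends up under several keys.
fn record_document(
    extracted: &mut ScriptLanguageDocids,
    docid: DocId,
    detected: &[(Script, Language)],
) {
    for &pair in detected {
        extracted.entry(pair).or_default().insert(docid);
    }
}

/// Main-thread side: merge the extracted map into the database,
/// taking the union with any docids already stored under the same key.
fn merge_into_database(database: &mut ScriptLanguageDocids, extracted: ScriptLanguageDocids) {
    for (pair, docids) in extracted {
        database.entry(pair).or_default().extend(docids);
    }
}
```

Merging by union answers the "existing values" case: if a key already holds docids from an earlier indexing batch, the new docids are added to them rather than replacing them.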

related files:

Delete data

When removing documents, we should take care of removing the corresponding docids from the script_language_docids database. Then, when the database is cleared, the script_language_docids database should be cleared too.
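A minimal sketch of the deletion side, under the same kind of assumptions: `BTreeSet<u32>` stands in for `RoaringBitmap`, a `HashMap` keyed by hypothetical (script name, language name) labels stands in for the script_language_docids database, and `delete_documents` and `clear_index` are illustrative names. Dropping entries that become empty is a choice made in this sketch; the issue does not say whether empty entries should be kept.

```rust
use std::collections::{BTreeSet, HashMap};

type DocId = u32;
// Stand-in for `RoaringBitmap`, to keep the sketch free of external crates.
type DocIds = BTreeSet<DocId>;
// Keys are hypothetical (script name, language name) labels.
type ScriptLanguageDocids = HashMap<(&'static str, &'static str), DocIds>;

/// Remove the deleted docids from every Script/Language entry.
/// Entries left without any docid are dropped (an assumption of this sketch).
fn delete_documents(database: &mut ScriptLanguageDocids, to_delete: &DocIds) {
    database.retain(|_pair, docids| {
        docids.retain(|docid| !to_delete.contains(docid));
        !docids.is_empty()
    });
}

/// Clearing the index clears this database as well.
fn clear_index(database: &mut ScriptLanguageDocids) {
    database.clear();
}
```

With a real `RoaringBitmap`, the inner `retain` would more naturally be a set difference between the stored bitmap and the bitmap of deleted docids.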

related files:

Todo

- [x] create a new database
  - [x] implementation
- [x] update this database during indexing
  - [x] implementation
  - [x] tests
- [x] update this database during deletion
  - [x] implementation
  - [x] tests

f3r10 commented 1 year ago

Hi, I would like to give it a try. I am not a Rust beginner (intermediate), but I have never contributed to an open-source project 😅.

I have created the new database in index.rs.
I have created the HashMap in extract_docid_word_positions.rs, typed as <Script, RoaringBitmapCodec>, and I am getting the Script and Language provided by Token.
- In this part, the key should be Script + Language, which requires creating a specialized codec.
I have also added the new variant in typed_chunk.rs and am trying to iterate over the HashMap to add the pairs to the new database (merging too in case of existing values?).
I am struggling with how to serialize the Script enum from charabia. This is related to creating a specialized codec.
- Would it be necessary to use serde and bincode in charabia to derive serialization for Script and Language?
- Or is there another way?

ManyTheFish commented 1 year ago

hello @f3r10,

First, could you open a draft PR on this repo? This way I can directly make suggestions on your code in order to help you.

> I am struggling with how to serialize the Script enum from charabia. This is related to creating a specialized codec.

Indeed, it is an issue. I've created a branch on Charabia, impl-serde-on-enums, where Script and Language both have a from_name and a name, which allow you to create your codec. If you don't know how to use my branch, replace the charabia dependency in the Cargo.toml with:

charabia = { git = "https://github.com/meilisearch/charabia.git", branch = "impl-serde-on-enums", default-features = false }

f3r10 commented 1 year ago

Awesome!! Thanks, @ManyTheFish, for the feedback. I am going to try to implement the specialized codec.

f3r10 commented 1 year ago

Great, it is compiling now!! Tomorrow I will read about opening a draft PR and push the current changes. Btw, I still have to add the tests 😅

f3r10 commented 1 year ago

Hi @ManyTheFish, I think I got it. I have just added the corresponding tests. What would be the next steps?

curquiza commented 1 year ago

Discussed this with @ManyTheFish: the work started with meilisearch/milli#660, which has been merged into a temporary branch, meilisearch:enhance-language-detection. It still needs to be merged into main with the appropriate changes.

@ManyTheFish could define these changes 👀

ManyTheFish commented 1 year ago

Hey @curquiza, the issue linked below defines the changes needed to have the whole feature: https://github.com/meilisearch/meilisearch/issues/3357

curquiza commented 1 year ago

Closed in favor of https://github.com/meilisearch/meilisearch/issues/3357, which we will move into meilisearch soon 💪
