Open mattico opened 6 years ago
Are you planning to add all the languages from that repo?
No, just the ones that I can easily find stemmers for. Right now that's the languages supported by https://github.com/CurrySoftware/rust-stemmers minus Hungarian since that one didn't match lunr-languages' output. There are a few more languages that could be added pretty easily by running the snowball compiler, but I don't think I'll go through the effort unless someone actually wants them.
So if I would like to add support for the Polish language, I need to add it to snowball first?
You need a Rust implementation and a JavaScript implementation that are compatible with each other. The snowball compiler is one way to generate both implementations, but you could port an algorithm manually as well.
Hi @mattico! I'd like to help you make the multi-language integration happen. Could you please provide some guidance?
First, to be clear, multi-language means a search index that supports content written in multiple languages, i.e. a single document which contains multiple languages. We already support searching many languages individually.
Second, the main constraint of the implementation is to be compatible with the JavaScript implementation. So the starting point for any addition should be understanding how the JavaScript implementation works and converting it. The readme of elasticlunr.js says it can use https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.multi.js. Tests should be added which generate an index using the JavaScript implementation and compare it to an index generated using the Rust implementation.
More specifically it looks like lunr.multi.js takes a bunch of language pipelines as arguments and combines them together into one. Language pipelines have a few distinct parts which are run sequentially:
`\w`. lunr.multi.js just concatenates all the valid characters into one string and then removes the union of the characters for trimming.
These all get combined into a pipeline, which is just a list of functions which each get run sequentially on each input token to produce the output token. The `MultiLanguage` language can take a number of languages as an argument and then combine them into one pipeline as above.
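To make the idea of combining pipelines concrete, here is a minimal sketch in plain Rust. It is not the elasticlunr-rs or lunr.multi.js code: `Stage`, `Pipeline`, `trimmer`, `stop_word_filter`, `multi_language_pipeline` and `example_pipeline` are all hypothetical names, the stop-word lists are toy data, and the trimmer assumes that "trimming" means stripping leading and trailing characters that are not in the combined set of valid characters.

```rust
/// A pipeline stage maps a token to a new token, or drops it by returning None.
/// (Hypothetical type, for illustration only.)
type Stage = Box<dyn Fn(String) -> Option<String>>;

/// A pipeline is just an ordered list of stages.
struct Pipeline {
    stages: Vec<Stage>,
}

impl Pipeline {
    /// Run every stage in order; stop early if a stage drops the token.
    fn run(&self, token: &str) -> Option<String> {
        let mut current = token.to_string();
        for stage in &self.stages {
            current = stage(current)?;
        }
        Some(current)
    }
}

/// Assumed trimmer behaviour: strip leading/trailing characters that are
/// not in `word_chars`, and drop the token if nothing is left.
fn trimmer(word_chars: String) -> Stage {
    Box::new(move |token: String| {
        let trimmed = token.trim_matches(|c: char| !word_chars.contains(c));
        if trimmed.is_empty() {
            None
        } else {
            Some(trimmed.to_string())
        }
    })
}

/// Toy stop-word filter: drops tokens that appear in the list.
fn stop_word_filter(stop_words: Vec<&'static str>) -> Stage {
    Box::new(move |token: String| {
        if stop_words.iter().any(|w| *w == token.as_str()) {
            None
        } else {
            Some(token)
        }
    })
}

/// Concatenate the valid characters of every language into one string,
/// build a single shared trimmer from it, then append each language's
/// remaining stages in the order the languages were given.
fn multi_language_pipeline(languages: Vec<(&str, Vec<Stage>)>) -> Pipeline {
    let mut word_chars = String::new();
    let mut rest: Vec<Stage> = Vec::new();
    for (chars, lang_stages) in languages {
        word_chars.push_str(chars);
        rest.extend(lang_stages);
    }
    let mut stages = vec![trimmer(word_chars)];
    stages.extend(rest);
    Pipeline { stages }
}

/// Example combination of two languages with toy character sets and stop words.
fn example_pipeline() -> Pipeline {
    let english = (
        "abcdefghijklmnopqrstuvwxyz",
        vec![stop_word_filter(vec!["the", "and"])],
    );
    let russian = (
        "абвгдеёжзийклмнопрстуфхцчшщъыьэюя",
        vec![stop_word_filter(vec!["и", "в"])],
    );
    multi_language_pipeline(vec![english, russian])
}
```

With this sketch, `example_pipeline().run("«привет!»")` strips the surrounding punctuation, while `example_pipeline().run("the")` returns `None` because the English stop-word stage drops it. A real implementation would add the stemmers and would have to match the JavaScript output exactly.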
Thanks a lot! Everything seems to be clear :)
Thank you for such a thorough answer, @mattico!
I've managed to implement support for the Russian and English languages. Unfortunately, I neither made a universal solution for all possible combinations of languages nor covered it with tests.
I hope I will find some spare time in the near future to implement universal support properly and send a PR.
Btw, I've also encountered a weird issue with `IndexBuilder`: for some reason, using `IndexBuilder` instead of `Index::new` gave me different results, despite the identical parameters. I can't say whether it is an issue with `IndexBuilder` itself, or with our overall setup. The issue was fixed by replacing `BTreeSet` with `Vec` in `IndexBuilder`, so perhaps the order of fields affects the generated index. I'll create an issue if my further investigation shows something.
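As a small illustration of why swapping the collection type could matter, the sketch below only shows the ordering difference between the two standard containers. It is not the actual `IndexBuilder` code; `fields_via_set` and `fields_via_vec` are hypothetical helpers, and whether field order is really the cause is still just a guess.

```rust
use std::collections::BTreeSet;

/// Collect field names through a BTreeSet: the result is sorted
/// lexicographically and duplicates are removed.
fn fields_via_set(fields: &[&str]) -> Vec<String> {
    let set: BTreeSet<String> = fields.iter().map(|f| f.to_string()).collect();
    set.into_iter().collect()
}

/// Collect field names through a Vec: insertion order is preserved.
fn fields_via_vec(fields: &[&str]) -> Vec<String> {
    fields.iter().map(|f| f.to_string()).collect()
}
```

For the input `["title", "body"]`, the set-based version yields `body` before `title`, while the `Vec` version keeps `title` first. If anything downstream depends on the position of a field, the two would produce different indexes from identical parameters.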