mattico / elasticlunr-rs

A partial port of elasticlunr to Rust. Intended to be used for generating compatible search indices.
Apache License 2.0

Add multilanguage support #13

Open mattico opened 6 years ago

mattico commented 6 years ago

https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.multi.js

Keats commented 6 years ago

Are you planning to add all the languages from that repo?

mattico commented 6 years ago

No, just the ones that I can easily find stemmers for. Right now that's the languages supported by https://github.com/CurrySoftware/rust-stemmers minus Hungarian since that one didn't match lunr-languages' output. There are a few more languages that could be added pretty easily by running the snowball compiler, but I don't think I'll go through the effort unless someone actually wants them.

xoac commented 3 years ago

So if I would like to add support for the Polish language, I need to add it to Snowball first?

mattico commented 3 years ago

You need a rust implementation, and a javascript implementation that are both compatible. The snowball compiler is one way to generate both implementations, but you could port an algorithm manually as well.

mexus commented 3 years ago

Hi @mattico! I'd like to help make the multi-language integration happen. Could you please provide some guidance?

mattico commented 3 years ago

First, to be clear, multi-language means a search index that supports content written in multiple languages, i.e. a single document which contains multiple languages. We already support searching many languages individually.

Second, the main constraint of the implementation is to be compatible with the Javascript implementation. So the starting point for any addition should be understanding how the Javascript implementation works and converting it. The readme of elasticlunr.js says it can use https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.multi.js. Tests should be added that generate an index using the javascript implementation and compare it to an index generated using the rust implementation.

More specifically it looks like lunr.multi.js takes a bunch of language pipelines as arguments and combines them together into one. Language pipelines have a few distinct parts which are run sequentially:

  1. A tokenizer, which splits the input text into words at whitespace characters. Supporting some languages properly is more difficult: Chinese, for example, doesn't generally use spaces to delineate words per se, so it needs a segmentation algorithm, which couldn't be combined in this way and would need to be run sequentially. We are limited, though, by staying compatible with the Javascript implementation. If we want to do things properly we could ship our own modified javascript plugin for people to use.
  2. A trimmer, which removes invalid characters from the beginning and end of words. You can see that English just uses the regex \w. lunr.multi.js concatenates all the valid characters into one string and then trims using the union of those characters.
  3. A stop word filter, which removes words that make search results worse (https://en.wikipedia.org/wiki/Stop_word). Again, these can be combined into one large stop word filter, just a HashSet in our case.
  4. A stemmer (https://en.wikipedia.org/wiki/Stemming), which reduces words to their basic form by removing prefixes and suffixes, etc. These can be very different code for each language, so the only option is to run each stemmer sequentially on each input word.
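As a sketch of how the trimmer step (point 2) could be combined across languages: each language contributes a set of valid characters, the sets are unioned, and anything outside the union is trimmed from both ends of a token. Names and character sets below are illustrative only, not the crate's actual API:

```rust
use std::collections::HashSet;

// Hypothetical helper: build one trimmer from several languages'
// valid-character sets, mirroring how lunr.multi.js concatenates them.
fn combined_trimmer(valid_chars_per_lang: &[&str]) -> impl Fn(&str) -> String {
    // Union of every language's valid characters.
    let valid: HashSet<char> = valid_chars_per_lang
        .iter()
        .flat_map(|s| s.chars())
        .collect();
    // Trim any character outside the union from both ends of the token.
    move |token: &str| token.trim_matches(|c| !valid.contains(&c)).to_string()
}

fn main() {
    // Illustrative character sets for English and German (not exhaustive).
    let trim = combined_trimmer(&["abcdefghijklmnopqrstuvwxyz", "äöüß"]);
    assert_eq!(trim("«weiß»"), "weiß");
    assert_eq!(trim("...cat!"), "cat");
    println!("combined trimmer ok");
}
```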

These all get combined into a pipeline, which is just a list of functions which each get run sequentially on each input token to produce the output token. The MultiLanguage language can take a number of languages as an argument and then combine them into one pipeline as above.
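The combination described above can be sketched in Rust. This is a minimal illustration of the pipeline idea, not the crate's actual API: a pipeline is a list of functions run in order on each token, where returning None drops the token (all names here are hypothetical):

```rust
use std::collections::HashSet;

// A pipeline stage: transforms a token, or drops it by returning None.
type PipelineFn = Box<dyn Fn(String) -> Option<String>>;

// Step 3 for multiple languages: merge every language's stop words
// into one HashSet and filter against the union.
fn stop_word_filter(words: &[&str]) -> PipelineFn {
    let stop: HashSet<String> = words.iter().map(|w| w.to_string()).collect();
    Box::new(move |tok| if stop.contains(&tok) { None } else { Some(tok) })
}

// Run every stage in order on every whitespace-separated token.
fn run_pipeline(fns: &[PipelineFn], text: &str) -> Vec<String> {
    text.split_whitespace()                  // step 1: whitespace tokenizer
        .map(|t| t.to_lowercase())
        .filter_map(|t| fns.iter().try_fold(t, |tok, f| f(tok)))
        .collect()
}

fn main() {
    // English plus (hypothetical) German stop words, unioned the way
    // lunr.multi.js would; a real pipeline would also chain the trimmer
    // and each language's stemmer as further PipelineFn stages.
    let stops: Vec<&str> = ["the", "and"]
        .iter()
        .chain(["der", "und"].iter())
        .copied()
        .collect();
    let pipeline = vec![stop_word_filter(&stops)];
    assert_eq!(run_pipeline(&pipeline, "The cat und der Hund"), vec!["cat", "hund"]);
    println!("pipeline ok");
}
```

Note that the stop-word union and the trimmer union collapse into a single stage each, while stemmers must remain separate stages run one after another, as described above.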

mexus commented 3 years ago

Thanks a lot! Everything seems to be clear :)

vitvakatu commented 3 years ago

Thank you for such a thorough answer, @mattico!

I've managed to implement support for Russian and English languages. Unfortunately, I neither made a universal solution for all possible combinations of languages nor covered it with tests.

I hope I will find some spare time in the near future to implement universal support properly and send a PR.

Btw, I've also encountered a weird issue with IndexBuilder: for some reason, using IndexBuilder instead of Index::new gave me different results, despite identical parameters. I can't say whether it is an issue with IndexBuilder itself or with our overall setup. The issue was fixed by replacing BTreeSet with Vec in IndexBuilder, so perhaps the order of fields affects the generated index. I'll create an issue if my further investigation shows something.
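For anyone hitting the same symptom, one plausible mechanism is easy to demonstrate (this is a general property of the standard collections, not a confirmed diagnosis of IndexBuilder): BTreeSet iterates in sorted order rather than insertion order, so swapping it for a Vec changes the order in which fields would be visited when building an index.

```rust
use std::collections::BTreeSet;

fn main() {
    // Field names added in a deliberate, non-alphabetical order.
    let fields = ["title", "body", "author"];

    // A Vec preserves insertion order...
    let as_vec: Vec<&str> = fields.to_vec();
    assert_eq!(as_vec, vec!["title", "body", "author"]);

    // ...while a BTreeSet silently re-sorts its elements, so any code
    // that iterates the set processes fields in alphabetical order.
    let as_set: BTreeSet<&str> = fields.iter().copied().collect();
    let set_order: Vec<&str> = as_set.into_iter().collect();
    assert_eq!(set_order, vec!["author", "body", "title"]);

    println!("Vec order != BTreeSet order");
}
```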