olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

This looks like I cannot find my Russian static content. #42

Closed Serhioromano closed 11 years ago

Serhioromano commented 11 years ago

I have a feed like this:

var docs = [
    {
        "id"      : "http://dreamand.me/ru/emerald/developer/s",
        "title"   : "Новое действие",   // "New action"
        "content" : "Дествие это то что сработает после успешной активации...." // "An action is what fires after successful activation..."
    },
    {
        "id"      : "http://dreamand.me/ru/cobalt/optimize-links",
        "title"   : "Оптимизация ссылок",   // "Link optimization"
        "content" : "Очень часто получается так, что URLы...." // "Very often it turns out that URLs..."
    }
]

and then

var idx = lunr(function () {
    this.field('title', 10);
    this.field('content');
})

for(var index in docs) {
    idx.add(docs[index]);
}

It searches well for posts in English but not in Russian. Is there anything I can do to fix it?

Serhioromano commented 11 years ago

I have been able to make it work. When I comment out lines 210-212:

 return str
    .split(/\s+/)
    /*.map(function (token) {
      return token.replace(/^\W+/, '').replace(/\W+$/, '').toLowerCase()
    })*/

Then I can search Russian. It is not smart about word endings, but it works well enough.

What could I replace \W+ with to support other UTF characters?

Serhioromano commented 11 years ago

OK. My little investigation ended up on this page:

http://xregexp.com/plugins/

This is what I think has to be used instead of \W+. But I understand it adds weight to the code, so maybe make it optional - some parameter to turn on multilingual support?
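
For illustration, something like this could replace that trimming step - a rough sketch only, assuming XRegExp and its Unicode plugins are loaded:

// Trim leading/trailing characters that are neither letters (\p{L}) nor digits (\p{N}).
// The Unicode categories come from XRegExp's Unicode plugins.
var leading  = XRegExp('^[^\\p{L}\\p{N}]+');
var trailing = XRegExp('[^\\p{L}\\p{N}]+$');

return str
    .split(/\s+/)
    .map(function (token) {
        return token.replace(leading, '').replace(trailing, '').toLowerCase();
    })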

olivernn commented 11 years ago

Thanks for opening this issue, it looks like you've found a workaround but I thought it might be worth letting you know what is happening.

For the most part lunr is language agnostic; there are a few areas that are English-specific though, one of which you have encountered.

  1. lunr.stemmer - by default lunr includes an English-specific stemmer; it might actually be worth removing it in your case, as I doubt it will work very well with Russian.
  2. lunr.stopWordFilter - this makes sure that very common words don't end up in the index (for performance and size reasons); again, by default lunr includes an English stop word list, which is probably not much use with Russian.
  3. lunr.tokenizer - this is actually the part of the code that you have patched; it is what splits text into tokens, and while not specific to English, it seems not to work well with languages using non-Roman characters.

By default both the stemmer and the stopWordFilter are added to a pipeline of text processors; you can either remove them manually:

var idx = lunr(function () {
    this.field('title')
    this.field('body')
    this.ref('id')
})

idx.pipeline.remove(lunr.stemmer)
idx.pipeline.remove(lunr.stopWordFilter)

Or you can create your index manually:

var idx = new lunr.Index
idx.field('title')
idx.field('body')
idx.ref('id')

Making lunr work with languages other than English is a feature I want to add; there is already an issue with discussion about this at #16, so feel free to add any ideas etc. there.

I'm thinking of having a separate repo with language-specific extensions to lunr, e.g. French, Spanish, Russian etc., that you would use to get stemmers and stop word filters for languages other than English. I'd be interested in your opinions though.

Serhioromano commented 11 years ago

My site uses both English and Russian. You can check how lunr works there:

http://mintjoomla.github.io/

This is a Jekyll static site and I have to say I am very impressed by how fast lunr is at indexing and searching. For Russian, copy and paste поле ("field") and press any non-character keyboard key.

> I'm thinking of having a separate repo with language-specific extensions to lunr, e.g. French, Spanish, Russian etc., that you would use to get stemmers and stop word filters for languages other than English. I'd be interested in your opinions though.

This would be ideal. But there is also a quicker solution: a quick fix that at least allows searching other languages immediately, maybe not as accurately as it could be, but at least it works.

This is the workaround I currently use:

// Requires XRegExp with its Unicode plugins loaded.
var preg = XRegExp("^[0-9_\\p{L}]+$");

// Replace common punctuation with spaces before splitting.
str = str.replace(/[\.,\-\(\)\:\;\"\'\?\!]+/gi, ' ');

return str
  .split(/\s+/)
  .map(function (token) {
    // Drop very short tokens and anything that is not digits, underscores or letters.
    if (token.length < 3) {
        return '';
    }
    return (preg.test(token) ? token : '').toLowerCase();
  })

This goes in lunr.tokenizer. Of course it requires 2 extra JS files, but those can be combined into one and are only 15 kB in size.

Of course there is another way around it, without extra files: just change \W+ to something that understands other, non-Latin characters.
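
For example, a rough sketch using only a hard-coded Russian character range (other scripts would need their own ranges, or a Unicode-aware library like XRegExp):

// Trim leading/trailing characters that are neither word characters nor Cyrillic letters.
// The range а-яА-ЯЁё covers the Russian alphabet only.
return token
    .replace(/^[^\wа-яА-ЯЁё]+/, '')
    .replace(/[^\wа-яА-ЯЁё]+$/, '')
    .toLowerCase();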

Anyway, if you start language adapters I'll contribute to the Russian one. Just create an adapter template so I can see how it is organized.

Another important thing: when you design your language adapter system, there has to be an option to connect several adapters, which should be automatically merged.

olivernn commented 11 years ago

@Serhioromano I've put together a very basic implementation of a Russian language extension to lunr, available at - https://github.com/olivernn/lunr.ru.js

I'd be very appreciative if you would take a look and try it out. Obviously I'm not a Russian speaker so I'd be grateful for any tips on the words that are included in the stopWordList as well as some testing with some Russian documents. If you want I can give you commit access to fix things as you see fit.

There are some changes that need to happen in lunr before this is production ready, specifically fixing the problem you came across in this issue. The next version of lunr will have a simpler tokenizer that just splits a string into an array of tokens, the trimming of leading and trailing punctuation will be moved into a separate pipeline function. I know how this will work for English but would appreciate some help in getting a Russian version together too.
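
Purely as a sketch of that split, and not the actual implementation, the tokenizer would only split on whitespace and a language-specific trimmer would live in the pipeline:

// Sketch only: the tokenizer just splits on whitespace...
var tokenizer = function (str) {
    return str.split(/\s+/)
}

// ...and a Russian-aware trimmer strips leading/trailing non-letter characters.
var ruTrimmer = function (token) {
    return token.replace(/^[^\wа-яА-ЯЁё]+/, '').replace(/[^\wа-яА-ЯЁё]+$/, '')
}

// idx here is an existing index built as above.
idx.pipeline.add(ruTrimmer)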

Serhioromano commented 11 years ago

@olivernn! You have done a great job. I reviewed the files and this is not basic, it is actually very advanced. Nothing to add really! You rock!

My problem is also that the index includes both English and Russian. I need to be able to somehow connect both languages, so that when I enter Russian it searches the Russian content and when I enter English it searches the English content. Is this possible?

olivernn commented 11 years ago

I'm really glad the Russian adapter looks good!

I think being able to handle mixed languages in the same index is actually quite difficult. This is probably something that needs to be solved outside of lunr to be honest. Perhaps having two indexes, one with your English content, and another with your Russian content. That way you get to take full advantage of the language features of lunr specific to each language.

When performing a search you could either explicitly get people to search in either English or Russian (perhaps an option on the search form?) or even better, try and detect which language the search query is in and then route the query to the corresponding index. I'm guessing that English and Russian are fairly easy to differentiate between, but perhaps something like https://github.com/tomayac/language-identifier or https://github.com/jaukia/cld-js might be useful (disclaimer: I've not used either…)
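
As a minimal sketch of that routing idea, where enIdx and ruIdx are placeholder names for two separately built indexes:

// Route the query to the Russian index if it contains any Cyrillic characters,
// otherwise to the English index.
function search(query) {
    var isCyrillic = /[\u0400-\u04FF]/.test(query)
    return isCyrillic ? ruIdx.search(query) : enIdx.search(query)
}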

Let me know how you get along with this, and if there is anything else I can do to help.

Serhioromano commented 11 years ago

Unfortunately I do not plan to change anything on my site right now. It is working with my hack almost as expected.

I think the best way for me is to add Russian stop words and endings to the English stemmer and stop words, because creating a search like you describe would take too much time given my tight schedule. Besides, I have a documentation site where people may often start typing an English definition word and follow it with Russian, i.e. mixed-language input.

I think there should be a mechanism that automatically extends English with any new language added, something like:

lunr.addLanguage('ru-RU')

And it would just extend the default stemmer and stop words. Then I would not have to care about anything else; everything should just work.
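
I understand nothing like lunr.addLanguage exists yet; as a very rough sketch of the merging idea, a combined stop word filter could be added to the pipeline by hand (the word lists below are only placeholders):

// Sketch: drop a token if it is in either the English or the Russian stop word list.
// These lists are only examples, not the real ones.
var enStopWords = ['the', 'and', 'of']
var ruStopWords = ['и', 'в', 'не']
var allStopWords = enStopWords.concat(ruStopWords)

var combinedStopWordFilter = function (token) {
    // returning nothing drops the token from the pipeline output
    if (allStopWords.indexOf(token) === -1) return token
}

idx.pipeline.add(combinedStopWordFilter)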

Serhioromano commented 8 years ago

@olivernn I have returned to this subject. I am creating another static site with search.

Unfortunately I do not remember much at all from back then.

Has anything changed? The latest lunr still does not support non-Latin characters. I was going to patch it as I described here, but the lunr tokenizer function has changed and I do not know where to start.