olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Custom Tokenizer #122

Closed mobinseven closed 9 years ago

mobinseven commented 9 years ago

How can I use a custom tokenizer method written for another language? I have a method which returns words like this: //دیوان //اشعار //شامل //غزلیات //قصیده //مثنوی //قطعات //رباعیات How can I integrate such a method with lunr?

olivernn commented 9 years ago

It is possible to use lunr.js with other languages. Take a look at https://github.com/MihaiValentin/lunr-languages for plugins providing support for other languages. Currently it does not have any support for Farsi.

There are two parts that make up a language extension: a stemmer and a stop word filter. By default lunr comes with an English stemmer and stop word filter. At a guess I'd say that these are being very confused by Farsi!

Ideally you would create a Farsi stemmer and stop word filter and put them together into a plugin. Take a look at some of the implementations in https://github.com/MihaiValentin/lunr-languages for some ideas on how to do this.
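
As a rough sketch of the two parts described above, the following shows a stop word filter and a placeholder stemmer written as lunr pipeline functions. It assumes pipeline functions receive plain string tokens and return either a token or undefined to drop it. The names farsiStopWordFilter and naiveFarsiStemmer, the six-word stop list, and the single suffix rule are all illustrative assumptions, not a real Farsi stemmer or a vetted word list.

```javascript
// Tiny illustrative stop word list (an assumption, not a vetted Farsi list).
const FARSI_STOP_WORDS = new Set(['و', 'در', 'به', 'از', 'که', 'این']);

// Pipeline-style function: return the token to keep it, undefined to drop it.
function farsiStopWordFilter(token) {
  return FARSI_STOP_WORDS.has(token) ? undefined : token;
}

// Placeholder "stemmer": strips only the plural suffix U+0647 U+0627 ("ha"),
// optionally preceded by a zero-width non-joiner (U+200C).
// A real stemmer needs far more rules than this.
function naiveFarsiStemmer(token) {
  const stemmed = token.replace(/\u200c?\u0647\u0627$/, '');
  // Never reduce a token to the empty string.
  return stemmed.length > 0 ? stemmed : token;
}

// Wire the functions into lunr only if lunr is loaded on the page.
if (typeof lunr !== 'undefined') {
  lunr.Pipeline.registerFunction(farsiStopWordFilter, 'farsiStopWordFilter');
  lunr.Pipeline.registerFunction(naiveFarsiStemmer, 'naiveFarsiStemmer');

  var idx = lunr(function () {
    this.pipeline.reset(); // drop the English stemmer and stop word filter
    this.pipeline.add(farsiStopWordFilter, naiveFarsiStemmer);
    this.field('fieldname');
  });
}
```

The stop word filter runs before the stemmer so that the stop list can be written with unstemmed words. The plugins in lunr-languages package the same two pieces per language and are the better reference for a complete implementation.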

Alternatively, you should be able to remove the English stemmer and stop word filter and hopefully get some better results.

var idx = lunr(function () {
 this.pipeline.reset()
 this.field('fieldname')
})
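
With the pipeline emptied, tokens are indexed exactly as the tokenizer emits them, so how the text is split matters more. Below is a minimal, self-contained sketch of a tokenizer that splits on whitespace and a few punctuation marks, including the Arabic-script comma, semicolon and question mark. The name tokenizeFarsi and the choice of separators are assumptions for illustration; how such a function is plugged into lunr depends on the lunr version and is not shown here.

```javascript
// Hypothetical tokenizer: splits on whitespace, Latin punctuation, and
// the Arabic-script comma (U+060C), semicolon (U+061B), question mark (U+061F).
// The zero-width non-joiner (U+200C) is deliberately NOT a separator,
// because it occurs inside Farsi words.
function tokenizeFarsi(text) {
  return String(text)
    .trim()
    .split(/[\s\u060C\u061B\u061F.,!?]+/)
    .filter(function (token) {
      return token.length > 0; // discard empty strings left by leading or trailing separators
    });
}
```

Calling tokenizeFarsi on a phrase built from the words in the question, separated by spaces and an Arabic-script comma, returns those words as an array of plain strings.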

Let me know if you manage to come up with a plugin for Farsi. If you do, it'd be great to get it added to the https://github.com/MihaiValentin/lunr-languages project for others to use as well.

mobinseven commented 9 years ago

I don't know how to generate those stemmers, and they didn't provide good documentation. Can you give me some advice?