olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.91k stars, 546 forks

Make it easier to support other languages. #204

5amfung closed 8 years ago

5amfung commented 8 years ago

Can we have better instructions on adding support for a new language? I'm willing to help, but I don't know where to start.

I was testing out https://github.com/nolanlawson/pouchdb-quick-search and found that searching Korean and Chinese text didn't work because lunr.js doesn't support those languages. I think that, for these two languages, a simple substring match would work fine for me as a start. I probably don't need any sophisticated tokenization.

And then there's https://github.com/MihaiValentin/lunr-languages, which adds to my confusion. It doesn't have instructions on how to go about adding a new language, even though I want to contribute.

olivernn commented 8 years ago

Yes, the documentation in general is currently quite poor; this is something I need to fix...

A plugin that adds language support generally does the following:

  1. Removes the existing English-specific text-processing functions from an index's pipeline
  2. Adds a language-specific stop word filter
  3. Adds a language-specific stemmer
  4. Modifies the tokenizer to handle non-ASCII characters
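
The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not an actual language plugin: `koreanSupport`, `stopWordFilter`, `stemmer`, `tokenizer` and the two-entry `STOP_WORDS` list are hypothetical names and data made up for this example. It assumes that tokens reach pipeline functions as plain strings and that a function returning `undefined` drops the token; the only lunr calls it relies on are `pipeline.reset()` and `pipeline.add()`.

```javascript
// Hypothetical stop word list, for illustration only.
// A real list would be much longer and chosen by a native speaker.
var STOP_WORDS = { "그리고": true, "그러나": true };

// Step 2: a language-specific stop word filter.
// Assumption: tokens are plain strings, and returning undefined drops the token.
function stopWordFilter(token) {
  if (!Object.prototype.hasOwnProperty.call(STOP_WORDS, token)) {
    return token;
  }
}

// Step 3: a language-specific stemmer.
// Placeholder only: it returns the token unchanged. A real stemmer would
// reduce inflected forms to a common stem.
function stemmer(token) {
  return token;
}

// Step 4: a tokenizer that does not assume ASCII input.
// It splits on whitespace and a small set of punctuation marks instead of
// relying on \W, which would treat every non-ASCII character as a separator.
function tokenizer(text) {
  if (text === null || text === undefined) {
    return [];
  }
  return String(text)
    .toLowerCase()
    .split(/[\s\-.,!?;:()"']+/)
    .filter(function (t) { return t.length > 0; });
}

// The plugin itself. It receives the index as its first argument.
function koreanSupport(idx) {
  // Step 1: remove the existing English-specific pipeline functions.
  idx.pipeline.reset();

  // Steps 2 and 3: add the language-specific replacements.
  idx.pipeline.add(stopWordFilter);
  idx.pipeline.add(stemmer);

  // Step 4 is not wired in here: how a custom tokenizer is attached to an
  // index depends on the lunr version in use.
}
```

A plugin shaped like this would typically be applied with `idx.use(koreanSupport)` while the index is being defined. If the index is going to be serialised, the new pipeline functions may also need to be registered with `lunr.Pipeline.registerFunction`.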

A quick search turned up the following resources that may help with the stop word filter, stemmer and tokenizer:

My Korean is sadly non-existent, so I'm not going to be much help in building any of these pieces, but if you have any specific questions about how to incorporate them into lunr, or what they're supposed to do, I can give you some answers.

olivernn commented 8 years ago

I'm going to close this issue. I'm not the right person to provide a Korean language adaptor, though if someone else feels up for it I'd love to know.

If you have any more questions, feel free to re-open. Alternatively, you can ask a question on Stack Overflow with the lunrjs tag or ask for help on Gitter.