pipedown / noise

Nested Object Inverted Search Engine
https://noisesearch.org/
Apache License 2.0

Support stemmer-languages other than english #64

Open OSHistory opened 6 years ago

OSHistory commented 6 years ago

Following discussion from https://github.com/pipedown/noise/issues/30

The wrapper around the Snowball stemmer is hard-coded to English. Considerable flexibility could be gained by letting users specify any other language that the Snowball project supports natively. English should of course remain the default.

Off the top of my head, something along the lines of:

let index = noise.open("myindex", true, { "lang": "german" });

Most use cases should operate on a single language, so multi-language support shouldn't be an issue.
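On the Rust side, the proposal above might look roughly like the following. This is only a sketch under assumptions: the `StemLanguage` enum and the `resolve_language` function are hypothetical names, not part of noise's actual API, and the four languages listed are just a small subset of what Snowball supports. It shows how a "lang" option could be parsed while keeping English as the default.

```rust
use std::str::FromStr;

/// Hypothetical subset of the languages Snowball supports.
/// This type is an illustration, not part of noise itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StemLanguage {
    English,
    German,
    French,
    Spanish,
}

impl Default for StemLanguage {
    /// English stays the default, as proposed in the issue.
    fn default() -> Self {
        StemLanguage::English
    }
}

impl FromStr for StemLanguage {
    type Err = String;

    /// Parse a language name case-insensitively, ignoring surrounding whitespace.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name.trim().to_ascii_lowercase().as_str() {
            "english" => Ok(StemLanguage::English),
            "german" => Ok(StemLanguage::German),
            "french" => Ok(StemLanguage::French),
            "spanish" => Ok(StemLanguage::Spanish),
            other => Err(format!("unsupported stemmer language: {}", other)),
        }
    }
}

/// Resolve the optional "lang" setting (hypothetical option name taken from
/// the example above), falling back to English when it is absent.
pub fn resolve_language(lang: Option<&str>) -> Result<StemLanguage, String> {
    match lang {
        Some(name) => name.parse(),
        None => Ok(StemLanguage::default()),
    }
}
```

With something like this, the stemmer wrapper could pick its Snowball algorithm from the resolved value once when the index is opened, and an unknown language name would surface as an error rather than silently falling back to English.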

OSHistory commented 6 years ago

I looked into it and managed to build a German-stemming fork with two minor alterations: changing the language in src/stem.rs, and pointing the noise dependency of my node-noise fork at my fork of noise. Including the node-noise fork works as expected: text is stemmed with German rules, and fuzzy search works.

Anyone interested in trying the German fork can do so by including my GitHub fork as the noise-search dependency:

  "dependencies": {
    "noise-search": "git://github.com/OSHistory/node-noise"
  }

As expected, cargo test now fails on the stemming part.

Links to my forks:

node-noise fork
noise fork

vmx commented 6 years ago

Nice work! I'm glad you figured it out (the whole chain of Node + Rust + RocksDB is far from obvious). I hope to find some time to add this properly soon.