Open guirip opened 7 years ago
In a way Lunr is doing the right thing here, as there is no entry in the index for 'genéral', i.e. it thinks 'e' and 'é' are different characters (which technically they are) and so doesn't find anything.
There is a lunr-unicode-normalizer plugin which attempts to normalise characters by removing diacritical marks. It looks like it probably needs upgrading to support Lunr 2, but this shouldn't be too difficult, this guide should hopefully show what is required.
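The core of what such a plugin does can be sketched in a few lines using the standard `String.prototype.normalize`. This is a minimal illustration, not the plugin's actual code; the function name `foldDiacritics` is made up here.

```javascript
// Decompose each character (NFD), then strip the combining
// diacritical marks (U+0300–U+036F), so 'général' and 'general'
// reduce to the same term.
function foldDiacritics(str) {
  return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

console.log(foldDiacritics("général")); // "general"
```

A Lunr 2 plugin would wrap this in a pipeline function registered on both the indexing pipeline and the search pipeline, which is what the guide linked above walks through.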
Let me know if that solves the problem.
Hello
Thanks for the link!
I'm sorry, but I don't have time at all to dig in and migrate the plugin (I'm almost alone on a big app with an approaching deadline, and mostly won't get home before 10pm), but I copy/pasted most of it (with a link to the original source in the JSDoc) and it works like a charm. I use it to normalize input when building the indexes, and to normalize the user's input on search.
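The two-sided approach described above, folding both the indexed text and the raw query the same way, might look roughly like this (a sketch with illustrative names; the Lunr calls themselves are elided):

```javascript
// Same folding applied on both sides so they agree on terms.
function fold(str) {
  return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// At index-build time: fold document text before passing it
// to the builder's add() call.
const docs = [{ id: "1", text: "Entretien général" }];
const foldedDocs = docs.map((d) => ({ ...d, text: fold(d.text) }));

// At search time: fold the user's input before calling search().
const foldedQuery = fold("genéral"); // "general" on both sides now
```

The drawback of doing it manually like this, compared to a pipeline plugin, is that every call site has to remember to apply the folding.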
Just wanted to say that I made a npm compatible version of the lunr-unicode-normalizer and it seems to be working just fine. Should work in the browser as well, but I didn't try. https://github.com/nekdolan/lunr-unicode-normalizer It works the same as the language plugins.
@nekdolan Thanks for the tip
@nekdolan Nice work wrapping that up in an NPM module.
One modification you might be interested in is that, in Lunr 2.x at least, a tokenizer can be specified per index; this should mean that you no longer need to ~~monkey~~ freedom patch lunr.tokenizer.
@nekdolan I was unable to use your code, so I created a 2.x-compatible unicodeNormalizer: https://gist.github.com/bulby97/7bd05561be91151b38aee3a6204d3e77
@bulby97 I didn't realize that the wiki I used was using version 1 instead of 2. Thanks for the upgrade.
One quick pedantic comment: I don't think this plugin is correctly named. I think it does a transliteration (in this case, from Unicode to ASCII), not a normalization (which would be a conversion such as NFC to NFKD). 🤓
Yes, the proper name for it is "character folding": http://www.unicode.org/reports/tr30/tr30-4.html since "normalization" preserves (more or less...) the original glyphs.
Whoosh has very nice documentation: https://whoosh.readthedocs.io/en/latest/stemming.html#character-folding Lucene has a very fancy implementation: https://lucene.apache.org/core/9_6_0/analysis/icu/index.html
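The distinction drawn above is easy to demonstrate: Unicode normalization only changes the underlying code-point representation while preserving the visible glyph, whereas folding discards the accent entirely. A small illustration:

```javascript
// NFC: 'é' as a single composed code point (U+00E9).
// NFD: the same glyph decomposed into 'e' + U+0301 (combining acute).
const composed = "é".normalize("NFC");
const decomposed = "é".normalize("NFD");
console.log(decomposed.length); // 2, yet it still renders as "é"

// Character folding goes further and drops the mark itself:
const foldedChar = decomposed.replace(/[\u0300-\u036f]/g, "");
console.log(foldedChar); // "e"
```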
> @nekdolan I was unable to use your code, so I created a unicodeNormalizer 2.x compatible : https://gist.github.com/bulby97/7bd05561be91151b38aee3a6204d3e77
Sadly, this is now a 404. I'll do another one and put it on NPM shortly, as I need this (despite its weird and somewhat antiquated API, lunr.js appears to still be the best JavaScript search library out there).
Here you go: https://www.npmjs.com/package/lunr-folding
@dhdaines Are you able to provide a demo of how to use this? Every time I try to incorporate it into my project, my existing search function stops returning any results; I'm not super well versed in JavaScript, so I'm not sure exactly where I'm going wrong. Thanks in advance
> @dhdaines Are you able to provide a demo of how to use this?
Sorry for the delay! There's an example now at https://www.npmjs.com/package/lunr-folding - but due to some JavaScript weirdness that I don't understand at all, it's slightly wrong. You should be able to install:
```
npm install lunr lunr-folding
```
Then run:
```javascript
const lunr = require("lunr");
const folding = require("lunr-folding").default;
folding(lunr);

const idx = lunr(function () {
  this.ref("id");
  this.field("text");
  this.add({ id: "1", text: "Étape 1: Collecter des bobettes" });
  this.add({ id: "2", text: "Étape 2: ???" });
  this.add({ id: "3", text: "Étape 3: Profit" });
});

// The unaccented query still matches "Étape 3".
const results = idx.search("etape 3");
console.log(JSON.stringify(results[0]));
```
@dhdaines Thanks very much!
Hello Oliver, I hope you're doing fine.
I come back to you with this fiddle, where you can see that some data has accented characters, e.g. 'général'.
If the user searches for 'général', the matching entry is returned as expected.
If the user searches for 'genéral', no entry is returned.
How would you advise me to get this working? Is there a simpler way than removing accents both when building indexes and from user input?