Open guirip opened 7 years ago
In a way Lunr is doing the right thing here, as there is no entry in the index for 'genéral', i.e. it thinks 'e' and 'é' are different characters (which technically they are) and so doesn't find anything.
There is a lunr-unicode-normalizer plugin which attempts to normalise characters by removing diacritical marks. It looks like it probably needs upgrading to support Lunr 2, but this shouldn't be too difficult, this guide should hopefully show what is required.
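The core of what such a plugin does can be sketched in a few lines using the standard `String.prototype.normalize`. This is a minimal illustration, not the plugin's actual code; the function name `foldDiacritics` is made up here.

```javascript
// Decompose each character (NFD), then strip the combining
// diacritical marks (U+0300–U+036F), so 'général' and 'general'
// reduce to the same term.
function foldDiacritics(str) {
  return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

console.log(foldDiacritics("général")); // "general"
```

A Lunr 2 plugin would wrap this in a pipeline function registered on both the indexing pipeline and the search pipeline, which is what the guide linked above walks through.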
Let me know if that solves the problem.
Hello
Thanks for the link!
I'm sorry, but I don't have time at all to dig in and migrate the plugin (I'm almost alone on a big app with an approaching deadline, and mostly won't get home before 10pm), but I copy/pasted most of it (with a link to the original source in the JSDoc) and it works like a charm. I use it to normalize input when building the indexes, and to normalize the user's input on search.
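The two-sided approach described above, folding both the indexed text and the raw query the same way, might look roughly like this (a sketch with illustrative names; the Lunr calls themselves are elided):

```javascript
// Same folding applied on both sides so they agree on terms.
function fold(str) {
  return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// At index-build time: fold document text before passing it
// to the builder's add() call.
const docs = [{ id: "1", text: "Entretien général" }];
const foldedDocs = docs.map((d) => ({ ...d, text: fold(d.text) }));

// At search time: fold the user's input before calling search().
const foldedQuery = fold("genéral"); // "general" on both sides now
```

The drawback of doing it manually like this, compared to a pipeline plugin, is that every call site has to remember to apply the folding.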
Just wanted to say that I made a npm compatible version of the lunr-unicode-normalizer and it seems to be working just fine. Should work in the browser as well, but I didn't try. https://github.com/nekdolan/lunr-unicode-normalizer It works the same as the language plugins.
@nekdolan Thanks for the tip
@nekdolan Nice work wrapping that up in an NPM module.
One modification you might be interested in is that, in Lunr 2.x at least, a tokenizer can be specified per index; this should mean that you no longer need to ~~monkey~~ freedom patch lunr.tokenizer.
@nekdolan I was unable to use your code, so I created a 2.x-compatible unicodeNormalizer: https://gist.github.com/bulby97/7bd05561be91151b38aee3a6204d3e77
@bulby97 I didn't realize that the wiki I used was using version 1 instead of 2. Thanks for the upgrade.
One quick pedantic comment: I don't think this plugin is correctly named. I think it does a transliteration (in this case, from Unicode to ASCII), not a normalization (which would be a conversion such as NFC to NFKD). 🤓
Yes, the proper name for it is "character folding": http://www.unicode.org/reports/tr30/tr30-4.html since "normalization" preserves (more or less...) the original glyphs.
Whoosh has very nice documentation: https://whoosh.readthedocs.io/en/latest/stemming.html#character-folding Lucene has a very fancy implementation: https://lucene.apache.org/core/9_6_0/analysis/icu/index.html
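The distinction drawn above is easy to demonstrate: Unicode normalization only changes the underlying code-point representation while preserving the visible glyph, whereas folding discards the accent entirely. A small illustration:

```javascript
// NFC: 'é' as a single composed code point (U+00E9).
// NFD: the same glyph decomposed into 'e' + U+0301 (combining acute).
const composed = "é".normalize("NFC");
const decomposed = "é".normalize("NFD");
console.log(decomposed.length); // 2, yet it still renders as "é"

// Character folding goes further and drops the mark itself:
const foldedChar = decomposed.replace(/[\u0300-\u036f]/g, "");
console.log(foldedChar); // "e"
```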
> @nekdolan I was unable to use your code, so I created a unicodeNormalizer 2.x compatible : https://gist.github.com/bulby97/7bd05561be91151b38aee3a6204d3e77
Sadly, this is now a 404. I'll do another one and put it on NPM shortly, as I need this (despite its weird and somewhat antiquated API, lunr.js appears to still be the best JavaScript search library out there).
Here you go: https://www.npmjs.com/package/lunr-folding
@dhdaines Are you able to provide a demo of how to use this? Every time I try to incorporate it into my project, my existing search function stops returning any results; I'm not super well versed in JavaScript, so I'm not sure exactly where I'm going wrong. Thanks in advance
> @dhdaines Are you able to provide a demo of how to use this?
Sorry for the delay! There's an example now at https://www.npmjs.com/package/lunr-folding - but due to some JavaScript weirdness that I don't understand at all, it's slightly wrong. You should be able to install:
```
npm install lunr lunr-folding
```
Then run:
```javascript
const lunr = require("lunr");
const folding = require("lunr-folding").default;
folding(lunr);

const idx = lunr(function () {
  this.ref("id");
  this.field("text");
  this.add({ id: "1", text: "Étape 1: Collecter des bobettes" });
  this.add({ id: "2", text: "Étape 2: ???" });
  this.add({ id: "3", text: "Étape 3: Profit" });
});

// The unaccented query still matches "Étape 3".
const results = idx.search("etape 3");
console.log(JSON.stringify(results[0]));
```
@dhdaines Thanks very much!
Hello Oliver, I hope you're doing fine.
I come back to you with this fiddle, where you can see that some data has accented characters, e.g. 'général'.
If the user searches for 'général', the matching entry is returned as expected.
If the user searches for 'genéral', no entry is returned.
How would you advise me to get this working? Is there a simpler way than removing accents both when building indexes and from user input?