olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.89k stars 545 forks source link

Normalize unicode to avoid index corruption; fixes issue #279 #359

Closed coreyward closed 6 years ago

coreyward commented 6 years ago

I know there's interest in rebuilding the tokenizer to resolve the issues with the trimmer that cause this bug, but for a short-term fix this does the trick. Would be great to make this available now to provide immediate relief so people don't have to switch to another library.

olivernn commented 6 years ago

Would normalising all tokens entering the index at build time and query time also work? That way this could be packaged as a pipeline function rather than requiring a patch in the index directly?

coreyward commented 6 years ago

Not sure. What’s the rationale for making it a pipeline function?

I’d be happy to test in an app where it’s failing without this change if you put together a branch with it done this way. On Jul 19, 2018, 1:37 PM -0500, Oliver Nightingale , wrote:

Would normalising all tokens entering the index at build time and query time also work? That way this could be packaged as a pipeline function rather than requiring a patch in the index directly? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

lucaong commented 6 years ago

from my experiment, doing this in a pipeline function (in pipeline and searchPipeline) does not seem to work: https://jsfiddle.net/egLzL24L/19/

lucaong commented 6 years ago

on the other hand, it does not really matter what you do with the expandedTerm string, normalizing is not actually needed. Just calling any method defined on String.prototype (even without actually using the result) fixes it:

https://jsfiddle.net/egLzL24L/40/ (see the .charAt(0) on line 17)

Interestingly, calling methods higher in the prototype chain (like on Object.prototype), does not work. It seems like Safari "casts" the string to its correct representation only when a string method is first called on it. Very weird corner case.

olivernn commented 6 years ago

I think going with @lucaong's approach is probably the right way to work around this for now, lets continue discussion in #361