lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License
4.64k stars · 133 forks

Search terms are broken when immediately following unicode whitespace #249

Closed · th317erd closed this issue 5 months ago

th317erd commented 6 months ago

I have found that any term immediately preceded by Unicode whitespace (not sure if it is all Unicode whitespace characters, or just certain ones) will never be found. I have tried every possible search option, and I cannot make it work.

Help!

Example (using zero width space: https://unicode-explorer.com/c/200B):

let items = [
  {
    id: 'broken-item',
    title: '\u200BHello world!',
    //          ^ zero width space
  },
];

let searchEngine = new MiniSearch({ fields: [ 'title' ], searchOptions: { prefix: true, fuzzy: 0.2 } });
searchEngine.addAll(items);

// Will never be found! :'(
let found = searchEngine.search('hello');
// result: []

// Works fine when not preceded by Unicode white space
found = searchEngine.search('world');
// result: [ { id: 'broken-item', ... } ]
lucaong commented 6 months ago

Hi @th317erd , this happens because the default tokenizer splits by space or punctuation and, strangely, the Unicode zero-width space is not considered a space (it belongs to the Other, Format category as opposed to the Separator, Space category as other space characters).
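
The category difference is easy to verify with JavaScript's Unicode property escapes (a standalone check, not MiniSearch code):

```javascript
// U+200B ZERO WIDTH SPACE belongs to category Cf (Other, Format),
// not Zs (Separator, Space), so \p{Z} does not match it.
const zwsp = '\u200B';

console.log(/\p{Z}/u.test(zwsp));  // false — not a Separator
console.log(/\p{Cf}/u.test(zwsp)); // true  — it is a Format character
console.log(/\p{Z}/u.test(' '));   // true  — a plain space is a Separator
```

This is why a tokenizer regexp built on `\p{Z}\p{P}` passes the zero-width space through untouched.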

It could be fixed by specifying a custom tokenizer:

const SPACE_OR_PUNCTUATION_INCLUDING_ZERO_WIDTH = /[\n\r\p{Z}\p{P}\u200B-\u200D\uFEFF]/u

const searchEngine = new MiniSearch({
    fields: [ 'title' ],
    searchOptions: { prefix: true, fuzzy: 0.2 },
    tokenize: (text) => text.split(SPACE_OR_PUNCTUATION_INCLUDING_ZERO_WIDTH)
})

let found = searchEngine.search('hello') // This should now work

Alternatively, specify a custom processTerm to remove such characters (this even looks cleaner and more understandable):

const ZERO_WIDTH_SPACES = /[\u200B-\u200D\uFEFF]/ug

const searchEngine = new MiniSearch({
    fields: [ 'title' ],
    searchOptions: { prefix: true, fuzzy: 0.2 },
    processTerm: (term) => term.toLowerCase().replace(ZERO_WIDTH_SPACES, '')
})

I will probably add these characters to the default regexp, as it makes sense to consider them spaces. That said, if you do expect Unicode characters that should be ignored to appear in the text, it's better to apply some normalization to remove them, as shown in the second example above using a custom processTerm. By default, MiniSearch only normalizes case of terms, as normalization is very use-case dependent (for example, with German text one might want to replace umlauts with their longer spelling, like über -> ueber).

lucaong commented 6 months ago

I will soon merge #250 to fix this, and create a new release. In the meantime, please use the suggestions above to work around this issue.

Thanks again for reporting this issue!

th317erd commented 6 months ago

Thank you for the information and helpful examples @lucaong!

I noticed the tokenizer and processTerm options, but the documentation on them is very poor, and I couldn't understand how to use them.

Would it be possible for me to submit a PR to update the documentation?

I don't think I understand either even now. The tokenizer seems simple enough. It just takes the entire "source" text (from each field I assume) and splits it into tokens, yes?

processTerm however still confuses me. Your example just made it more confusing. For example:

processTerm: (term) => term.toLowerCase().replace(ZERO_WIDTH_SPACES, '')

What is term here? If I understand the tokenizer correctly, a term should be one of the split parts, a "token" if you will, yes? So, what is term in the context of processTerm, and if it is already a "token", then how would replace(ZERO_WIDTH_SPACES, '') work? Are you saying that the default tokenizer is splitting things such that the zero-width space remains part of the first "Hello" term? As in, '\u200BHello' is the first token/term?

As you might be able to see from my confusion, I would like to update the documentation if you could point me at the repo.

FYI: Just a thought, this might all be clearer to other developers if processTerm(term) were instead called processTokenIntoTerm(token)... or even just changing the argument in the documentation from term to token might make things easier to understand. In any case, I feel the documentation needs to be more descriptive and clearer regarding these two callbacks. Happy to help! Just point me in a direction!

lucaong commented 6 months ago

Hi @th317erd , I understand the confusion. Part of it stems from the confusing terminology used in the field of information retrieval, which uses the word "token" when discussing tokenization, and the word "term" when discussing indexing. I chose to use the word "term" everywhere (as it's more familiar to most users), but still refer to tokenization, as that is the appropriate name for that phase. Even in the documentation of the tokenize function, I never use the name "token", and only refer to tokenization as an action that splits text into individual terms.

I just saw that the documentation site does not include docs for the tokenize option, even though such documentation exists. That's because of a typo that prevents TypeDoc from picking up that section. I am fixing it.

Regarding the process that MiniSearch follows, you can think about it as a pipeline:

  1. The extractField function extracts the raw text of the field from the document
  2. The tokenize function splits the raw text into terms, which correspond roughly to words, but not exactly. This often involves some decision making that is dependent on the specific application. For example, in the text "Amy drank some Coca-Cola", should "Coca" and "Cola" be two different terms, or a single "Coca-Cola" term? In the first case, searching for "Cola" would find this result, but could also bring results containing "Cola" but not "Coca-Cola".
  3. Terms are then processed by the processTerm function, for example to normalize them. Think about normalizing case, stemming, replacing diacritics, removing unwanted characters, etc. The results are the terms that will be indexed.
  4. The terms are then added to the inverted index, pointing to the original document
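
The four steps above can be sketched in plain JavaScript. Note that the function names mirror the option names, but the implementation here is a simplified stand-in, not MiniSearch's actual internals:

```javascript
// A minimal sketch of the indexing pipeline, using simplified
// stand-ins for MiniSearch's defaults.
const extractField = (doc, field) => String(doc[field]);          // step 1
const tokenize = (text) =>
  text.split(/[\n\r\p{Z}\p{P}]/u).filter(Boolean);                // step 2
const processTerm = (term) => term.toLowerCase();                 // step 3

function indexDocument(index, doc, fields) {
  for (const field of fields) {
    const raw = extractField(doc, field);
    for (const token of tokenize(raw)) {
      const term = processTerm(token);
      if (!term) continue;
      const ids = index.get(term) ?? new Set();                   // step 4:
      ids.add(doc.id);                                            // inverted index
      index.set(term, ids);
    }
  }
  return index;
}

const index = indexDocument(
  new Map(),
  { id: 1, title: 'Amy drank some Coca-Cola' },
  ['title']
);
console.log([...index.keys()]);
// [ 'amy', 'drank', 'some', 'coca', 'cola' ]
```

With this tokenizer, "Coca-Cola" is split on the hyphen into two terms, which illustrates the decision described in step 2.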

Search queries are also processed with the same tokenize and processTerm functions (unless some search-specific ones are provided, which is only needed in peculiar cases).

Spaces are normally removed by the tokenizer: the default tokenizer of MiniSearch splits by space or punctuation, so the spaces won't be part of the resulting term. Unfortunately, zero-width spaces, despite their name, are not considered to be spaces in Unicode, so the default tokenizer leaves them in. This causes your issue, because the term in the index is '\u200BHello', thus searching for 'Hello' without the zero-width space yields no result (fuzzy match would work, but exact match won't).

There are therefore two alternative ways to deal with zero-width spaces: either include such Unicode code points in the tokenizer regexp or, alternatively, normalize the terms in processTerm to remove zero-width spaces before indexing. The second approach is better, because it will also remove zero-width spaces within terms (like 'He\u200Bllo'). In other words, getting rid of zero-width spaces and other characters that should be ignored is better done as a normalization, in the processTerm function.
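
The difference between the two approaches can be seen on a term with an embedded zero-width space (using the same regexps as in the examples above):

```javascript
const ZERO_WIDTH_SPACES = /[\u200B-\u200D\uFEFF]/gu;
const SPLITTER = /[\n\r\p{Z}\p{P}\u200B-\u200D\uFEFF]/u;

const text = 'He\u200Bllo';

// Tokenizer approach: the zero-width space acts as a separator,
// so the word is wrongly split into two terms.
console.log(text.split(SPLITTER)); // [ 'He', 'llo' ]

// processTerm approach: the character is stripped, keeping the word intact.
console.log(text.replace(ZERO_WIDTH_SPACES, '').toLowerCase()); // 'hello'
```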

I would not rename options at this point, as that would be a breaking change for a large number of users. Improving the documentation is always welcome, and I will fix the issue preventing the documentation of the tokenize option from appearing in the site. Documentation is generated from code comments in this same repo, using TypeDoc.

th317erd commented 6 months ago

Thank you much again @lucaong!

Heavens no! I wasn't recommending you modify the API interface! That would indeed be a very large breaking change for very little value.

I was just suggesting some possible "visual cues" that might help people understand the documentation. However, you provided it entirely:

> Regarding the process that MiniSearch follows, you can think about it as a pipeline:
>
>   1. The extractField function extracts the raw text of the field from the document
>   2. The tokenize function splits the raw text into terms, which correspond roughly to words, but not exactly. This often involves some decision making that is dependent on the specific application. For example, in the text "Amy drank some Coca-Cola", should "Coca" and "Cola" be two different terms, or a single "Coca-Cola" term? In the first case, searching for "Cola" would find this result, but could also bring results containing "Cola" but not "Coca-Cola".
>   3. Terms are then processed by the processTerm function, for example to normalize them. Think about normalizing case, stemming, replacing diacritics, removing unwanted characters, etc. The results are the terms that will be indexed.
>   4. The terms are then added to the inverted index, pointing to the original document

โ˜๐Ÿป This! This is exactly what the documentation needs! Can we please insert this blurb at the top of the searchOptions page?

You said all it takes to modify the documentation is to edit the comments in the source code? Would you approve of me adding this as a blurb to the top of the searchOptions documentation?

lucaong commented 5 months ago

@th317erd sure, documentation improvements are more than welcome 🙂 if you want, send a pull request, I am happy to review and merge it. Otherwise I will try to add a section about this soon.

Your issue already allowed me to detect and fix a problem preventing the documentation for tokenize from showing up, and we might end up improving the docs even further. Thank you!

th317erd commented 5 months ago

As promised: https://github.com/lucaong/minisearch/pull/252

Let me know if you'd like any changes!

Thank you for all the help!

lucaong commented 5 months ago

Awesome! PR approved, I will merge it. Thanks a lot!

th317erd commented 5 months ago

Thank you for all your help!

I am happy to be able to contribute! 😄