Closed sunknudsen closed 1 year ago
Hi @sunknudsen , Thanks for the kind words :)
Yes, the default tokenizer splits by space or punctuation, and the hyphen is considered punctuation. Therefore, `"foo-bar"` is tokenized as `["foo", "bar"]`. Applications can configure a custom tokenizer to change this behavior. For example, to split by `/[^\w-]/` one can do:
```javascript
const miniSearch = new MiniSearch({
  fields: [/* ...my fields */],
  tokenize: (text) => text.split(/[^\w-]/)
})
```
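Since the `tokenize` option is just a function of the text, its behavior can be checked with plain `String.prototype.split`, no MiniSearch instance required. With the regex above, hyphenated terms stay whole:

```javascript
// Same splitting logic as the tokenize option above, tested standalone:
// every character that is not a word character or a hyphen is a separator.
const tokenize = (text) => text.split(/[^\w-]/)

console.log(tokenize("foo-bar baz"))
// => ["foo-bar", "baz"]
```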
Consider that splitting by `/[^\w-]/` might work for simple English text, but it behaves badly with non-ASCII characters such as accents, umlauts, and other diacritics, which are treated as non-word characters even though they are very common in languages other than English.
For example:
```javascript
const tokenize = (text) => text.split(/[^\w-]/)

// This is fine:
tokenize("I'm drinking a coca-cola")
// => ["I", "m", "drinking", "a", "coca-cola"]

// But this is not:
tokenize("The tokenizer is too naïve")
// => ["The", "tokenizer", "is", "too", "na", "ve"]
```
Excellent points… I naively (pun intended) expected that `\w` included accented characters.
On modern browsers, a much simpler regular expression gets the job done: `/[\p{Z}\p{P}]/u`. I need to check compatibility and see whether it makes sense to use that instead of the huge explicit form you linked from the source. The original reason for the long regexp was browser compatibility.
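As a sketch of that direction, a Unicode-aware tokenizer that splits on anything that is neither a letter, a digit, nor a hyphen (an illustrative regex, not necessarily what MiniSearch will ship) handles both the hyphen and the diacritics cases from the examples above:

```javascript
// Unicode-aware tokenizer sketch: \p{L} matches any letter (including
// accented ones), \p{N} any number; the trailing hyphen in the class
// keeps hyphenated compounds whole. Requires the "u" flag.
const tokenize = (text) =>
  text.split(/[^\p{L}\p{N}-]+/u).filter((token) => token.length > 0)

console.log(tokenize("The tokenizer is too naïve"))
// => ["The", "tokenizer", "is", "too", "naïve"]

console.log(tokenize("I'm drinking a coca-cola"))
// => ["I", "m", "drinking", "a", "coca-cola"]
```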
See this transpiler from ES6 Unicode-aware regular expressions to ES5.
https://github.com/lucaong/minisearch/pull/198 translates the long regexp into an equivalent short Unicode-aware one. I will research the implications for older browsers, and it might need to ship as part of the next major release, but browser support looks essentially universal by now, assuming we no longer need to care about IE.
Thanks for bringing my attention to it :)
See https://github.com/lucaong/minisearch/blob/1eb584c749ead8209fe4f18132e3f3a693df395e/src/MiniSearch.ts#L1934
Curious… couldn’t we use `/[^\w-]+/g` instead? Btw, thanks for minisearch @lucaong! Very helpful package. 🙌
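(As a side note on that suggestion: the `g` flag has no effect on `String.prototype.split`, which always splits globally, but the `+` quantifier does change the result — without it, runs of consecutive separators produce empty strings in the output:)

```javascript
// Without "+", each separator character splits individually,
// leaving an empty string between adjacent separators:
console.log("foo, bar".split(/[^\w-]/))
// => ["foo", "", "bar"]

// With "+", a run of separators collapses into a single split point:
console.log("foo, bar".split(/[^\w-]+/))
// => ["foo", "bar"]
```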