Recommended setup for searching personal names

dimaqq commented 4 years ago

I'm trying to use lunr.js to search a list of users client-side. The data has personal names, user IDs, and emails; however, the names could be in several scripts and languages. I wonder what kind of setup I should use.

fex80 commented 4 years ago

Here are my thoughts:

As you're searching names, I would not use stemmers.
Without stemmers, the languages and scripts the names are in do not matter all that much. You can probably just use plain-vanilla lunr and remove the stemmer.
Adding some typo-friendliness would be nice, as names tend to have different spellings ("john miller" vs. "jon miller").
I would take special care to unify the different Unicode representations of the same visual character, so that the search turns up the expected results. Technically, applying some sort of Unicode-collapsing filter to both the index and the search query should do. It's a real rabbit hole, and I would stop at some point, but I think this is really important. Some links: When "Zoë" !== "Zoë", Some Latin and Cyrillic characters look the same, and a more general SO question.
To make search results meaningful, I'd probably boost the name field over the email and user-id fields.
As you seem to be based in Japan, special care is needed to split names and queries into words properly. The built-in tokenizer might not be ideal, but it's quite easy to swap out. The lunr-languages package seems to have Japanese support. It also has multi-language support, but again, as you are only indexing names, it's probably overkill (plus, stemming and language guessing might even reduce search result quality).
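A rough sketch of how the points above could fit together, assuming lunr 2.x. The names foldName, foldToken and buildUserIndex, the document fields id, name, email and userId, and the boost value of 10 are all made up for illustration. It only handles Unicode normalization (NFKC), not Latin/Cyrillic look-alikes or Japanese word splitting, so treat it as a starting point rather than a finished setup.

```javascript
// Collapse equivalent Unicode forms into one spelling. NFKC composes
// "e" + combining diaeresis into "ë" and maps full-width Latin letters
// to their ASCII counterparts. It does NOT unify Latin/Cyrillic look-alikes.
function foldName(text) {
  return text.normalize("NFKC").toLowerCase();
}

// `lunr` is passed in as a parameter so this sketch makes no assumption
// about how the library is loaded (script tag, bundler, require).
function buildUserIndex(lunr, users) {
  // Pipeline functions receive a lunr.Token; update() rewrites its string.
  const foldToken = (token) => token.update((str) => foldName(str));
  // Registering gives the function a label, which lunr wants for serialization.
  // Call this once; registering the same label again logs a warning.
  lunr.Pipeline.registerFunction(foldToken, "foldName");

  return lunr(function () {
    this.ref("id");
    this.field("name", { boost: 10 }); // rank name matches above the rest
    this.field("email");
    this.field("userId");

    // Drop the default trimmer, stop word filter and stemmer. As far as I
    // can tell the default trimmer is built around \W, which treats
    // non-ASCII letters as non-word characters, so it is removed as well.
    this.pipeline.reset();
    this.searchPipeline.reset();

    // Apply the same folding at index time and at query time.
    this.pipeline.add(foldToken);
    this.searchPipeline.add(foldToken);

    users.forEach((user) => this.add(user));
  });
}
```

With an index built this way, something like idx.search("jon~1") should give a bit of typo tolerance: the ~1 suffix asks lunr for terms within an edit distance of 1, so "jon" can also match "john".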

olivernn commented 4 years ago

Everything that @fex80 says seems to be excellent advice; there's not much more I can add. Let us know if you run into any difficulties.

olivernn / lunr.js

Recommended setup for searching personal names #452