Question - Typo's - Githubissues

oasiz commented 2 years ago

Hi, hope you are well!

Firstly, thanks for creating this - it's absolutely brilliant. I'm using it for a home jukebox system that I've been creating over the last couple of weeks and the MiniSearch works so quickly and flawlessly (searching 17k + tracks).

OK my question. How do I show results for (the artist) "P!nk" when "pink" is typed in?

I'd also like to show results for "Meat Loaf" when "meatloaf" is typed in (by mistake).

I can't imagine there'll be too many of these and suppose ideally I'd have an array of "alternatives" as such - just not sure how best to implement it?

Obviously I need to be careful not to mess up other results (think "Pink Floyd" etc).

The only partial solution I have found is to look out for those keywords ("pink", "meatloaf") and then add their real matches ("p!nk", "meat loaf") to the search string before attempting the search. It does work but doesn't feel quite right?

Hope that all makes sense!

Regards,

Rob

lucaong commented 2 years ago

Hi @oasiz , thanks for the kind words!

If you want your search to be robust to typos, usually the best strategy is to use fuzzy matching. The simplest way is to set the fuzzy search option to true:

miniSearch.search('pink', { fuzzy: true })

This will search for documents containing terms that match your search, also allowing for some typos. This will work for pink vs. p!nk.

The defaults are usually good enough, but if you prefer you can also tweak how much "fuzziness" you want to have. Quoting from the documentation about the fuzzy search option:

If a boolean is given, fuzzy search with a default fuzziness parameter is performed if true.

If a number higher or equal to 1 is given, fuzzy search is performed, with a maximum edit distance (Levenshtein) equal to the number.

If a number between 0 and 1 is given, fuzzy search is performed within a maximum edit distance corresponding to that fraction of the term length, approximated to the nearest integer. For example, 0.2 would mean an edit distance of 20% of the term length, so 1 character in a 5-characters term. The calculated fuzziness value is limited by the maxFuzzy option, to prevent slowdown for very long queries.
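To make the quoted rules concrete, here is a minimal sketch of how a maximum edit distance could be derived from the fuzzy option. The helper name maxEditDistance is hypothetical and not part of MiniSearch. The default fuzziness of 0.2 and the maxFuzzy cap of 6 are assumptions used for illustration, so check the documentation for the actual defaults.

```javascript
// Hypothetical helper, not part of the MiniSearch API: it only mirrors the
// rules quoted above for turning the `fuzzy` option into an edit distance.
// ASSUMPTION: defaultFuzzy (0.2) and maxFuzzy (6) are illustrative defaults.
function maxEditDistance(fuzzy, termLength, { defaultFuzzy = 0.2, maxFuzzy = 6 } = {}) {
  if (fuzzy === false || fuzzy == null) return 0 // fuzzy search disabled
  const value = fuzzy === true ? defaultFuzzy : fuzzy
  if (value >= 1) return value // a number >= 1 is used as the edit distance itself
  // A fraction of the term length, rounded to the nearest integer and capped
  return Math.min(maxFuzzy, Math.round(termLength * value))
}

console.log(maxEditDistance(0.2, 5))  // 1 (20% of a 5-character term)
console.log(maxEditDistance(2, 5))    // 2 (absolute edit distance)
console.log(maxEditDistance(0.2, 40)) // 6 (8 would be allowed, but the cap applies)
```

Under these assumptions, a 4-character query such as pink gets a maximum edit distance of 1 when fuzzy is set to true.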

Otherwise, for known alternative spellings, what you propose is also a good solution. That would work better for cases that are not just small typos, like meat loaf vs. meatloaf, or Rage Against the Machine vs. RATM. Usually, though, instead of adding them to the search, I add them as an additional field on the documents (it's OK if only some documents have such a field):

const documents = [
  { id: 1, artist: 'Rage Against the Machine', alternativeNames: 'RATM', title: 'Renegades of Funk' },
  { id: 2, artist: 'Red Hot Chili Peppers', alternativeNames: 'RHCP', title: "True Men Don't Kill Coyotes" },
  { id: 3, artist: 'Black Rebel Motorcycle Club', alternativeNames: 'BRMC', title: 'Love Burns' },
  { id: 4, artist: 'AC/DC', alternativeNames: 'ACDC', title: 'Jailbreak' },
  { id: 5, artist: 'Arctic Monkeys', title: 'Are U Mine' }
  // ...and so on
]

const miniSearch = new MiniSearch({
  fields: ['artist', 'alternativeNames', 'title'],
  searchOptions: { fuzzy: true }
})

// Index the documents (each one needs a unique ID, by default in the "id" field)
miniSearch.addAll(documents)

miniSearch.search('ratm') // this will return results for Rage Against the Machine
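If the alternatives are kept in a separate list, as suggested in the question, one possible way to feed them into the extra field is to merge a lookup table into the documents before indexing. The sketch below is only an illustration: the ALTERNATIVE_NAMES table, the withAlternativeNames helper, and the sample documents with their id values are all hypothetical.

```javascript
// Hypothetical lookup table of known alternative spellings, keyed by artist.
const ALTERNATIVE_NAMES = {
  'P!nk': 'pink',
  'Meat Loaf': 'meatloaf',
  'Rage Against the Machine': 'RATM'
}

// Returns copies of the documents, adding `alternativeNames` only where the
// table has an entry; documents without one are left without the field.
function withAlternativeNames(documents, table = ALTERNATIVE_NAMES) {
  return documents.map((doc) =>
    Object.prototype.hasOwnProperty.call(table, doc.artist)
      ? { ...doc, alternativeNames: table[doc.artist] }
      : { ...doc }
  )
}

// Sample documents for illustration only
const tracks = withAlternativeNames([
  { id: 1, artist: 'P!nk', title: 'So What' },
  { id: 2, artist: 'Pink Floyd', title: 'Time' }
])

console.log(tracks[0].alternativeNames) // 'pink'
console.log('alternativeNames' in tracks[1]) // false
```

The resulting array can then be indexed as usual, with alternativeNames listed among the fields. A search for pink would then return both P!nk and Pink Floyd tracks, so existing results are not lost.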

I tend to combine both strategies in many applications: one strategy is good for known alternative spellings, the other for the occasional typo.

I hope this helps.

oasiz commented 2 years ago

Hi, thanks so much for the detailed response, much appreciated!

I've actually tried adding the 'fuzzy': true parameter and it's not working for pink/p!nk - although I'll give it another go shortly and study the documentation a little more just in case I've missed something.

Rob

lucaong commented 2 years ago

Ah you are right, the issue with p!nk is that by default the tokenizer splits by space or punctuation, so p!nk is tokenized as ["p", "nk"]. The other approach discussed above would work better for this case. The fuzzy match would still help for typos like Areosmith vs Aerosmith.

It is possible to use a custom tokenizer that splits only by whitespace. It might or might not be a good choice for your case: on the plus side, names like p!nk or AC/DC would be kept as single tokens and would work with fuzzy search. On the flip side, fields or search queries that contain other punctuation could be tokenized incorrectly.
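The difference between the two tokenization strategies can be seen with plain string splitting. This is a rough sketch: the regular expression below only approximates a tokenizer that splits on whitespace or punctuation and is an assumption, not MiniSearch's actual implementation. The function names are hypothetical.

```javascript
// ASSUMPTION: this regular expression only approximates a tokenizer that
// splits on whitespace or punctuation; it is not MiniSearch's actual one.
const SPACE_OR_PUNCTUATION = /[\s\p{P}]+/u

// Split on runs of whitespace or punctuation, dropping empty tokens
const splitOnSpaceOrPunctuation = (text) =>
  text.split(SPACE_OR_PUNCTUATION).filter((token) => token.length > 0)

// Split on runs of whitespace only, dropping empty tokens
const splitOnWhitespace = (text) =>
  text.split(/\s+/).filter((token) => token.length > 0)

console.log(splitOnSpaceOrPunctuation('p!nk'))  // [ 'p', 'nk' ]
console.log(splitOnSpaceOrPunctuation('AC/DC')) // [ 'AC', 'DC' ]
console.log(splitOnWhitespace('p!nk'))          // [ 'p!nk' ]
console.log(splitOnWhitespace('AC/DC'))         // [ 'AC/DC' ]
```

With the first strategy, the indexed terms for p!nk are p and nk, so no single term is close to pink. With the second, the term p!nk is one substitution away from pink.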

If you are curious, this is how you can use a tokenizer that splits only by space:

const miniSearch = new MiniSearch({
  fields: [/* ...set your fields... */],
  tokenize: (text) => text.split(/\s+/) // Use a custom tokenizer that splits only by space
})

// ...add documents to the index...

miniSearch.search('pink', { fuzzy: true }) // This should now match p!nk
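To see why fuzzy matching helps with some of the examples in this thread but not others, it can be useful to compute the edit distances by hand. The function below is a textbook Levenshtein implementation written for illustration. It is not MiniSearch's internal code, and the name levenshtein is just a label used here.

```javascript
// Classic dynamic-programming Levenshtein distance (insertions, deletions,
// substitutions), written here for illustration; it is not MiniSearch's
// internal fuzzy-matching code.
function levenshtein(a, b) {
  // previous[j]: distance between the first i - 1 characters of a
  // and the first j characters of b
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const current = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      current[j] = Math.min(
        previous[j] + 1,        // deletion
        current[j - 1] + 1,     // insertion
        previous[j - 1] + cost  // substitution (or match)
      )
    }
    previous = current
  }
  return previous[b.length]
}

console.log(levenshtein('pink', 'p!nk'))           // 1
console.log(levenshtein('areosmith', 'aerosmith')) // 2
console.log(levenshtein('meatloaf', 'meat'))       // 4
```

The first two distances are small relative to the term length, which is the kind of typo fuzzy search is meant to absorb. The query meatloaf is four edits away from either indexed term of Meat Loaf, which is why an alternative-names field is the better fit for that case.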

oasiz commented 2 years ago

Thanks for confirming that @lucaong - I was convinced I'd managed to cause an error somehow!

Shall have a play around with the tokenizer.

All in all it now works great though, my searches for P!nk, Meat Loaf etc all work flawlessly - thanks for your support :-)

lucaong commented 2 years ago

You are welcome :) I will close the issue for now, but feel free to comment further if you need more information.

lucaong / minisearch

Question - Typo's #143