m31coding / fuzzy-search

A fast, accurate and multilingual fuzzy search library for the frontend.
MIT License
685 stars 12 forks source link

Prefix match is ranked lower than expected #2

Open MoonZ opened 7 months ago

MoonZ commented 7 months ago

Following the HN article, quickly tried the demo on the places set but came up with weird results:

image

How come a result that doesn't match the input string gets a better quality score than the one that includes the entry ? And how come another one not maching is having the same score just below it ? If it's due to length having a closer match, I then don't understand the advantage for making it real time ? (If I need to type my string almost entirely to make my targeted result come up in the visible list, then "time gained" by indeed good performances is already lost)

But anyway, performances are very good indeed, just trying to figure out the use case.

m31coding commented 7 months ago

Hi, thank you very much for reporting this observation. I think this is an interesting topic for many people. Carcasse and Carasso are good matches for the query carcasso because the words are similar. To be precise, the computed quality is the number of common n-grams between the query and the matched term, divided by the number of n-grams of the longer string.

Most fuzzy searchers are based on such string similarities, related heuristics, or well-defined string metrics. That being said, I can definitely see where you are coming from. A correctly typed prefix could potentially yield a higher-quality match than what is currently computed.

One solution that can be adopted is to implement a search controller that queries a fuzzy searcher as well as a suffix array searcher. The latter is good at finding prefix and suffix matches. The result matches of both can then be distinct and combined. However, I consider this approach to be out of scope for this library.

To get closer to your desired behaviour with the library as is, you could boost the qualities of prefix matches like so:

const query = new fuzzySearch.Query(queryString, 50, 0.2);
const result = searcher.getMatches(query);
for (const match of result.matches) {
  if (match.matchedString.toLowerCase().normalize('NFKD').startsWith(query.string.toLowerCase().normalize('NFKD'))) {
    match.quality = Math.pow(match.quality, 0.5);
  }
}
result.matches.sort((m1, m2) => m2.quality - m1.quality);

The lower the exponent, the more weight is given to prefix matches. With my choice of 0.5 Carcassonne is at rank 1 with a quality of 0.8 for the query carcasso. The quality of the other results don't change because their prefixes don't match.

I will try to find time in Q2 to investigate whether something like this can be implemented in the core of the searcher in a proper and performant way.