Enhancement proposal - be permissive about typos when searching

Hey,

This would be a nice feature indeed.

there is an existing mapping for common English misspelled words, ...

I don't think a hard-coded list will work, no. Fortunately, there are other solutions :)

We need to consider two things IMO: how to match "approximately", and when to match "approximately".

How

Fuzzy queries (which allow terms with one or two typos) are a thing, but I'd personally stay away from them, because:

They reach their limits quite fast, and then you have to switch to a completely different solution.
They are not available everywhere; e.g. I'm not sure we can use them "by default" in the simple query strings we're using right now in search.
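
To illustrate the first point, here is a minimal sketch, assuming "typo" means the usual edit distance (number of single-character insertions, deletions or substitutions). The class and method names are hypothetical and this is plain Levenshtein distance, not the actual implementation used by any search backend:

```java
final class EditDistance {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b: j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("aplication", "application"));
        System.out.println(levenshtein("aplicaton", "application"));
        System.out.println(levenshtein("aplcaton", "application"));
    }
}
```

With a cap of two edits, a query like "aplcaton" (three missing letters) would no longer match "application", which is the kind of limit mentioned above.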

A better approach is to have dedicated fields using an ngram analyzer, e.g. turn tokens into a list of 3-grams:

Searched: aplication => [apl, pli, lic, ica, cat, ati, tio, ion]
Indexed: application => [app, ppl, pli, lic, ica, cat, ati, tio, ion]
Common tokens: [pli, lic, ica, cat, ati, tio, ion]; that's enough to get a good score!
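
The 3-gram breakdown above can be reproduced with a few lines of plain Java. This is only a sketch of the idea, with hypothetical class and method names; in a real setup the analyzer configured on the index would produce the n-grams, not application code:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NgramOverlap {

    // Splits a token into overlapping character n-grams of the given size.
    static List<String> ngrams(String token, int size) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + size <= token.length(); i++) {
            grams.add(token.substring(i, i + size));
        }
        return grams;
    }

    // Returns the distinct n-grams present in both tokens, in the order of the searched token.
    static Set<String> common(String searched, String indexed, int size) {
        Set<String> shared = new LinkedHashSet<>(ngrams(searched, size));
        shared.retainAll(ngrams(indexed, size));
        return shared;
    }

    public static void main(String[] args) {
        System.out.println("Searched: " + ngrams("aplication", 3));
        System.out.println("Indexed:  " + ngrams("application", 3));
        System.out.println("Common:   " + common("aplication", "application", 3));
    }
}
```

Most of the searched word's 3-grams also appear in the indexed word, so a single missing letter only costs a small part of the overlap.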

When

We could do an "OR" between the current search criteria and the new "fuzzy" ones, but this means that, even when searching without typos, we will return a long tail of potentially irrelevant results.

A perhaps better solution would be to run the search without typo support first and, only if that search doesn't match anything, discard it and run a second, fuzzier search with typo support, then return the results of that second search.

Resources

I tried to explain how to do ngram search here: https://discourse.hibernate.org/t/slop-does-not-work-for-any-word/9253/6?u=yrodiere

As I mentioned above though, we probably don't want to put all predicates in the same query, but rather do something like this:

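// First attempt: the current, exact search.
var results = doSearchWithoutTypoSupport(params);
if (results.total().hitCountLowerBound() == 0) {
    // Nothing matched: fall back to the ngram-based, typo-tolerant search.
    results = doSearchWithTypoSupportUsingNgrams(params);
}
return results;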

PRs welcome :)

quarkusio / search.quarkus.io

Enhancement proposal - be permissive about typos when searching #306
