quarkusio / search.quarkus.io

Search backend for Quarkus websites
Apache License 2.0

Enhancement proposal - be permissive about typos when searching #306

Open rsvoboda opened 1 month ago

rsvoboda commented 1 month ago

I have an enhancement proposal to be permissive about typos when searching.

Here is an example: https://quarkus.io/guides/#q=aplication gives "Sorry, no guides matched your search. Please try again." Same for https://quarkus.io/guides/#q=Configuring+your+application vs. https://quarkus.io/guides/#q=Configuring+your+aplication

Is there a way to tolerate typos? They are quite common, especially for non-native speakers.

Some approximation (I think there was something for it in HS, i.e. Hibernate Search), maybe there is an existing mapping for common English misspelled words, ...

yrodiere commented 1 month ago

Hey,

This would be a nice feature indeed.

> there is an existing mapping for common English misspelled words, ...

I don't think a hard-coded list will work, no. Fortunately, there are other solutions :)

We need to consider two things IMO: how to match "approximately", and when to match approximately.

How

Fuzzy queries (which allow terms with one or two typos) are a thing, but I'd personally stay away from them, because:

  1. They reach their limits quite fast, and then you have to switch to a completely different solution.
  2. They are not available everywhere; e.g. I'm not sure we can use them "by default" in the simple query strings we're using right now in search.
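
To make "terms with one or two typos" concrete, here is a minimal sketch of an edit-distance check. It assumes plain Levenshtein distance and uses hypothetical names (`EditDistance`, `fuzzyMatches`); it only illustrates the idea and is not how the search backend implements fuzzy queries.

```java
final class EditDistance {

    /**
     * Levenshtein distance: the minimum number of single-character
     * insertions, deletions or substitutions needed to turn a into b.
     */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Hypothetical helper: true if the query is within maxEdits typos of the indexed term. */
    static boolean fuzzyMatches(String query, String indexed, int maxEdits) {
        return levenshtein(query, indexed) <= maxEdits;
    }
}
```

With this sketch, `fuzzyMatches("aplication", "application", 1)` is true, since a single missing "p" separates the two terms. A query with three or more typos falls outside the usual limit, which is the first drawback listed above.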

A better approach is to have dedicated fields using an ngram analyzer, e.g. turning each token into a list of 3-grams.
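
As a rough illustration of why 3-grams tolerate typos, the sketch below splits tokens into 3-grams in plain Java and compares a misspelled term with the correct one. The class and method names (`NGrams`, `ngrams`, `sharedGrams`) are hypothetical; a real setup would configure an ngram analyzer on a dedicated field rather than do this by hand.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NGrams {

    /** Splits a token into its overlapping n-grams, in order of appearance. */
    static List<String> ngrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    /** Returns the n-grams that two tokens have in common. */
    static Set<String> sharedGrams(String a, String b, int n) {
        Set<String> shared = new LinkedHashSet<>(ngrams(a, n));
        shared.retainAll(ngrams(b, n));
        return shared;
    }
}
```

Here `ngrams("aplication", 3)` yields apl, pli, lic, ica, cat, ati, tio, ion, and seven of those eight also appear among the 3-grams of "application". A query on the ngram field would therefore still match most of the indexed grams despite the typo.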

When

We could do an "OR" between the current search criteria and the new "fuzzy" ones, but this means that, when searching without typos, we will return a long tail of potentially irrelevant results.

A perhaps better solution would be to run the search without typo support first and, only if that search doesn't match anything, discard it and run a second, fuzzier search with typo support, then return the results of that second search.

Resources

I tried to explain how to do ngram search here: https://discourse.hibernate.org/t/slop-does-not-work-for-any-word/9253/6?u=yrodiere

As I mentioned above though, we probably don't want to put all predicates in the same query, but rather do something like this:

var results = doSearchWithoutTypoSupport(params);
if (results.total().hitCountLowerBound() == 0) {
   results = doSearchWithTypoSupportUsingNgrams(params);
}
return results;
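
For anyone who wants to try the fallback logic in isolation, here is a self-contained sketch of the same two-step idea. Everything in it is hypothetical (`FallbackSearch`, `searchWithFallback`, and the two search functions passed in); it only shows the control flow and does not use the actual search API.

```java
import java.util.List;
import java.util.function.Function;

final class FallbackSearch {

    /**
     * Runs the strict search first and falls back to the lenient
     * (typo-tolerant) search only when the strict one finds nothing.
     */
    static <Q, R> List<R> searchWithFallback(
            Q query,
            Function<Q, List<R>> strictSearch,
            Function<Q, List<R>> lenientSearch) {
        List<R> results = strictSearch.apply(query);
        if (!results.isEmpty()) {
            // Exact matching worked: no long tail of approximate hits.
            return results;
        }
        return lenientSearch.apply(query);
    }
}
```

Passing the two searches as functions keeps the typo-tolerant query out of the common path, so a search without typos returns exactly what it does today.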

PRs welcome :)