[Feature request] Prefix match support fuzziness

KazaBen commented 5 months ago

Is your feature request related to a problem? Please describe.

When implementing autocomplete application, most of the queries are incomplete words. Sometimes misspells happen in incomplete queries too. Let's look at an example:

Document = Fjallraven (notice double L) Query = Fjalr (incomplete query, with misspell: only a single L)

It is a must do for autocomplete application to be still able to match Fjallraven with user query Fjalr.

Describe the solution you'd like

When using Prefix match add support for Fuzziness.

Describe alternatives you've considered

Combining prefix matching with word fuzzy matching: term (prefix) contains OR term contains fuzzy However, it would not cover the previously mentioned common case: Document = Fjallraven (notice double L) Query = Fjalr (incomplete query, with misspell: only a single L)

term (prefix) contains couldn't match as it query has a misspell and there is no fuzziness support term contains fuzzy couldn't match as it is incomplete word

Additional context

This exact functionality exists in Elasticsearch’s Completion Suggester Fuzzy Queries

KazaBen commented 5 months ago

@vekterli @geirst Hi 👋

We have a major project to implement autocomplete on Vespa for Vinted, and we depend on fuzziness :o Are there any estimates when this could be implemented, or should we look for other solutions?

Thank you ❤️

tkaessmann commented 5 months ago

Wouldn‘t it be possible to add a simple component that first run a internal fuzzy matching query and with the corrected result a usual prefix query? Or did i understand something wrong?

Greetings, Tobias

vekterli commented 5 months ago

@KazaBen I've started working on this feature between a few other things. Implementation will be a two-step process:

Add support for prefix matching to our core Levenshtein algorithm implementations (Deterministic Finite Automata used for max edits {1, 2} and a generalized Levenshtein matrix fallback for max edits > 2). Work in progress.
Wire this new prefix matching mode into the query evaluation pipeline. Not yet started.

I should be able to give an estimate once I've started work on part 2 (hopefully within a few days, assuming nothing else pops up) and have gotten a gist of the complexity involved.

vekterli commented 4 months ago

Update: part 2 has been completed and its PR is pending code review. Once it has been merged and a new Vespa version has been released containing the changes, fuzzy prefix matching can be used by adding a prefix:true annotation to your query term. Example YQL:

{maxEditDistance:1,prefix:true}fuzzy("Fjalr")

Assuming no other blockers, I'd expect it to be available as part of an Open-source release some time next week. I'll update this issue with a concrete version number once that happens.

A few notes:

Fuzzy prefix matching will often end up matching a lot more terms than non-prefix matching, so this should be taken into account when constructing queries—in particular when query strings are short. For instance, the query {maxEditDistance:2,prefix:true}fuzzy("Fj") will match all terms since any possible prefix can be trivially transformed to Fj with 2 edits. This generalizes to be the case for every query where maxEditDistance >= len(query).
Related to the above, prefix locking (prefixLength:n) can be used alongside prefix matching to constrain the candidate set to terms that have prefix that exactly matches n characters of the query term. This also greatly speeds up dictionary scans.
Although this implements fuzzy prefix matching, one piece of the puzzle that is still missing is a ranking feature that exposes the actual edit distance between the term and the query. This is nothing new, as the same applies to non-prefix fuzzy queries. We have an existing ticket #24242 for adding this.

vekterli commented 4 months ago

@KazaBen version 8.337.85 is now on Docker hub, which contains support for fuzzy prefix matching. I haven't added it to the official documentation just yet, but my previous comment should have the most important bits of information (and caveats...!). Would be great if you could test it out and let me know if it solves your use case.

vekterli commented 3 months ago

Prefix matching with fuzzy semantics is fully supported and documented, so I'm marking this issue as closed.

I had a stab at implementing a fuzzy ranking feature, but It's Complicated™ due to the many different kinds of string index abstractions we have in the backend. It would likely have to be prioritized as its own, dedicated work package.

vekterli commented 3 months ago

Wouldn‘t it be possible to add a simple component that first run a internal fuzzy matching query and with the corrected result a usual prefix query? Or did i understand something wrong?

Greetings, Tobias

@tkaessmann sorry, I just realized I had entirely forgotten to reply to your message.

Initially running a regular fuzzy query based on the prefix alone would generally not return the results required to correct the query since the prefix would be too far away in edit distance from the complete term(s) (which is what the dictionary stores). There's also an inherent ambiguity in what "correct" would entail for any given input—a fuzzy prefix query for "ban" (with max edit distance 1) could be a search intent for e.g. "bananas" (exact prefix, 0 edits) or "batteries" (1 edit), or even "tanning salon" (still 1 edit). This would require expanding the input query to a potentially very large number of exact prefix queries.

The new fuzzy prefix feature requires only at most a single scan of the dictionary to find all matches—usually skipping large parts of it along the way due to our DFA-based optimizations.

I'll also very conveniently cite the May 2024 newsletter—hot off the presses:

A common use case for interactive search experiences is type-ahead searching. Here results start appearing immediately as the user types, getting more and more refined as the user keeps typing. One way to solve this is by using prefix matching; a user searching for the prefix “Edvard Gr” will get back documents matching “Edvard Grieg”. However, prefix matching requires the query and the document prefixes to match exactly; searching for “Edward Gr” or “Eddvard Gr” will not return anything at all. This may fail to surface many potential completions.

Vespa has supported “typo-friendly” fuzzy matching of terms for years, where a string matches if it can be transformed to the query within a query-specified maximum number of edits (inserts, deletions, or substitutions), but this matches entire strings and therefore cannot be used for prefixes. E.g. “Edward Gr” will not fuzzy-match “Edvard Grieg” with max edits of 1 because matching would require both replacing “w” with “v” as well as inserting “ieg” to the end, which is 4 edits in total.

Vespa 8.337 and beyond add support for fuzzy prefix matching, which combines the best of both worlds. A string is considered a match if its prefix can be transformed to the query within a specified maximum number of edits. This means “Edward Gr” and “Eddvard Gr” will match “Edvard Grieg”, “Mettal” will match “Metallica” and so on.

tkaessmann commented 3 months ago

@vekterli you got me with the performance argument :) Thanks for your response!

KazaBen commented 3 months ago

@vekterli Hi, sorry for late response, but we have tested this in production 👍

Overall feature works well, matches what is expected! Well done, and thank you! 🙇

Our observed latency was ~16ms, however, we have decided to not use this, as we use a different indexing technique to achieve the same in ~6ms.

Simply indexing an additional field title_edge_ngram, both attributes with fast-search. title: "zara shoes" title_edge_ngram: [z, za, zar, zara, zara s, zara sh, zara sho, zara shoe, zara shoes]

prefix query (~220ms) SELECT * FROM suggestions WHERE title CONTAINS{prefix:true} @user_query

dropping prefix queries and querying title_edge_ngram field (~25ms) SELECT * FROM suggestions WHERE title_edge_ngram CONTAINS @user_query

Same for fuzzy prefix queries (~16ms): SELECT * FROM suggestions WHERE title CONTAINS{prefixLength:2, maxEditDistance: 1, prefix: true} fuzzy(@user_query)

Dropping prefix queries (~5ms) SELECT * FROM suggestions WHERE title_edge_ngram CONTAINS{prefixLength:2, maxEditDistance: 1} fuzzy(@user_query)

Our use case:

60M documents
3 very big nodes
3 groups
10k RPS

vespa-engine / vespa

[Feature request] Prefix match support fuzziness #30720