vespa-engine / vespa

AI + Data, online. https://vespa.ai
https://vespa.ai
Apache License 2.0
5.58k stars 586 forks source link

Add support for "fuzzy" Levenshtein distance matching #13814

Closed jobergum closed 2 years ago

jobergum commented 4 years ago

We get a few questions on how to do fuzzy matching with Vespa. Current support for 'fuzzy' is through:

It could seem like people by fuzzy means matching strings where the Levenshtein distance is below some run time threshold. (https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/search/FuzzyQuery.html)

I think we should consider implementing a simple edit distance matcher, which matches a field if the distance is below some run time threshold.

field name type string {
  indexing: summary|attribute
  attribute:fast-search
}

Example query:

select * from sources * where name contains [{"distance":3}fuzzy("levenstein"); 

If there exist indexing techniques for speeding up evaluation we can consider adding a match setting with a configured max distance on the field setting as well.

geirst commented 2 years ago

Adding support for fuzzy matching is a rather comprehensive task as it involves changes in many parts of the code base. On a high level this is what needs to be done:

Overall:

Java:

C++:

geirst commented 2 years ago

Decision after architect meeting regarding YQL syntax:

leiless commented 2 years ago

Hi, @geirst. cc @bratseth, @alexeyche, @andreer.

I check the https://github.com/vespa-engine/vespa/pull/21689/files PR, but I haven't found the ranking result part. So I assume the text recall and ranking score are separated.

I try to implement fzf fuzzy match (Smith-Waterman) algorithm into vespa, like in form of SELECT foo FROM bar WHERE gee CONTAINS fzf("myterm"), yet I have no idea to deal with it. fzf fuzzy match algorithm combine recall & ranking together (since it's a standalone cli tool). But the vespa separates them, where I can attach the match positions and rank score?

(FYI, I have no previous knowledge of Vespa)

jobergum commented 2 years ago

This is complete and available https://docs.vespa.ai/en/reference/query-language-reference.html#fuzzy