krisk / Fuse

Lightweight fuzzy-search, in JavaScript
https://fusejs.io/
Apache License 2.0

Free word order and parts of words #41

Closed: netAction closed this issue 8 years ago

netAction commented 9 years ago

I really love the fuzziness, but fuse.js doesn't seem to find entries when the words are in a different order or when only parts of them are entered. Right now, Uni Mannheim matches FH Mannheim much more closely than University Mannheim.

krisk commented 9 years ago

Interesting. Let me dig into that a little more.

jeancroy commented 9 years ago

That is an interesting question.

Part of the answer is that when netAction says uni is much closer to university than to FH, he is intuitively using a metric close to the number of common characters in proper order (the length of the longest common subsequence, LLCS). The bitap algorithm, however, measures edit distance. To compare the two metrics, let's look at an example.

Matches are indicated by | and simple edit operations (insert/delete) by -. In the following example we have 5 matches and a distance of 3:

sur-ve-y
|||  | |
surg-ery

Levenshtein edit distance would have grouped the -v deletion and the g- insertion into a single substitution operation (marked :), giving a distance of 2:

surve-y
|||:| |
surgery
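
To make the two metrics concrete, here is a minimal plain-DP sketch (the helper names are mine, for illustration only, and not from Fuse) that reproduces the numbers above:

```js
// Length of the longest common subsequence ("number of matches").
function llcs(a, b) {
  const row = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    let prev = 0; // holds dp[i-1][j-1]
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j]; // dp[i-1][j]
      row[j] = a[i - 1] === b[j - 1] ? prev + 1 : Math.max(row[j], row[j - 1]);
      prev = tmp;
    }
  }
  return row[b.length];
}

// Levenshtein distance (insert, delete, and substitute each cost 1).
function levenshtein(a, b) {
  let row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const next = [i];
    for (let j = 1; j <= b.length; j++) {
      next[j] = Math.min(
        row[j] + 1,                                   // deletion
        next[j - 1] + 1,                              // insertion
        row[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
    row = next;
  }
  return row[b.length];
}

llcs('survey', 'surgery');        // 5 matches
// insert/delete-only distance = |a| + |b| - 2 * LLCS = 6 + 7 - 10 = 3
levenshtein('survey', 'surgery'); // 2 (the v/g pair counts once)
```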

Now, let's look at the case that started this thread:

uni-------    uni
|||           ::
university    FH-

uni vs university: 3 matches, but 7 insertions away.
uni vs FH: 0 matches, but 2 substitutions and 1 deletion away.
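
Running the hypothetical helpers from the sketch above on this case shows the disagreement directly: by match count, university wins; by distance, FH wins.

```js
llcs('uni', 'university');        // 3  (every query character matches)
levenshtein('uni', 'university'); // 7  (the "...versity" insertions)

llcs('uni', 'FH');                // 0  (no character in common)
levenshtein('uni', 'FH');         // 3  (2 substitutions + 1 deletion)
```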

Now, it's true that finding the alignment with the maximum number of matches and finding the one with the minimum edit distance are equivalent problems when comparing a whole string A to a whole string B.

But given a list of candidate matches, the result with the most matches (LLCS) and the result with the least distance are not the same, especially when the items have different lengths. There are a lot of subtle differences between those two metrics, especially when matching only part of A against part of B.

Computer science usually thinks in terms of edit distance for things like spell checkers, indexes, and file diffs. Computational biology, on the other hand, thinks in terms of LCS to match string representations of things like proteins or DNA. One interesting thing about computational biology is that it is mainly interested in finding alignments (like the figures I used above), so the field has developed solutions for the highlight feature that, as has been said, is highly desirable.

My theory is that LLCS is well suited for auto-complete, while distance might be better for auto-correct.
For example, in an auto-complete scenario the part that will be completed should not introduce a large penalty (...versity incurs a penalty of 7 and prevents uni from matching university!).
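
One way to read this theory in code, purely as an assumption of mine and not anything from an existing library: normalize the match count by the query length, so the uncompleted tail of the candidate costs nothing.

```js
// Hypothetical auto-complete score: 1.0 means every query character was
// found, in order, somewhere in the candidate; the candidate's extra
// length is not penalized. Uses the llcs sketch from above.
function autocompleteScore(query, candidate) {
  return llcs(query, candidate) / query.length;
}

autocompleteScore('uni', 'university'); // 1.0 -> perfect for completion
autocompleteScore('uni', 'FH');         // 0.0
```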

To test this, I started writing a searcher for Fuse, using a nice O(m+n) bit-parallel LCS algorithm to replace bitap's O(m*n) edit distance. (It has since become its own project, without dependencies.) The project also supports free word ordering and result highlighting (including free word order in highlights), and falls back to another algorithm when the bit representation does not fit in an int32.
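
For reference, here is a sketch of the classic bit-parallel LLCS recurrence (Allison-Dix style), which is the general technique being described; it is not necessarily this project's exact implementation. The pattern must fit in one machine word, which is exactly why a fallback algorithm is needed past 32 characters.

```js
// Bit-parallel LLCS: O(m) preprocessing, then one O(n) pass over the text,
// assuming the pattern length m <= 32.
function llcsBitParallel(pattern, text) {
  const m = pattern.length;
  if (m > 32) throw new Error('pattern too long: fall back to plain DP');

  // One match mask per pattern character: bit i is set if pattern[i] === c.
  const masks = {};
  for (let i = 0; i < m; i++) {
    masks[pattern[i]] = (masks[pattern[i]] || 0) | (1 << i);
  }

  let v = ~0; // all ones
  for (const ch of text) {
    const u = v & (masks[ch] || 0);
    // The '|' coerces back to int32, so the addition may safely overflow.
    v = (v + u) | (v - u);
  }

  // The LLCS is the number of zero bits among the low m bits of v.
  let count = 0;
  for (let i = 0; i < m; i++) {
    if (((v >> i) & 1) === 0) count++;
  }
  return count;
}

llcsBitParallel('uni', 'university'); // 3
```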

I have made a demo here

To compare, I used the Fuse books dataset as the small dataset (now would be the right time to ask permission to use it, if a bit late). The difference between O(m+n) and O(m*n) can be seen by comparing against the timed Fuse example. It's a bit of an apples-to-oranges comparison, but in the Fuse example there is a steady increase of about 1ms per character, which is about the time needed to do one pass over the data. The demo algorithm does not show this behaviour (or rather, it shows a steady increase per word instead of per character, because I do one pass for each token).

PS: This is my first try at releasing an open source project, so comments are more than welcome.

netAction commented 9 years ago

Maybe the current algorithm is not that bad. But it might be better to split the words and search them in any order. Treating spaces like any other character is a bad idea no matter what algorithm is used.

jeancroy commented 9 years ago

Oh, it can definitely be done. It would speed up the starting-position search, and slow down matching each word of the query against each word of the item.

It's just that there's quite a lot of plumbing required:

Split the query into space-separated words; this.pattern and this.patternAlphabet would become arrays of patterns and pattern alphabets.

analyzeText should split the item value into words, apply searcher.search to each of those words, and collate the best word match in each case.

search could either manage the array of patterns itself, or be given a single pattern/alphabet as an argument and be looped over by its caller. (I opted for the second option, because I have found that "for each query word, find the best item word, add the best score" gives better results than "for each item word, find the best query word, add the best score"; see the sketch below.)
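
A rough sketch of that collation order, with all names hypothetical (score is any "higher is better" word-vs-word match function, such as the llcs helper above):

```js
// For each query word, keep only its best-scoring word in the item,
// then sum those bests. Free word order falls out of the double loop.
function scoreItem(queryWords, itemWords, score) {
  let total = 0;
  for (const pattern of queryWords) {
    let best = 0;
    for (const word of itemWords) {
      best = Math.max(best, score(pattern, word));
    }
    total += best;
  }
  return total;
}

// 'uni mannheim' vs 'mannheim university':
// llcs('uni', 'university') = 3, llcs('mannheim', 'mannheim') = 8
scoreItem(['uni', 'mannheim'], ['mannheim', 'university'], llcs); // 11
```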

Lastly, this update would probably change the expected answers of some unit tests.