logannc / fuzzywuzzy-rs

port of https://github.com/seatgeek/fuzzywuzzy
GNU General Public License v2.0
40 stars 3 forks source link

Mega Change #28

Open logannc opened 3 years ago

logannc commented 3 years ago

fixes #24 fixes #22 fixes #20 fixes #7 (mostly - once you have extractWithoutOrder, the rest are basically just 'get top N' which can be done by callers)

6 might or might not make it depending on the wratio cleanups.

logannc commented 3 years ago

Okay, there is a lot here. Basically, I moved a lot into fuzzywuzzy_compatible implemented on top of various other new root crate level primitives (which, for the most part, have gracefully handled everything fuzzywuzzy wants to do). A lot got moved into fuzzywuzzy_compatible and re-exported into fuzz for now. If we improve the impls with non-compatible versions, we just stop exporting them. The apis should be the same.

I'll probably come back and edit this comment tomorrow with a more coherent/comprehensive summary.

Still stuff left to do:

logannc commented 3 years ago

tests are passing locally on nightly, its just intersperse giving us grief.

logannc commented 3 years ago

Let's see, what else...

Processor trait vs normalizers? Scorer vs calcs? Redefine misc utils in terms of primitives (i.e., full_process) Better defaults and docs on normalizes

A couple good normalizers plus grapheme segmentation is probably the best default, w/ codepoints segmentation an occasional perf boost.

I'm still not happy I can't easily represent the token set stuff and wonder if I can't unify that somehow.

Are there any other complex normalization sequences that would be useful to have default impls?

logannc commented 3 years ago

Welp, get_matching_blocks has a bug. for the inputs "frodo baggin" and "samwise g" (after codepoint segmentation), it generates a bad item in the queue which causes debug assertions to panic.

I think this wasn't a problem in python because the python find_longest_match does a weird iterative solution different from ours.

maxbachmann commented 3 years ago

I think this wasn't a problem in python because the python find_longest_match does a weird iterative solution different from ours.

Not sure whether it helps, but here is my C++ implementation of get_matching_blocks. https://github.com/maxbachmann/rapidfuzz-cpp/blob/a494e5557087203f54e76e49815c68324114b9e5/rapidfuzz/details/matching_blocks.hpp#L58 It is essentially the Python implementation without all autojunk parts + hash maps to improve the performance

logannc commented 3 years ago

I haven't looked at it in a few days but the problem was one of (I can't remember which offhand)

debug_assert!(low1 <= high1);
debug_assert!(low2 <= high2);
debug_assert!(high1 <= shorter.len());
debug_assert!(high2 <= longer.len());
debug_assert!(high1 - low1 <= high2 - low2);

was failing. All of these ought to be maintained for each call to find_longest_match() under the current implementation, I think.

I think your implementation suffers the same issue with

if ((spos + length) < a_high && (dpos + length) < b_high) {
    queue.emplace_back(spos + length, a_high, dpos + length, b_high);
}

I have not looked at this since I noted it last week.

maxbachmann commented 3 years ago

I will test this in my implementation. However this is directly taken from the Python implementation (and appears to be the same in your implementation as well). I would personally assume the bug is somewhere in find_longest_match, since there your using quite a different implementation. RapidFuzz uses pretty much the same implementation as Difflib in there, but since I do not use the auto junk feature I was able to get rid of the hashmap b2j

RapidFuzz

https://github.com/maxbachmann/rapidfuzz-cpp/blob/a494e5557087203f54e76e49815c68324114b9e5/rapidfuzz/details/matching_blocks.hpp#L120-L128

      if (length) {
        if (a_low < spos && b_low < dpos) {
          queue.emplace_back(a_low, spos, b_low, dpos);
        }
        if ((spos + length) < a_high && (dpos + length) < b_high) {
          queue.emplace_back(spos + length, a_high, dpos + length, b_high);
        }
        matching_blocks_pass1.emplace_back(spos, dpos, length);
      }

fuzzywuzzy-rs

https://github.com/logannc/fuzzywuzzy-rs/blob/f787942c7350014604ec2e8e73dbd518ae3650bd/src/primitives.rs#L207-L215

difflib

https://github.com/python/cpython/blob/fa03efda3dc6ad118788bebc61079cd481c0b24c/Lib/difflib.py#L490-L495

            if k:   # if k is 0, there was no matching block
                matching_blocks.append(x)
                if alo < i and blo < j:
                    queue.append((alo, i, blo, j))
                if i+k < ahi and j+k < bhi:
                    queue.append((i+k, ahi, j+k, bhi))