logannc opened 3 years ago
Okay, there is a lot here. Basically, I moved a lot into `fuzzywuzzy_compatible`, implemented on top of various new root-crate-level primitives (which, for the most part, have gracefully handled everything fuzzywuzzy wants to do). Those compatible implementations are re-exported into `fuzz` for now; if we improve them with non-compatible versions, we just stop re-exporting them. The APIs should stay the same.
I'll probably come back and edit this comment tomorrow with a more coherent/comprehensive summary.
Still stuff left to do:
- `fuzzywuzzy_compatible` should probably be `python_compatible`, because we're also named fuzzywuzzy?
- Tests are passing locally on nightly; it's just `intersperse` giving us grief.
Let's see, what else...
- `Processor` trait vs. normalizers? `Scorer` vs. calcs?
- Redefine misc utils in terms of primitives (e.g., `full_process`).
- Better defaults and docs on normalizers. A couple of good normalizers plus grapheme segmentation is probably the best default, with codepoint segmentation as an occasional perf boost; see the sketch below.
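To make the "normalizers plus grapheme segmentation" default concrete, here is a minimal sketch assuming the `unicode-segmentation` crate as a dependency; the `lowercase_trim` helper is hypothetical, not anything this PR actually defines:

```rust
// Illustrative sketch only. Requires unicode-segmentation in Cargo.toml.
use unicode_segmentation::UnicodeSegmentation;

// Hypothetical normalizer: the kind of "good default" discussed above.
fn lowercase_trim(s: &str) -> String {
    s.trim().to_lowercase()
}

fn main() {
    let normalized = lowercase_trim("  Frodo Baggins  ");
    // Grapheme segmentation: the safer default for user-facing text.
    let graphemes: Vec<&str> = normalized.graphemes(true).collect();
    // Codepoint segmentation: cheaper, fine for ASCII-ish input.
    let codepoints: Vec<char> = normalized.chars().collect();
    // Equal for this ASCII input, but not in general (e.g. combining marks).
    assert_eq!(graphemes.len(), codepoints.len());
    println!("{:?}\n{:?}", graphemes, codepoints);
}
```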
I'm still not happy I can't easily represent the token set stuff and wonder if I can't unify that somehow.
Are there any other complex normalization sequences that would be useful to have default impls?
Welp, `get_matching_blocks` has a bug. For the inputs "frodo baggin" and "samwise g" (after codepoint segmentation), it generates a bad item in the queue, which causes debug assertions to panic.
I think this wasn't a problem in Python because the Python `find_longest_match` does a weird iterative solution different from ours.
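For reference, the failing case boils down to something like this (a sketch only; the exact `get_matching_blocks` signature here is an assumption, not the crate's confirmed API):

```rust
// Hypothetical repro sketch; the real get_matching_blocks signature may differ.
fn main() {
    // Codepoint segmentation, as described above.
    let a: Vec<char> = "frodo baggin".chars().collect();
    let b: Vec<char> = "samwise g".chars().collect();
    // In a debug build, this call panicked on the queue invariants quoted below.
    let _blocks = get_matching_blocks(&a, &b);
}
```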
maxbachmann commented:

Not sure whether it helps, but here is my C++ implementation of `get_matching_blocks`: https://github.com/maxbachmann/rapidfuzz-cpp/blob/a494e5557087203f54e76e49815c68324114b9e5/rapidfuzz/details/matching_blocks.hpp#L58. It is essentially the Python implementation without all the autojunk parts, plus hash maps to improve performance.
logannc commented:

I haven't looked at it in a few days, but the problem was that one of these (I can't remember which offhand)

```rust
debug_assert!(low1 <= high1);
debug_assert!(low2 <= high2);
debug_assert!(high1 <= shorter.len());
debug_assert!(high2 <= longer.len());
debug_assert!(high1 - low1 <= high2 - low2);
```

was failing. All of these ought to be maintained for each call to `find_longest_match()` under the current implementation, I think.
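If it helps keep the checks close to the failure, the asserts can live in a thin wrapper around the search call. This is only a sketch, and the parameter list is a guess at the PR's actual `find_longest_match` signature:

```rust
// Sketch: assert the queue-item invariants on entry to every search call.
// `find_longest_match` is a stand-in for the PR's actual function.
fn find_longest_match_checked(
    shorter: &[char],
    longer: &[char],
    low1: usize, high1: usize,
    low2: usize, high2: usize,
) -> (usize, usize, usize) {
    debug_assert!(low1 <= high1);
    debug_assert!(low2 <= high2);
    debug_assert!(high1 <= shorter.len());
    debug_assert!(high2 <= longer.len());
    debug_assert!(high1 - low1 <= high2 - low2);
    find_longest_match(shorter, longer, low1, high1, low2, high2)
}
```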
I think your implementation suffers from the same issue with

```cpp
if ((spos + length) < a_high && (dpos + length) < b_high) {
    queue.emplace_back(spos + length, a_high, dpos + length, b_high);
}
```
I have not looked at this since I noted it last week.
maxbachmann commented:

I will test this in my implementation. However, this is taken directly from the Python implementation (and appears to be the same in your implementation as well). I would personally assume the bug is somewhere in `find_longest_match`, since you're using quite a different implementation there. RapidFuzz uses pretty much the same implementation as difflib in there, but since I do not use the autojunk feature, I was able to get rid of the hashmap `b2j`.
```cpp
if (length) {
    if (a_low < spos && b_low < dpos) {
        queue.emplace_back(a_low, spos, b_low, dpos);
    }
    if ((spos + length) < a_high && (dpos + length) < b_high) {
        queue.emplace_back(spos + length, a_high, dpos + length, b_high);
    }
    matching_blocks_pass1.emplace_back(spos, dpos, length);
}
```
And the corresponding section of the Python (difflib) implementation:

```python
if k:   # if k is 0, there was no matching block
    matching_blocks.append(x)
    if alo < i and blo < j:
        queue.append((alo, i, blo, j))
    if i+k < ahi and j+k < bhi:
        queue.append((i+k, ahi, j+k, bhi))
```
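For comparison, here is that same subdivision loop sketched in Rust, with `find_longest_match` again standing in for whatever this PR settles on. Every queue item pushed here should satisfy the debug asserts quoted above:

```rust
// Sketch of difflib's get_matching_blocks queue loop in Rust.
// find_longest_match(a, b, alo, ahi, blo, bhi) -> (i, j, k) is assumed.
fn matching_blocks(a: &[char], b: &[char]) -> Vec<(usize, usize, usize)> {
    let mut queue = vec![(0, a.len(), 0, b.len())];
    let mut blocks = Vec::new();
    while let Some((alo, ahi, blo, bhi)) = queue.pop() {
        let (i, j, k) = find_longest_match(a, b, alo, ahi, blo, bhi);
        if k > 0 {
            // if k is 0, there was no matching block
            blocks.push((i, j, k));
            // Recurse on the region to the left of the match...
            if alo < i && blo < j {
                queue.push((alo, i, blo, j));
            }
            // ...and on the region to the right of the match.
            if i + k < ahi && j + k < bhi {
                queue.push((i + k, ahi, j + k, bhi));
            }
        }
    }
    // difflib additionally merges adjacent blocks in a second pass, omitted here.
    blocks.sort_unstable();
    blocks
}
```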
Fixes #24, fixes #22, fixes #20, fixes #7 (mostly: once you have extractWithoutOrder, the rest are basically just "get top N", which can be done by callers). #6 might or might not make it, depending on the WRatio cleanups.
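On the #7 point, "get top N" over a hypothetical `extract_without_order` (the name and return type here are assumptions for illustration) is just a sort-and-truncate callers can do themselves:

```rust
// Sketch only: extract_without_order is assumed to return (choice, score) pairs.
fn extract_top_n<'a>(
    query: &str,
    choices: &[&'a str],
    n: usize,
) -> Vec<(&'a str, u8)> {
    let mut scored: Vec<(&'a str, u8)> = extract_without_order(query, choices);
    // Sort descending by score, then keep the first n entries.
    scored.sort_by(|x, y| y.1.cmp(&x.1));
    scored.truncate(n);
    scored
}
```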