jamesturk / jellyfish

🪼 a Python library for approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License

Performance regression hamming #188

Open maxbachmann opened 1 year ago

maxbachmann commented 1 year ago

The new Rust backend appears to lead to a pretty steep performance regression in the hamming implementation:

Old:

[benchmark plot: hamming_old]

New:

[benchmark plot: hamming]

I am sure this should be fairly simple to fix, given the low complexity of the Hamming similarity.
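(For context: both libraries' Hamming functions compare positions pairwise and count a length difference as that many extra mismatches, as the implementations below confirm. A minimal Python equivalent of that convention, illustrative rather than either library's actual code:)

from itertools import zip_longest

def hamming_reference(s1: str, s2: str) -> int:
    # Positions that differ count as 1; any length difference counts as
    # one mismatch per extra character (the None fill value never equals a char).
    return sum(c1 != c2 for c1, c2 in zip_longest(s1, s2))

assert hamming_reference("abc", "abd") == 1
assert hamming_reference("abc", "abcde") == 2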

jamesturk commented 1 year ago

Can you provide the benchmark code you are using, and details on the machine? This is not in line with what I'm seeing in my benchmarks, and I'm wondering whether it's an apples-to-apples comparison, since the Unicode normalization is 90 percent of the runtime.

jamesturk commented 1 year ago

One other question, if you're able to answer it: some Hamming implementations refuse strings of dissimilar length, which is the only way you'd get O(1) performance. Is that what this other library is doing? The old C implementation of hamming is about as concise as can be, so I am really curious what's going on in the original bench.

maxbachmann commented 1 year ago

> Can you provide the benchmark code you are using, and details on the machine? This is not in line with what I'm seeing in my benchmarks, and I'm wondering whether it's an apples-to-apples comparison, since the Unicode normalization is 90 percent of the runtime.

I do not see why you would need any Unicode normalization: since you receive Unicode strings from Python, this feels like a waste of time. The whole benchmark script is a bit longer, since it allows things like visualization and generation of a random dataset.

> One other question, if you're able to answer it: some Hamming implementations refuse strings of dissimilar length, which is the only way you'd get O(1) performance. Is that what this other library is doing? The old C implementation of hamming is about as concise as can be, so I am really curious what's going on in the original bench.

This other library does the same as jellyfish:

• it accepts strings with any amount of similarity differences
• it accepts length differences and adds them as insertions/deletions

I am really stunned by the performance difference as well. Did the old implementation copy the string for the comparison? In rapidfuzz I always work with a view on the internal representation inside the Python string.

This is a very simple benchmark (I can share the benchmark script mentioned above if that helps as well). These are very long sequences, so the Python call overhead gets less relevant.

from timeit import timeit

setup="""
from jellyfish import hamming_distance as jellyfish
from rapidfuzz.distance.Hamming import distance as rapidfuzz
a="a"*200000
b="b"*200000
"""

print(timeit("jellyfish(a, b)", setup=setup, number=1000))
print(timeit("jellyfish(a, a)", setup=setup, number=1000))
print(timeit("rapidfuzz(a, b)", setup=setup, number=1000))
print(timeit("rapidfuzz(a, a)", setup=setup, number=1000))

This results in:

old library version:

0.386396555997635
0.4072674199996982
0.03347623199806549
0.0331174099992495

new library version:

4.556943121999211
4.313729999998031
0.033683902001939714
0.03330945900233928

jamesturk commented 1 year ago

Gotcha, I'm fine with the performance difference then. These are definitely different use cases, and the performance difference on realistic-length strings is negligible. This is just PyO3/Rust overhead, and the Unicode normalization matters because going character by character isn't the right way to handle Unicode strings, since there are multi-code-point characters.
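To illustrate the multi-code-point issue (a minimal example, not code from either library): the same character can be one code point or two, so comparing code point by code point without normalization would report a nonzero distance between canonically equal strings.

import unicodedata

composed = "\u00e9"     # 'é' as a single precomposed code point (U+00E9)
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

print(composed == decomposed)          # False: the code points differ
print(len(composed), len(decomposed))  # 1 2: a naive Hamming loop misaligns

# Normalizing both strings to the same form makes them compare equal:
print(unicodedata.normalize("NFD", composed) == decomposed)  # True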

Thanks for raising this, but I don't see any significant changes worth making.

This is the Rust code sans normalization:

pub fn vec_hamming_distance<T: PartialEq>(s1: &FastVec<T>, s2: &FastVec<T>) -> usize {
    let (longer, shorter) = if s1.len() > s2.len() {
        (s1, s2)
    } else {
        (s2, s1)
    };

    // distance is difference in length + differing chars
    let mut distance = longer.len() - shorter.len();
    for (i, c) in shorter.iter().enumerate() {
        if *c != longer[i] {
            distance += 1
        }
    }

    distance
}

FastVec is a stack-based string vector for short strings & uses the heap for anything longer than ~40 chars.

maxbachmann commented 1 year ago

> This is just PyO3/Rust overhead, and the Unicode normalization matters because going character by character isn't the right way to handle Unicode strings, since there are multi-code-point characters.

Hm, out of interest, does Python not perform the same kinds of normalization? As in, does:

[x for x in my_string]

lead to any incorrect results?

maxbachmann commented 1 year ago

Hm, I just had a look at the old implementation, which calls unicodedata_normalize as well. Maybe I should do this too. Sounds like this would be worth it in terms of correctness :+1:

For most metrics this should not really have a big impact on performance.

maxbachmann commented 1 year ago

Hm, especially for ASCII strings (which I benchmarked) this should not have any performance impact, since they are already normalized. So it should be possible to avoid the overhead in this case.
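A minimal sketch of such a fast path (a hypothetical helper, not code from either library; the normalization form is assumed to be NFKD here purely for illustration):

import unicodedata

def normalize_if_needed(s: str, form: str = "NFKD") -> str:
    # Pure-ASCII strings are unchanged by every Unicode normalization
    # form, so the comparatively expensive normalize() call can be
    # skipped. str.isascii() is O(1) on CPython because the ASCII flag
    # is stored in the string object itself.
    if s.isascii():
        return s
    return unicodedata.normalize(form, s)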

jamesturk commented 1 year ago

note to self: Alternate Rust implementation to benchmark

pub fn vec_hamming_distance<T: PartialEq>(s1: &FastVec<T>, s2: &FastVec<T>) -> usize {
    let (longer, shorter) = if s1.len() > s2.len() {
        (s1, s2)
    } else {
        (s2, s1)
    };

    // distance is difference in length + differing chars
    let len_diff = longer.len() - shorter.len();
    let differing_chars = shorter.iter().zip(longer.iter()).map(|(a, b)| (a != b) as usize).sum();

    len_diff + differing_chars
}

Profiling the code shows that a lot of the time is spent in normalization whether it is necessary or not, but I'm fine with that in the name of correctness at edge cases (this library had plenty more of those before adding this normalization years ago). We're talking about nearly immeasurable amounts of time in practice, so the correctness/safety is worth it for these use cases.

https://github.com/jamesturk/jellyfish/blob/main/benchmarks/compare.ipynb

The current average difference from the C version is 1.5x in practice. I see that with a 200k-long string you get different results, but I can't think of a realistic use case anywhere near that long 😄

maxbachmann commented 1 year ago

> The current average difference from the C version is 1.5x in practice. I see that with a 200k-long string you get different results, but I can't think of a realistic use case anywhere near that long 😄

This was mostly an odd occurrence I noticed when benchmarking my other metrics; my first thought was that my benchmarks had to be broken somehow. I do however know of a user of my library who uses the Levenshtein distance plus Levenshtein edit-operation backtracing with sequences of over 100k characters (to compare OCR results of archive documents) :wink:

jamesturk commented 1 year ago

Gotcha, I'll keep this open until I get a chance to prod at the alternate implementation above. I definitely appreciate you raising it in case there is some low-hanging fruit, but I don't want to micro-optimize and lose clarity/correctness for this one.

maxbachmann commented 1 year ago

Yes, I completely understand this. It looks pretty damning in the benchmarks, but given the low overall times it is not really of great importance.

The implementations of Levenshtein and Jaro used in jellyfish are the ones with the largest potential for optimization, but these optimizations are a lot more complex as well :sweat:

[benchmark plots: levenshtein, jaro]

jamesturk commented 1 year ago

These are interesting for sure; I'll need to add some length-based benchmarks on my end to see how this changes.

A tradeoff I made with the Rust version is the choice of a vector type that can switch between the stack & the heap; it's entirely predicated on the idea that most people are passing short strings. I realize I don't know whether that's true or not; it's really just my own use cases/intuition that people are usually sending words, not sentences.

Have you heard much from folks sending strings >30 characters in on a regular basis? I'm wondering what use cases I'm missing.

maxbachmann commented 1 year ago

> A tradeoff I made with the Rust version is the choice of a vector type that can switch between the stack & the heap.

Yes, that's a pretty cool optimization. In C++ this is used in the std::string class as the short string optimization, for example.

> it's entirely predicated on the idea that most people are passing short strings. I realize I don't know whether that's true or not; it's really just my own use cases/intuition that people are usually sending words, not sentences.

Most of my users have relatively short strings, but tons of them (as in millions of strings).

> Have you heard much from folks sending strings >30 characters in on a regular basis? I'm wondering what use cases I'm missing.

For Jaro / JaroWinkler I have only seen relatively short sequences used, probably because the metric was designed to compare sequences with spelling mistakes. For the Levenshtein distance I have absolutely seen people using longer strings. The example I was talking about earlier was https://github.com/qurator-spk/dinglehopper; it's used a lot in computational biology as well.

Optimizations

I have tons of different optimizations I do behind the scenes, while trying to provide the user with a relatively simple interface. I provide an API to compare single sequences, similar to jellyfish, but one to compare multiple sequences in parallel as well. The API comparing single sequences is a lot more limited in possible optimizations.

For these functions I use the following things:

When comparing multiple sequences at once I do the following things:

Algorithms

These are the complexities of my implementations of the algorithms used in jellyfish:

algorithm           memory complexity   time complexity
Levenshtein         O(N)                O([M/64]*N)
Jaro                O(N)                O([M/64]*N)
JaroWinkler         O(N)                O([M/64]*N)
DamerauLevenshtein  O(N)                O(M*N)

These algorithms have the following complexities in jellyfish:

algorithm           memory complexity   time complexity
Levenshtein         O(N)                O(M*N)
Jaro                O(N)                O(M*N)
JaroWinkler         O(N)                O(M*N)
DamerauLevenshtein  O(M*N)              O(M*N)
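For illustration, here is a minimal Python sketch of the bit-parallel scheme (Myers, 1999) that the O([M/64]*N) rows above refer to. This is not rapidfuzz's actual code; Python integers are arbitrary precision, so the sketch masks to M bits explicitly where a C or Rust implementation would process the pattern in 64-bit blocks, which is where the [M/64] factor comes from.

def myers_levenshtein(a: str, b: str) -> int:
    # Bit-parallel Levenshtein distance (Myers, 1999). Each column of the
    # dynamic-programming matrix is encoded as bit vectors of +1/-1 vertical
    # deltas, so one text character is processed with O(1) word operations.
    m = len(a)
    if m == 0:
        return len(b)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    # peq[c] has bit i set iff a[i] == c
    peq = {}
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv, score = mask, 0, m  # +1 deltas, -1 deltas, distance so far
    for ch in b:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) & mask) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)  # horizontal +1 deltas
        mh = pv & xh                   # horizontal -1 deltas
        if ph & high:
            score += 1
        if mh & high:
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
    return score

assert myers_levenshtein("kitten", "sitting") == 3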