Open maxbachmann opened 1 year ago

The new Rust backend appears to lead to a pretty steep performance regression in the Hamming implementation:

Old

[benchmark screenshot hamming_old: https://user-images.githubusercontent.com/44199644/233227647-b3422ac3-1cf5-4504-ae89-e779b7155ec4.png]

New

[benchmark screenshot]

I am sure this should be fairly simple to fix given the low complexity of the Hamming similarity.
Can you provide the benchmark code you are using, and details on the machine? This is not in line with what I'm seeing in my benchmarks, and I'm wondering whether it's an apples-to-apples comparison, since the Unicode normalization is 90 percent of the runtime.
One other question if you're able to answer it: some Hamming implementations refuse strings of dissimilar length, which is the only way you'd get O(1) performance. Is that what this other library is doing? The old C implementation of Hamming is about as concise as can be, so I am really curious what's going on in the original bench.
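For illustration, the "strict" variant being asked about might look roughly like the hypothetical sketch below; as clarified further down, neither jellyfish nor rapidfuzz actually behaves this way.

// Hypothetical sketch: a strict Hamming implementation that refuses strings of
// unequal length instead of counting the length difference as extra edits.
// This is NOT jellyfish's or rapidfuzz's behavior; both accept unequal lengths.
fn strict_hamming(a: &str, b: &str) -> Result<usize, &'static str> {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.len() != b.len() {
        return Err("hamming distance requires equal-length strings");
    }
    // Count positions where the two strings differ.
    Ok(a.iter().zip(b.iter()).filter(|(x, y)| x != y).count())
}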
> Can you provide the benchmark code you are using, and details on the machine? This is not in line with what I'm seeing in my benchmarks, and I'm wondering whether it's an apples-to-apples comparison, since the Unicode normalization is 90 percent of the runtime.
I do not see why you would need any Unicode normalization. Since you receive Unicode strings from Python, this feels like a waste of time. The whole benchmark script is a bit longer, since it allows things like visualization + generation of a random dataset.
> One other question if you're able to answer it: some Hamming implementations refuse strings of dissimilar length, which is the only way you'd get O(1) performance. Is that what this other library is doing? The old C implementation of Hamming is about as concise as can be, so I am really curious what's going on in the original bench.
This other library does the same as jellyfish:

* it accepts strings with any number of differences
* it accepts length differences and adds them as insertions/deletions

I am really stunned by the performance difference as well. Did the old implementation copy the string for the comparison? In rapidfuzz I always work with a view on the internal representation inside the Python string.
This is a very simple benchmark (I can share the benchmark script mentioned above if that helps as well). These are very long sequences, so the Python call overhead becomes less relevant.
from timeit import timeit
setup="""
from jellyfish import hamming_distance as jellyfish
from rapidfuzz.distance.Hamming import distance as rapidfuzz
a="a"*200000
b="b"*200000
"""
print(timeit("jellyfish(a, b)", setup=setup, number=1000))
print(timeit("jellyfish(a, a)", setup=setup, number=1000))
print(timeit("rapidfuzz(a, b)", setup=setup, number=1000))
print(timeit("rapidfuzz(a, a)", setup=setup, number=1000))
This results in:

old library version:
0.386396555997635
0.4072674199996982
0.03347623199806549
0.0331174099992495
new library version:
4.556943121999211
4.313729999998031
0.033683902001939714
0.03330945900233928
Gotcha, I'm fine with the performance difference then. These are definitely different use cases, and the difference on realistic-length strings is negligible. This is just PyO3/Rust overhead, and the Unicode normalization matters because going character by character isn't the right way to handle Unicode strings, since some characters span multiple code points.
Thanks for raising this, but I don't see any significant changes worth making.
This is the Rust code sans normalization:
pub fn vec_hamming_distance<T: PartialEq>(s1: &FastVec<T>, s2: &FastVec<T>) -> usize {
let (longer, shorter) = if s1.len() > s2.len() {
(s1, s2)
} else {
(s2, s1)
};
// distance is difference in length + differing chars
let mut distance = longer.len() - shorter.len();
for (i, c) in shorter.iter().enumerate() {
if *c != longer[i] {
distance += 1
}
}
distance
}
FastVec is a stack-based string vector for short strings & uses the heap for anything longer than ~40 chars.
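For reference, a minimal sketch of that kind of small-buffer-optimized vector, assuming something like the smallvec crate; the type name FastVecSketch and the inline capacity are illustrative, not FastVec's actual implementation.

use smallvec::SmallVec;

// Small-buffer optimization (the same idea as C++'s short string optimization):
// up to 64 elements are stored inline on the stack, longer inputs spill to the
// heap. jellyfish's FastVec reportedly spills past ~40 chars.
type FastVecSketch<T> = SmallVec<[T; 64]>;

fn to_vec(s: &str) -> FastVecSketch<char> {
    // Short strings stay entirely on the stack; long ones allocate once.
    s.chars().collect()
}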
> This is just PyO3/Rust overhead, and the Unicode normalization matters because going character by character isn't the right way to handle Unicode strings, since some characters span multiple code points.
Hm, out of interest, does Python not perform the same kind of normalization? As in, does

[x for x in my_string]

lead to any incorrect results?
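For context, this is the kind of mismatch normalization guards against; a small stand-alone illustration, not taken from the thread.

fn main() {
    // The same visible character can be encoded as one code point or as a base
    // letter plus a combining mark.
    let precomposed = "\u{00E9}";  // "é" as a single code point
    let decomposed = "e\u{0301}";  // "e" followed by COMBINING ACUTE ACCENT
    assert_eq!(precomposed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);
    // Comparing code point by code point without normalizing first reports a
    // difference here, even though both sequences render as "é".
    assert_ne!(precomposed, decomposed);
}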
Hm, I just had a look at the old implementation, which calls unicodedata_normalize as well. Maybe I should do this too. Sounds like this would be worth it in terms of correctness :+1:
For most metrics this should not really have a big impact on performance.
Hm, especially for ASCII strings (which I benchmarked) this should not have any performance impact, since they are already normalized. So it should be possible to avoid the overhead in this case.
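A rough sketch of such an ASCII fast path, assuming normalization is done with the unicode-normalization crate; the function name and the choice of NFKD are assumptions for illustration, not necessarily how jellyfish implements it.

use unicode_normalization::UnicodeNormalization;

fn chars_for_comparison(s: &str) -> Vec<char> {
    if s.is_ascii() {
        // ASCII text is left unchanged by every Unicode normalization form,
        // so the normalization pass can be skipped entirely.
        s.chars().collect()
    } else {
        // Normalization form chosen for the sketch; the form jellyfish actually
        // uses may differ.
        s.nfkd().collect()
    }
}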
note to self: Alternate Rust implementation to benchmark
pub fn vec_hamming_distance<T: PartialEq>(s1: &FastVec<T>, s2: &FastVec<T>) -> usize {
let (longer, shorter) = if s1.len() > s2.len() {
(s1, s2)
} else {
(s2, s1)
};
// distance is difference in length + differing chars
let len_diff = longer.len() - shorter.len();
let differing_chars: usize = shorter.iter().zip(longer.iter()).map(|(a, b)| (a != b) as usize).sum();
len_diff + differing_chars
}
Profiling the code shows that a lot of the time is spent there whether it is necessary or not, but I'm fine with that in the name of correctness at edge cases (which this library had plenty more of before adding this normalization years ago). We're talking about nearly immeasurable amounts of time in practice, so the correctness/safety is worth it for these use cases.
https://github.com/jamesturk/jellyfish/blob/main/benchmarks/compare.ipynb
The current average difference from the C version is 1.5x in practice. I see that with a 200k-character string you get different results, but I can't think of a realistic use case anywhere near that long 😄
> The current average difference from the C version is 1.5x in practice. I see that with a 200k-character string you get different results, but I can't think of a realistic use case anywhere near that long 😄
This was mostly an odd occurrence I noticed when benchmarking my other metrics. My first thought was that my benchmarks had to be broken somehow. I do, however, know of a user of my library using the Levenshtein distance + Levenshtein edit-operation backtracing with sequences of over 100k characters (to compare OCR results of archive documents) :wink:
Gotcha, I'll keep this open until I get a chance to prod at the alternate implementation above. I definitely appreciate you raising it in case there is some low-hanging fruit, but I don't want to micro-optimize and lose clarity/correctness for this one.
Yes, I completely understand this. It looks pretty damning in the benchmarks, but given the low overall times it is not really that important.
The implementations of Levenshtein and Jaro used in jellyfish are the ones with the largest potential for optimization, but those optimizations are a lot more complex as well :sweat:
These are interesting for sure, I'll need to add some length-based benchmarks on my end to see how this changes.
A tradeoff I made with the Rust version is the choice of a vector type that can hop between the stack & the heap. It's entirely predicated on the idea that most people are passing short strings. I realize I don't know if that's true or not; it's really just my own use cases/intuition that people are usually sending words, not sentences.
Have you heard much from folks sending strings >30 characters in on a regular basis? I'm wondering what use cases I'm missing.
> A tradeoff I made with the Rust version is the choice of a vector type that can hop between the stack & the heap.
Yes, that's a pretty cool optimization. In C++ this is used in the std::string class as the short string optimization, for example.
> It's entirely predicated on the idea that most people are passing short strings. I realize I don't know if that's true or not; it's really just my own use cases/intuition that people are usually sending words, not sentences.
Most of my users have relatively short strings, but tons of them (as in millions of strings).
> Have you heard much from folks sending strings >30 characters in on a regular basis? I'm wondering what use cases I'm missing.
For Jaro / JaroWinkler I have only seen relatively short sequences used. This is probably because the metric was designed to compare sequences with spelling mistakes. For the Levenshtein distance I have absolutely seen people using longer strings. The example I was talking about earlier was https://github.com/qurator-spk/dinglehopper; it's used a lot in computational biology as well.
I have tons of different optimizations I do behind the scenes, while I try to provide the user with a relatively simple interface. I provide an API to compare single sequences, similar to jellyfish, but I provide one to compare multiple sequences in parallel as well. The API comparing single sequences is a lot more limited in possible optimizations.
For these functions I use the following things:
When comparing multiple sequences at once I do the following things:
These are the complexities of my implementations of the algorithms used in jellyfish:
algorithm | memory complexity | time complexity |
---|---|---|
Levenshtein | O(N) | O([M/64]*N) |
Jaro | O(N) | O([M/64]*N) |
JaroWinkler | O(N) | O([M/64]*N) |
DamerauLevenshtein | O(N) | O(M*N) |
These algorithms have the following complexities in jellyfish:

algorithm | memory complexity | time complexity |
---|---|---|
Levenshtein | O(N) | O(M*N) |
Jaro | O(N) | O(M*N) |
JaroWinkler | O(N) | O(M*N) |
DamerauLevenshtein | O(M*N) | O(M*N) |
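For context on the O([M/64]*N) rows above: they refer to bit-parallel algorithms in the style of Myers (1999) / Hyyrö, which process up to 64 pattern positions per machine word. A minimal sketch of the technique for patterns of at most 64 characters, as an illustration rather than RapidFuzz's actual code:

use std::collections::HashMap;

// Bit-parallel Levenshtein distance (Myers/Hyyrö style) for a pattern of at most
// 64 characters. Each text character is processed with a handful of word-sized
// bit operations instead of a full dynamic-programming row.
fn levenshtein_bitparallel(pattern: &str, text: &str) -> usize {
    let pattern: Vec<char> = pattern.chars().collect();
    let m = pattern.len();
    if m == 0 {
        return text.chars().count();
    }
    assert!(m <= 64, "this sketch only handles patterns up to 64 characters");

    // peq[c] has bit i set iff pattern[i] == c.
    let mut peq: HashMap<char, u64> = HashMap::new();
    for (i, &c) in pattern.iter().enumerate() {
        *peq.entry(c).or_insert(0) |= 1u64 << i;
    }

    let mask = 1u64 << (m - 1);
    let mut vp: u64 = if m == 64 { u64::MAX } else { (1u64 << m) - 1 };
    let mut vn: u64 = 0;
    let mut dist = m;

    for c in text.chars() {
        let pm = *peq.get(&c).unwrap_or(&0);
        // Carry propagation inside one word computes a whole column of the DP matrix.
        let d0 = ((pm & vp).wrapping_add(vp) ^ vp) | pm | vn;
        let mut hp = vn | !(d0 | vp);
        let mut hn = d0 & vp;
        if hp & mask != 0 {
            dist += 1;
        }
        if hn & mask != 0 {
            dist -= 1;
        }
        hp = (hp << 1) | 1;
        hn <<= 1;
        vp = hn | !(d0 | hp);
        vn = hp & d0;
    }
    dist
}

Longer patterns are handled by chaining ceil(M/64) such words, which is where the [M/64]*N term in the table comes from.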