Open brady-ds opened 5 years ago
I wouldn't call this a bug. It is clearly defined that editdistance uses the hash value to compare two objects for equality. I do not really see a better way to achieve the similarity check.
I agree that this is essentially a bug, though a design bug rather than an implementation bug. The natural design is to check for equality of the objects using ==, which would enable the OP's example to work. What would be the problem with checking for object equality rather than checking for hash code equality? (Are you just concerned about speed?)
Another way of looking at this is, what if you want to do edit distance but at the word level rather than the character level, so that you pass editdistance.eval() two sequences of string (str) instances. The eval() function should compare strings for equality, not for identical hash code. I don't believe all strings have unique hash codes, so your library currently appears unable to support computing word level edit distance on sequences of strings correctly.
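For concreteness, a small sketch of that word-level use case (the example sentences are made up; the call follows the eval(s1, s2) usage shown later in this thread):

import editdistance

words1 = "the cat sat on the mat".split()
words2 = "the dog sat on the mat".split()
# Word-level edit distance: each list element is a whole token.
# With ==-based comparison the answer is clearly 1 (one substitution);
# with hash-based comparison it is 1 only as long as none of the
# distinct words happen to share a hash value.
print(editdistance.eval(words1, words2))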
What would be the problem with checking for object equality rather than checking for hash code equality? (Are you just concerned about speed?)
Yes, this implementation has the goal of being very fast. E.g. for:
s1=["aaa"]*64
s2=["bab"]*64
editdistance.eval(s1, s2)
this would change the time complexity from O(N) to O(N^3). This is not a small performance regression, but makes the library unusable for many users.
Another way of looking at this is, what if you want to do edit distance but at the word level rather than the character level, so that you pass editdistance.eval() two sequences of string (str) instances. The eval() function should compare strings for equality, not for identical hash code. I don't believe all strings have unique hash codes, so your library currently appears unable to support computing word level edit distance on sequences of strings correctly.
This is a trade-off between performance and correctness. You are always able to provide your own hash implementation to reduce the probability of a hash collision on your dataset. Or, if performance is no concern for you, you can write a simple implementation of Wagner-Fischer for this. In this case this library is probably simply not the right choice for your application.
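For reference, a minimal sketch of the plain Wagner-Fischer fallback mentioned above, comparing elements with == instead of by hash (the function name and structure are illustrative, not part of editdistance):

def wagner_fischer(seq1, seq2):
    # Classic O(len(seq1) * len(seq2)) dynamic program over a rolling row;
    # elements are compared with ==, so equal-but-distinct objects and
    # colliding hashes are handled correctly.
    prev = list(range(len(seq2) + 1))
    for i, a in enumerate(seq1, 1):
        curr = [i]
        for j, b in enumerate(seq2, 1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

This is the exact but slow option being described: O(MN) element comparisons, each of which may inspect the elements in full.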
There are a few things there I don't understand, possibly due to my own ignorance. Edit distance is said to be O(MN), where M and N are the lengths of the sequences, so why wouldn't your time complexity be O(N^2) to start since in your example M = N = 64?

So far we've presumably been treating the average time to compare two sequence elements as essentially constant, and certainly unrelated to either M or N. Let's assume the string elements are all of equal length K. For string elements of length K (3 in your example of "aaa" and "bab"), naively it would take O(K) character comparisons to compare the strings. So for string elements of equal length wouldn't we have O(MNK) with the naive comparison?

But it seems to me that string equality checking is implemented by first checking the hash codes, and only if equal checking the individual characters one by one. The latter would very rarely be necessary, so there would not be a large performance impact. That's the point of a good hash code. You might say that you have to compute the hash codes in the first place, but that's only O((M+N)K), and I believe python will cache the hash codes for the strings, so the hashes only need to be computed once.
I do appreciate having the library, just thinking about whether it can be better.
There are a few things there I don't understand, possibly due to my own ignorance. Edit distance is said to be O(MN), where M and N are the lengths of the sequences, so why wouldn't your time complexity be O(N^2) to start since in your example M = N = 64?
editdistance uses a bit-parallel implementation which calculates the Levenshtein distance for 64 characters in parallel, so the time complexity is O(M/64 * N). For text lengths < 64 this is O(N). You can see this when benchmarking: here python-Levenshtein uses the O(NM) approach while the others are O(M/64 * N), which can be seen in the jump every 64 characters and otherwise linear growth.
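For readers unfamiliar with the technique, here is a minimal pure-Python sketch of a bit-parallel Levenshtein computation in the style of Myers' bit-vector algorithm, assuming the first sequence fits in one 64-bit word; it only illustrates the idea and is not the library's actual C++ implementation:

def bitparallel_levenshtein(pattern, text):
    # One pass over `text`; every step updates all positions of `pattern`
    # (up to 64 of them) at once using word-wide bitwise operations.
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1
    peq = {}                       # per value: bitmask of its positions in `pattern`
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv = mask, 0               # vertical +1 / -1 delta bits
    score = m
    last = 1 << (m - 1)            # bit tracking the bottom row of the DP matrix
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)
        mh = pv & xh
        if ph & last:
            score += 1
        if mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
    return score

Each element of `text` costs one dictionary lookup plus a constant number of bitwise operations on whole words, which is where the O(M/64 * N) behaviour comes from; as described above, the library performs the equivalent lookup on the elements' hash values.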
So for string elements of equal length wouldn't we have O(MNK) with the naive comparison?
Yes, I mostly wanted to state that it will be cubic runtime; it is O(MNK). After thinking about this again, it would indeed be possible to modify the existing implementation to support this and achieve O([N/64]MK) this way.
But it seems to me that string equality checking is implemented by first checking the hash codes, and only if equal checking the individual characters one by one. The latter would very rarely be necessary, so there would not be a large performance impact. That's the point of a good hash code.
I wouldn't say this is rare, since this has to be done both for hash collisions (which are rare) and for similar elements. E.g. for
s1=["aaa"]*64
s2=["aaa"]*64
editdistance.eval(s1, s2)
this has to be done for every single element. In addition, note that even though this equality comparison is only O(K), it has a large constant factor: it requires the implementation to call a Python function for each of the comparisons, which is expensive.
You might say that you have to compute the hash codes in the first place, but that's only O((M+N)K), and I believe python will cache the hash codes for the strings, so the hashes only need to be computed once.
This cost already has to be paid right now, so it does not really matter.
this has to be done for every single element. In addition, note that even though this equality comparison is only O(K), it has a large constant factor: it requires the implementation to call a Python function for each of the comparisons, which is expensive.
Maybe it is not as bad for strings, since at least strings can be interned by Python. So an implementation could use the following equal check:
equal = (hash1 == hash2) and (obj1 is obj2 or obj1 == obj2)
which would be pretty much free for interned strings.
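A quick illustration of why the `is` fast path in that check is essentially free for interned strings (variable names are illustrative; sys.intern is the standard way to intern a runtime-built string):

import sys

s1 = "aaa"                  # identifier-like literals are interned by CPython
s2 = "".join(["a", "aa"])   # built at runtime: equal to s1, but a distinct object
print(s2 == s1, s2 is s1)   # True False
s2 = sys.intern(s2)         # map it to the canonical interned object
print(s2 is s1)             # True: the identity check now succeeds
hash1, hash2, obj1, obj2 = hash(s1), hash(s2), s1, s2
equal = (hash1 == hash2) and (obj1 is obj2 or obj1 == obj2)
print(equal)                # True, decided without a character-by-character compare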
Edit: I forgot this is only done for strings known from the script and not for runtime-generated strings, which should be a lot more common in string matching. Still, it could make sense to change the API in the following way:
s1=["aaa"]*64
s2=["aaa"]*64
editdistance.eval(s1, s2)
# use hash + eq
editdistance.eval(editdistance.HashSequence(s1), editdistance.HashSequence(s2))
# directly work with hashes
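A rough sketch of what such a HashSequence wrapper could look like; the class is hypothetical (it is only a proposal in this thread, not an existing part of editdistance), and eval() would have to be taught to treat it as a sequence of precomputed hashes:

class HashSequence:
    # Hypothetical opt-in wrapper: hashes are computed once up front, and
    # eval() would compare these integers directly instead of calling
    # hash()/== on the original elements.
    def __init__(self, seq):
        self.hashes = [hash(item) for item in seq]

    def __len__(self):
        return len(self.hashes)

    def __getitem__(self, index):
        return self.hashes[index]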
That sounds plausible. Thanks for your consideration and for the information. I may come back to this later.
I wouldn't say this is rare, since this has to be done both for hash collisions (which are rare) and for similar elements.
I think your point is that for sequences whose edit distances are small relative to the sequence lengths, there will be a lot of hash collisions (from equal elements), so I was incorrect to say that is rare; good point. I think referring to "similar elements" is not a good description of the situation, though.
The following test fails because all of the Bug objects are incorrectly treated as equal. Note that the implementation of __hash__ here is perfectly valid; to quote from https://docs.python.org/3/reference/datamodel.html#object.__hash__, "The only required property is that objects which compare equal have the same hash value." In particular, there is no requirement that objects with the same hash value compare equal, as can also be demonstrated with even built-in types like int:
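For example, in CPython distinct integers can share a hash value:

print(hash(-1) == hash(-2))   # True: CPython hashes both -1 and -2 to -2
print(-1 == -2)               # False

As for the failing test mentioned at the start of this comment, a reconstruction of the kind of test being described (the class name Bug comes from the comment; the concrete values and the expected distance of 3 are illustrative assumptions) could look like this:

import editdistance

class Bug:
    # Perfectly valid under the Python data model: distinct, unequal
    # objects are allowed to share a hash value.
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 0

    def __eq__(self, other):
        return isinstance(other, Bug) and self.value == other.value

a = [Bug(1), Bug(2), Bug(3)]
b = [Bug(4), Bug(5), Bug(6)]
# No element of a equals any element of b, so the edit distance should be 3,
# but hash-based comparison treats every element as equal and reports 0,
# so this assertion fails with the current library.
assert editdistance.eval(a, b) == 3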