roy-ht / editdistance

Fast implementation of the edit distance (Levenshtein distance)
MIT License

Values with the same hash are treated as equal #37

Open brady-ds opened 5 years ago

brady-ds commented 5 years ago

The following test fails because all of the Bug objects are incorrectly treated as equal:

#! /usr/bin/env python3

import editdistance

class Bug(str):
    def __hash__(self):
        return 0

if __name__ == '__main__':
    expected = editdistance.eval(
        ('a', 'b'),
        ('b', 'c'),
    )
    actual = editdistance.eval(
        (Bug('a'), Bug('b')),
        (Bug('b'), Bug('c')),
    )
    assert expected == actual, f'{expected} ≠ {actual}'

Note that the implementation of __hash__ here is perfectly valid; to quote from https://docs.python.org/3/reference/datamodel.html#object.__hash__:

The only required property is that objects which compare equal have the same hash value.

In particular, there is no requirement that objects with the same hash value compare equal, as can be demonstrated even with built-in types like int:

>>> import sys
>>> a = 1 << sys.hash_info.width
>>> b = hash(a)
>>> a
18446744073709551616
>>> b
8
>>> hash(a) == hash(b)
True
>>> a == b
False
>>> 
maxbachmann commented 3 years ago

I wouldn't call this a bug. It is clearly defined that editdistance uses the hash value to compare two objects for equality. I do not really see a better way to achieve the similarity check.

rvsasseen commented 2 years ago

I agree that this is essentially a bug, though a design bug rather than an implementation bug. The natural design is to check for equality of the objects using ==, which would enable the OP's example to work. What would be the problem with checking for object equality rather than checking for hash code equality? (Are you just concerned about speed?)

Another way of looking at this is: what if you want to do edit distance at the word level rather than the character level, so that you pass editdistance.eval() two sequences of string (str) instances? The eval() function should compare strings for equality, not for identical hash code. I don't believe all strings have unique hash codes, so your library currently appears unable to support computing word-level edit distance on sequences of strings correctly.
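
Concretely, word-level usage would just look like this (illustrative only; the expected value of 2 assumes the elements are compared as == would compare them, i.e. barring a hash collision):

import editdistance

s1 = "the quick brown fox".split()
s2 = "the slow brown cat".split()

# Each list element (a word) counts as one symbol, so the word-level
# distance should be 2: "quick" -> "slow" and "fox" -> "cat".
print(editdistance.eval(s1, s2))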

maxbachmann commented 2 years ago

What would be the problem with checking for object equality rather than checking for hash code equality? (Are you just concerned about speed?)

Yes, this implementation has the goal of being very fast. E.g. for:

s1=["aaa"]*64
s2=["bab"]*64
editdistance.eval(s1, s2)

this would change the time complexity from O(N) to O(N^3). This is not a small performance regression; it would make the library unusable for many users.

Another way of looking at this is: what if you want to do edit distance at the word level rather than the character level, so that you pass editdistance.eval() two sequences of string (str) instances? The eval() function should compare strings for equality, not for identical hash code. I don't believe all strings have unique hash codes, so your library currently appears unable to support computing word-level edit distance on sequences of strings correctly.

This is a trade-off between performance and correctness. You are always able to provide your own hash implementation to reduce the probability of a hash collision on your dataset. Or, if performance is no concern for you, you can write a simple implementation of Wagner-Fischer for this. In this case this library is probably simply not the right choice for your application.
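
For reference, a plain Wagner-Fischer version that compares elements with == could look roughly like this (not part of this library, just a sketch of such a fallback):

def wagner_fischer(a, b):
    # Plain O(len(a) * len(b)) Levenshtein distance using == on the elements.
    prev = list(range(len(b) + 1))            # row for the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                            # cost of deleting the first i elements of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1         # equality check instead of hash comparison
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

# The example from the original report now gives the expected result:
assert wagner_fischer(("a", "b"), ("b", "c")) == 2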

rvsasseen commented 2 years ago

There are a few things there I don't understand, possibly due to my own ignorance. Edit distance is said to be O(MN), where M and N are the lengths of the sequences, so why wouldn't your time complexity be O(N^2) to start since in your example M = N = 64?

So far we've presumably been treating the average time to compare two sequence elements as essentially constant, and certainly unrelated to either M or N. Let's assume the string elements are all of equal length K. For string elements of length K (3 in your example of "aaa" and "bab"), naively it would take O(K) character comparisons to compare the strings. So for string elements of equal length wouldn't we have O(MNK) with the naive comparison?

But it seems to me that string equality checking is implemented by first checking the hash codes, and only if equal checking the individual characters one by one. The latter would very rarely be necessary, so there would not be a large performance impact. That's the point of a good hash code. You might say that you have to compute the hash codes in the first place, but that's only O((M+N)K), and I believe python will cache the hash codes for the strings, so the hashes only need to be computed once.

I do appreciate having the library, just thinking about whether it can be better.

maxbachmann commented 2 years ago

There are a few things there I don't understand, possibly due to my own ignorance. Edit distance is said to be O(MN), where M and N are the lengths of the sequences, so why wouldn't your time complexity be O(N^2) to start since in your example M = N = 64?

editdistance uses a bit-parallel implementation which calculates the Levenshtein distance for 64 characters in parallel. So the time complexity is O(M/64 * N). For text lengths < 64 this is O(N). You can see this when benchmarking:

Here python-Levenshtein uses the O(NM) approach while the others are O(M/64 * N), which can be seen in the jump every 64 characters and the otherwise linear growth.
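
Roughly, the single-word recurrence (Myers 1999, in Hyyrö's formulation) looks like this in Python. This is only a sketch of the idea for patterns of up to 64 symbols, not the library's actual C++ code, which additionally blocks longer patterns into multiple 64-bit words:

def levenshtein_bitparallel(a, b):
    # Bit-parallel Levenshtein distance for len(a) <= 64 (Myers/Hyyro).
    m = len(a)
    if m == 0:
        return len(b)
    assert m <= 64, "longer patterns need the blocked multi-word variant"
    peq = {}                       # per-symbol bitmask: bit i set if a[i] == c
    for i, c in enumerate(a):
        peq[c] = peq.get(c, 0) | (1 << i)
    mask = (1 << m) - 1
    top = 1 << (m - 1)
    vp, vn, dist = mask, 0, m
    for c in b:                    # one pass per symbol of b, O(1) word operations each
        eq = peq.get(c, 0)
        d0 = ((((eq & vp) + vp) & mask) ^ vp) | eq | vn
        hp = vn | (~(d0 | vp) & mask)
        hn = vp & d0
        if hp & top:
            dist += 1
        if hn & top:
            dist -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = hn | (~(d0 | hp) & mask)
        vn = hp & d0
    return dist

Presumably the per-symbol bitmasks (peq above) are built from the elements' hash values in the library, which is exactly where the hash-only comparison discussed in this issue comes in.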

So for string elements of equal length wouldn't we have O(MNK) with the naive comparison?

Yes, I mostly wanted to state that it will be cubic runtime. It is O(MNK). After thinking about this again, it would indeed be possible to modify the existing implementation to support this and achieve O(N/64 * M * K) this way.

But it seems to me that string equality checking is implemented by first checking the hash codes, and only if equal checking the individual characters one by one. The latter would very rarely be necessary, so there would not be a large performance impact. That's the point of a good hash code.

I wouldn't say this is rare, since this has to be done both for hash collisions (which are rare) and for similar elements. E.g. for

s1=["aaa"]*64
s2=["aaa"]*64
editdistance.eval(s1, s2)

this has to be done for every single element. In addition, note that even though this equality comparison is only O(K), it has a large constant factor: it requires the implementation to call a Python function for each of the comparisons, which is expensive.

You might say that you have to compute the hash codes in the first place, but that's only O((M+N)K), and I believe python will cache the hash codes for the strings, so the hashes only need to be computed once.

This cost already has to be paid right now, so it does not really matter.

maxbachmann commented 2 years ago

this has to be done for every single element. In addition, note that even though this equality comparison is only O(K), it has a large constant factor: it requires the implementation to call a Python function for each of the comparisons, which is expensive.

Maybe it is not as bad for strings, since at least strings can be interned by Python. So an implementation could use the following equal check:

equal = (hash1 == hash2) and (obj1 is obj2 or obj1 == obj2)

which would be pretty much free for interned strings.
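
For example (CPython behaviour, and interning is an implementation detail, so this is only illustrative):

import sys

a = "aaa"                       # identifier-like literals are typically interned at compile time
b = "aaa"
print(a is b)                   # usually True: the `is` fast path hits

c = "".join(["a", "a", "a"])    # runtime-generated string, not automatically interned
print(a is c)                   # usually False: would fall back to the == comparison
print(a is sys.intern(c))       # True: explicit interning returns the canonical object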

Edit: I forgot that this is only done for strings known from the script and not for runtime-generated strings, which should be a lot more common when string matching. Still, it could make sense to change the API in the following way:

s1=["aaa"]*64
s2=["aaa"]*64
editdistance.eval(s1, s2)
# use hash + eq

editdistance.eval(editdistance.HashSequence(s1), editdistance.HashSequence(s2))
# directly work with hashes
rvsasseen commented 2 years ago

That sounds plausible. Thanks for your consideration and for the information. I may come back to this later.

I wouldn't say this is rare, since this has to be done both for hash collision (which are rare) and for similar elements.

I think your point is that for sequences whose edit distances are small relative to the sequence lengths, there will be a lot of hash collisions (from equal elements), so I was incorrect to say that is rare; good point. I think referring to "similar elements" is not a good description of the situation, though.