rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.61k stars 116 forks source link

Modification of Hamming Distance #320

Closed Spill-Tea closed 1 year ago

Spill-Tea commented 1 year ago

By definition, Hamming distance should only evaluate substitutions. The Latest Modification to Hamming Distance function fundamentally changes the inherent definition, which normally assumes strings must be the same length, by now accepting insertions/deletions into a new not quite pseudo-Levenshtein distance metric.

Could these latest modifications instead define a new and separate String distance metric?

The modification in question can be found at this commit

maxbachmann commented 1 year ago

There are generally two way to handle strings of different length with the Hamming distance:

I thought that most of the time it is actually preferable to extend the shorter sequence with dummy characters, so it still delivers reasonable results. The argument for the exception would be that this could be an error in the user code, since the hamming distance is used with strings of different length.

Is there any other reason why you think the exception is preferable?

maxbachmann commented 1 year ago

I could add an extra argument to throw an exception / handle strings of different length instead.

Spill-Tea commented 1 year ago

I think thats a fair compromise - adding a boolean flag, like pad=true, which pads strings of unequal length if true, else raises value error. At first, I thought that might be better suited for the role of the string processor callable argument, but those function before comparison between input strings.

An example snippet

min_len, dist = sorted((len(s1), len(s2)))
if not pad and min_len != dist:
    raise ValueError("Strings Must be the Same Length")
maxbachmann commented 1 year ago

This is now implemented in the main branch:

>>> from rapidfuzz.distance import Hamming
>>> Hamming.distance("aa", "aaa")
1
>>> Hamming.distance("aa", "aaa", pad=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "distance/metrics_cpp.pyx", line 661, in rapidfuzz.distance.metrics_cpp_avx2.hamming_distance
ValueError: Sequences are not the same length.