jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License
2.07k stars 158 forks source link

damerau_levenshtein_distance calculates optimal string alignment distance instead of the actual damerau-levenshtein distance. #13

Closed samc0de closed 10 years ago

samc0de commented 10 years ago

In file damerau_levenshtein.c, the function is calculating wrong(OSA) distance.

RMGProjects commented 10 years ago

Thanks to samc0de for raising this. I think it was in response to an issue I raised on stackoverflow. Forgive me I have never raised an issue on gitHub so I am not sure of the etiquette protocol, but my suspisions were raised when I observed the following:

In [0]: jf.levenshtein_distance('ZX', 'XYZ')
Out[0]: 3
In [1]: jf.damerau_levenshtein_distance('BADC', 'ABCD')
Out[1]: 3

To my mind the output should follow the following rules:

In the first example: Step 1: ZX → XZ (transpose adjacent characters) Step 2: XZ → XYZ (insert 'X' character)

In the second example: Step 1: BACD → ABDC (transpose adjacent BA characters) Step 2 ABDC → ABCD (transpose adjacent DC characters)

And perhaps more peculiarly:

 In [3]: jf.damerau_levenshtein_distance('jellyifhs', 'jellyfish')
 Out[3]: 2
 In [4]: jf.damerau_levenshtein_distance('ifhs', 'fish')
 Out[4]: 3

Whis is particularly odd, as the number of edits should not only be two in both examples but they are exactly the same edits;

In the third example: Step 1: jellyifhs → jellyfihs (transpose adjacent characters 'i' and 'f') Step 2: jellyfihs → jellyfish (transpose adjacent characters 'h' and 's')

In the fourth example: Step 1: ifhs → fihs (transpose adjacent characters 'i' and 'f') Step 2: fihs → fish (transpose adjacent characters 'h' and 's')

jamesturk commented 10 years ago

C and Python functions were indeed OSA, not damerau, sorry for the trouble/delay