Closed samc0de closed 10 years ago
Thanks to samc0de for raising this. I think it was in response to an issue I raised on stackoverflow. Forgive me I have never raised an issue on gitHub so I am not sure of the etiquette protocol, but my suspisions were raised when I observed the following:
In [0]: jf.levenshtein_distance('ZX', 'XYZ')
Out[0]: 3
In [1]: jf.damerau_levenshtein_distance('BADC', 'ABCD')
Out[1]: 3
To my mind the output should follow the following rules:
In the first example: Step 1: ZX → XZ (transpose adjacent characters) Step 2: XZ → XYZ (insert 'X' character)
In the second example: Step 1: BACD → ABDC (transpose adjacent BA characters) Step 2 ABDC → ABCD (transpose adjacent DC characters)
And perhaps more peculiarly:
In [3]: jf.damerau_levenshtein_distance('jellyifhs', 'jellyfish')
Out[3]: 2
In [4]: jf.damerau_levenshtein_distance('ifhs', 'fish')
Out[4]: 3
Whis is particularly odd, as the number of edits should not only be two in both examples but they are exactly the same edits;
In the third example: Step 1: jellyifhs → jellyfihs (transpose adjacent characters 'i' and 'f') Step 2: jellyfihs → jellyfish (transpose adjacent characters 'h' and 's')
In the fourth example: Step 1: ifhs → fihs (transpose adjacent characters 'i' and 'f') Step 2: fihs → fish (transpose adjacent characters 'h' and 's')
C and Python functions were indeed OSA, not damerau, sorry for the trouble/delay
In file damerau_levenshtein.c, the function is calculating wrong(OSA) distance.