roy-ht / editdistance

Fast implementation of the edit distance (Levenshtein distance)
MIT License

Fix bug for long sequences (> 640) #103

Closed: boeddeker closed this 9 months ago

boeddeker commented 1 year ago

I used this package to compute the so-called concatenated minimum-permutation word error rate. This package is faster than kaldialign, but kaldialign also reports the individual insertion, deletion, and substitution counts.

I compared the distance values from both packages for long texts (several thousand words), and they were not the same. It turned out that this package switches to a different implementation (edit_distance_dp) when the number of words is larger than 640.

The code of edit_distance_dp contains a bug: the first value in the vector is not initialized.
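To illustrate why an uninitialized first value matters, here is a minimal Python sketch of a rolling-row dynamic-programming Levenshtein distance. It is not the package's actual code: the function name `edit_distance_two_rows` and the two-row layout are assumptions made for illustration. The point is that the first cell of each row must be set explicitly, otherwise it carries a stale value into the rest of the row.

```python
def edit_distance_two_rows(a, b):
    """Levenshtein distance using two rolling rows (illustrative sketch only)."""
    # Row for the empty prefix of `a`: turning [] into b[:j] costs j insertions.
    previous = list(range(len(b) + 1))
    current = [0] * (len(b) + 1)
    for i, item_a in enumerate(a, start=1):
        # First cell of the row: turning a[:i] into [] costs i deletions.
        # Leaving this cell unset would reuse whatever an earlier row stored here.
        current[0] = i
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current[j] = min(substitution, deletion, insertion)
        # Swap the buffers so the row just computed becomes the previous row.
        previous, current = current, previous
    return previous[len(b)]
```

The function accepts any two sequences whose items can be compared, for example `edit_distance_two_rows("kitten", "sitting")` or two lists of words.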

This PR contains a fix for the bug and adds tests for edit_distance_dp.

I had to expose edit_distance_dp to write a test.
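A test of this kind could look roughly like the following sketch, which compares a distance function against a simple full-matrix reference on random sequences whose lengths straddle the 640 threshold mentioned above. The names `reference_distance` and `find_mismatches` and the chosen lengths are hypothetical; they are not taken from the PR's actual test suite.

```python
import random


def reference_distance(a, b):
    """Full-matrix Levenshtein distance, kept simple to serve as a reference."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # i deletions to reach the empty sequence
    for j in range(cols):
        table[0][j] = j  # j insertions starting from the empty sequence
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitution or match
                table[i - 1][j] + 1,  # deletion
                table[i][j - 1] + 1,  # insertion
            )
    return table[-1][-1]


def find_mismatches(distance_fn, lengths=(10, 639, 640, 641), seed=0):
    """Return the sequence lengths at which distance_fn disagrees with the reference."""
    rng = random.Random(seed)
    mismatches = []
    for n in lengths:
        a = [rng.randrange(50) for _ in range(n)]
        b = [rng.randrange(50) for _ in range(n)]
        if distance_fn(a, b) != reference_distance(a, b):
            mismatches.append(n)
    return mismatches
```

Any callable that takes two sequences can be passed as `distance_fn`, for instance `editdistance.eval` if the package is installed; an empty result means no disagreement was found at the tested lengths.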

thequilo commented 1 year ago

This bug was introduced in #39