Open OlivierBinette opened 2 years ago
Hi Olivier,
I'd be in favor of this. Keeping the distance matrix in memory is only useful if one is interested in recording the optimal sequence of edit operations. But here, we're only interested in the distance (the last entry in the matrix).
One thing to be mindful of, is that the Levenshtein distance is related to other edit-based distances in the package through class inheritance. The inheritance diagram is as follows: DamerauLevenshtein -> OSA -> Levenshtein -> LCS.
Since all of these classes currently rely on the full distance matrix, they would also be impacted by your proposal. I believe it's possible to implement LCS and Levenshtein by keeping 2 rows in memory, and OSA by keeping 3 rows in memory. For DamerauLevenshtein, it may be necessary to keep all rows in memory (need to check this).
Would you be happy to submit a pull request?
You might be interested in this: https://web.archive.org/web/20180612143641/https://bitbucket.org/clearer/iosifovich/
@ngmarchant thanks for the link! I'll look into it for further optimization, and then see if I can make a pull request.
For reference, here's the optimization I could get by reducing the memory usage a bit more:
def levenshtein(s, t):
m = len(s)
n = len(t)
buffer = np.arange(max(n, m) + 1)
p = m
for j in range(1, n+1):
temp = j-1
p = j
for i in range(1, m+1):
p = min(p + 1, buffer[i] + 1, temp + (s[i-1] != t[j-1]))
temp = buffer[i]
buffer[i] = p
return p
@ngmarchant the Levenshtein distance can be implemented using only two rows for
dmat
, instead of using a square matrix. That could significantly reduce memory usage when comparing long sequences (400 Mb to 80 Kb when comparing strings of length 10,000).Would it be worth it to implement this? I could propose the changes.
Example Python implementation: