Reduce memory usage for edit distances?

ngmarchant / comparator

Similarity and distance measures for clustering and record linkage applications in R

GNU General Public License v2.0

17 stars 0 forks source link

Reduce memory usage for edit distances? #7

Open OlivierBinette opened 2 years ago

OlivierBinette commented 2 years ago

@ngmarchant the Levenshtein distance can be implemented using only two rows for dmat, instead of using a square matrix. That could significantly reduce memory usage when comparing long sequences (400 Mb to 80 Kb when comparing strings of length 10,000).

Would it be worth it to implement this? I could propose the changes.

Example Python implementation:

import numpy as np
dmat = np.zeros((100,2))

def levenshtein(s, t, dmat):
    m = len(s)
    n = len(t)
    dmat[:, 0] = np.arange(dmat.shape[0])

    for j in range(1, n+1):
        dmat[0, (j-1) % 2] = j-1
        dmat[0, j % 2] = j
        for i in range(1, m+1):
            cost = 0
            if s[i-1] != t[j-1]:
                cost = 1
            dmat[i, j % 2] = min(dmat[i-1, j % 2] + 1, dmat[i, (j-1) % 2] +
                                 1, dmat[i-1, (j-1) % 2] + cost)
    return dmat[m, n % 2]

levenshtein("test", "testt", dmat)

ngmarchant commented 2 years ago

Hi Olivier,

I'd be in favor of this. Keeping the distance matrix in memory is only useful if one is interested in recording the optimal sequence of edit operations. But here, we're only interested in the distance (the last entry in the matrix).

One thing to be mindful of, is that the Levenshtein distance is related to other edit-based distances in the package through class inheritance. The inheritance diagram is as follows: DamerauLevenshtein -> OSA -> Levenshtein -> LCS.

Since all of these classes currently rely on the full distance matrix, they would also be impacted by your proposal. I believe it's possible to implement LCS and Levenshtein by keeping 2 rows in memory, and OSA by keeping 3 rows in memory. For DamerauLevenshtein, it may be necessary to keep all rows in memory (need to check this).

Would you be happy to submit a pull request?

ngmarchant commented 2 years ago

You might be interested in this: https://web.archive.org/web/20180612143641/https://bitbucket.org/clearer/iosifovich/

OlivierBinette commented 2 years ago

@ngmarchant thanks for the link! I'll look into it for further optimization, and then see if I can make a pull request.

OlivierBinette commented 2 years ago

For reference, here's the optimization I could get by reducing the memory usage a bit more:

def levenshtein(s, t):
    m = len(s)
    n = len(t)
    buffer = np.arange(max(n, m) + 1)

    p = m
    for j in range(1, n+1):
        temp = j-1
        p = j
        for i in range(1, m+1):
            p = min(p + 1, buffer[i] + 1, temp + (s[i-1] != t[j-1]))
            temp = buffer[i]
            buffer[i] = p
    return p