infoscout / weighted-levenshtein

Weighted Levenshtein library
MIT License
105 stars 26 forks source link

Showing a trace of the operations #4

Open JeroenvdV opened 7 years ago

JeroenvdV commented 7 years ago

Hi, thanks for this implementation, it calculates the distance that I need as I need to have custom substitution costs, but also have relatively expensive (but fixed) insertion and deletion costs.

However, I also need to know what operations take place to arrive at the costs (i.e. an order of operations to go from string 1 to string 2, such as "SMMMMM" for HANANA->BANANA). I need this in order to match elements from one sequence to equal or substituted elements in the other sequence, which in my application has metadata. If multiple are possible, the first one will do.

The backtracing algorithms I've found online no longer work when substitution is cheaper than insertion/deletion, and I don't quite understand why. I figure it's simpler to keep track of the operations throughout the execution of the algorithm. However, I haven't been able to produce an output in that way either that works well with custom weights.

I would greatly appreciate any insight into how this could be achieved or why it couldn't work.

smeylan commented 6 years ago

I wanted to file the same feature request: I was hoping for something like Levenshtein.editops (unfortunately that library does not handle weighted costs).

LaurinmyReha commented 2 years ago

@smeylan looking exactly for the same thing! have you found something to solve your problem?

taylorpetty commented 2 years ago

I'm looking for the same thing, or the output matrix at the very least. There are tutorials and StackOverflow answers that talk about how to get from the matrix to the edit path, but I can't figure out how to get the matrix itself from the lev() function in this package.

toka commented 2 years ago

To my understanding there are two problems:

a) There is no clear solution at equal cost for multiple paths.

b) There is no memory about which path was choosen. Decision logic is inside the multiple min(....)-statements in https://github.com/infoscout/weighted-levenshtein/blob/master/weighted_levenshtein/clev.pyx#L353-L373 On would have to write a special min function, which remembers the choosen operation in a separate 2d array.

taylorpetty commented 2 years ago

To my understanding there are two problems:

a) There is no clear solution at equal cost for multiple paths.

b) There is no memory about which path was choosen. Decision logic is inside the multiple min(....)-statements in https://github.com/infoscout/weighted-levenshtein/blob/master/weighted_levenshtein/clev.pyx#L353-L373 On would have to write a special min function, which remembers the choosen operation in a separate 2d array.

a) That's correct. The minimal path is not necessarily unique.

b) The edit path should be constructable from the dynamic programming array itself, since the min(...) function is reflected in the array (and isn't necessarily unique). The links in my previous post outline how it should work.

taylorpetty commented 2 years ago

Below is a function I wrote to intake a cost matrix and trace it back. The path is non-unique and, depending on the costs input into the original distance, will likely have several choices to make. The order in which these are made is arbitrary, and can be re-ordered by preference.

This code is @njit-able, which is why these edittodigit functions are in place. They intake a string and output a digit, since @njit didn't like arrays of strings.

Now all we need is the lev function in the faction output the entire dynamic programming array instead of just the bottom right corner, and we're golden.

@njit
def editpath(cost):
    """Return an edit path for Levenshtein based on cost matrix alone."""

    row = cost.shape[0]-1
    col = cost.shape[1]-1

    path = []

    while row >= 0 and col >= 0 and row + col >= 1:

        #if we're away from the top row and leftmost column:
        if row > 0 and col > 0:

            current = cost[row][col]
            icost = cost[row][col-1]
            dcost = cost[row-1][col]
            scost = cost[row-1][col-1]

            #SNP: jump a square up and to the left
            if scost <= min(dcost, icost):
                if scost == current: 
                    pathchar = 'n'
                else:
                    pathchar = 's'
                row -= 1
                col -= 1

            #del: jump up row 
            elif row > 0 and dcost <= min(scost, icost):
                pathchar = 'd'
                row -= 1

            #ins: jump left a column
            elif col > 0 and icost <= min(scost, dcost):
                pathchar = 'i'
                col -= 1

        #if we've reached the top row:
        elif row == 0:

            #insert the rest of the way
            pathchar = 'i'
            col -= 1

        #if we've reached the leftmost column:
        elif col == 0:

            #delete the rest of the way
            pathchar = 'd'
            row -= 1

        #tally the edit operation    
        path.append(edittodigit(pathchar))

    path = np.array(path)

    #reverse the order if you want to read edits beginning to end
    return path[::-1]