ostap / comp

a tool for querying files in various formats
MIT License
43 stars 2 forks source link

improving performance of the fuzzy function by having a state #35

Open julochrobak opened 10 years ago

julochrobak commented 10 years ago

This improves the performance of the fuzzy function by avoiding allocation of the matrix on each call. In other words, there is a matrix allocated on the first call and reused on the following calls. If bigger matrix is required it is recreated.

In order to allow parallel execution we cannot use a single global variable to store the matrix. I've introduced a State property to the Func structure which allows every function to have a state and reuse it among the calls.

Another small performance improvement is to use own min/max function and avoid the built-in Min/Max on floats.

ostap commented 8 years ago

As tempting as it looks, I would rather not introduce the State property for the sake of performance. It has been a while, but if you are still interested in optimizing the approximate string matching, there is a different way to approach it. Instead of using len(s)*len(t) space to store the matrix D, you can do away with just 2 * min(len(s), len(t)) by keeping only two rows or columns (whichever is shorter). Here is the main loop over the matrix:

for j := 1; j < n; j++ {
        for i := 1; i < m; i++ {
                if s[i-1] == t[j-1] {
                        d[i][j] = d[i-1][j-1] // no operation required
                } else {
                        d[i][j] = min(
                                d[i-1][j]+1,   // a deletion
                                d[i][j-1]+1,   // an insertion
                                d[i-1][j-1]+1) // a substitution
                }
        }
}

Note that the indexes only go back as far as i-1 or j-1.