schwilklab / taxon-name-utils

Code and data for plant name synonym expansion and name matching
MIT License

Levenshtein distance speed. python-Levenshtein faster than fuzzywuzzy code. #3

Closed. dschwilk closed this issue 10 years ago.

dschwilk commented 10 years ago

The pure Levenshtein distance (import Levenshtein) is much faster than fuzzywuzzy. That makes sense: fuzzywuzzy is built on top of it.

So if we want the actual strings returned, this function

import Levenshtein

def fuzzy_match_levenshtein(pattern, x, threshold=2):
    return [i for i in x if Levenshtein.distance(pattern, i) <= threshold]

is much faster.
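
(For reference, the fuzzywuzzy-based fuzzy_match it is timed against below is not reproduced in this thread; a minimal sketch of what such a matcher might look like, with fuzz.ratio and the 90% cutoff purely as assumptions rather than the repository's actual settings:)

from fuzzywuzzy import fuzz

def fuzzy_match(pattern, x, min_ratio=90):
    # hypothetical stand-in: keep strings whose similarity ratio meets the cutoff
    return [i for i in x if fuzz.ratio(pattern, i) >= min_ratio]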

from timeit import Timer
import codecs

gbif_names = [line.strip() for line in codecs.open("../data/name-lists/gbif-occurrences-names.txt", "r", "utf-8")]

a1 = Timer("""r = fuzzy_match_levenshtein(u"Mahonia trifoliata", gbif_names)""", """from __main__ import gbif_names, fuzzy_match_levenshtein""")
a2 = Timer("""r = fuzzy_match(u"Mahonia trifoliata", gbif_names)""", """from __main__ import fuzzy_match, gbif_names""")

print(a1.timeit(number=3))
# 1.00459909439 seconds for 3 runs with fuzzy_match_levenshtein
print(a2.timeit(number=3))
# 87.8827390671 seconds for 3 runs with fuzzy_match
dschwilk commented 10 years ago

Oops, I forgot the results using fuzzy_match( ... quick=True):

print(a3.timeit(number=3))
# 21.6809160709

But that first pass with difflib could be used as a pre-filter in any algorithm.
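
A minimal sketch of what that difflib first pass might look like as a pre-filter (difflib.get_close_matches is standard library; the candidate count and cutoff here are illustrative assumptions, not the repository's settings):

import difflib
import Levenshtein

def fuzzy_match_quick(pattern, x, threshold=2, n_candidates=50, cutoff=0.8):
    # first pass: cheap difflib shortlist of plausible matches
    candidates = difflib.get_close_matches(pattern, x, n=n_candidates, cutoff=cutoff)
    # second pass: exact edit-distance filter on the shortlist
    return [i for i in candidates if Levenshtein.distance(pattern, i) <= threshold]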

willpearse commented 10 years ago

This is great to read. Thanks!

Will

dschwilk commented 10 years ago

OK, so I will start on this using the original Levenshtein distance code for matching. I'm working on code in a local branch right now and should have something ready for a pull request soon.

Note that the difflib step adds time when used with the fast Levenshtein code, since it is about 20 times slower than the Levenshtein distance calculation itself. So I plan on removing that "quick" option.

-D

willpearse commented 10 years ago

...I hadn't checked the two against each other yet; I'd simply put some code I'd written next to yours. Your method is so much faster than mine that I don't even see any point in keeping it!

Will

dschwilk commented 10 years ago

There is still room for optimization, but it is in the area of fast automata construction. Closed with d601a0f.
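
For illustration only (this is not the code in d601a0f): the automaton-style idea is to avoid scoring every name independently and instead walk all names at once through a trie, carrying one dynamic-programming row per node and pruning any branch whose best possible distance already exceeds the threshold. A minimal sketch:

def build_trie(words):
    # nested-dict trie; "$" marks a complete word and stores the full string
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root

def trie_fuzzy_match(pattern, trie, threshold=2):
    results = []
    first_row = list(range(len(pattern) + 1))  # distance from "" to each pattern prefix

    def walk(node, prev_row):
        for ch, child in node.items():
            if ch == "$":
                if prev_row[-1] <= threshold:
                    results.append(child)
                continue
            row = [prev_row[0] + 1]
            for i, pc in enumerate(pattern, start=1):
                row.append(min(row[i - 1] + 1,                 # insertion
                               prev_row[i] + 1,                # deletion
                               prev_row[i - 1] + (pc != ch)))  # substitution or match
            if min(row) <= threshold:  # prune branches that can never get back under the threshold
                walk(child, row)

    walk(trie, first_row)
    return results

The trie is built once up front (build_trie(gbif_names)); each query should then touch only a small fraction of the name list.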