Use a Trie data structure to improve performance for Levenshtein distances

dschwilk commented 10 years ago

So computing the distance between a word in badnames and every word in goodnames is slow and there is a lot of repeated effort. I read a few papers on algorithms for storing partial results in such searches and it looks like using a trie data structure makes sense for us. And there is even an implementation to borrow.

I think I'll work on this next.

dschwilk commented 10 years ago

And another option is to use a Levenshtein automata. solved for python: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata.

But my quick tests show that using a trie is really fast (even when the Levenshtein distance is coded in python).

Update: using an automata, I'm down to about 0.068s per name in dlist (searching against elist). It might be faster to swap that around but then we need to keep track of used names. at 0.07s a name the full tanknames->gbif lookup will run 6 hours on my machine. That is a big improvement but I'm still working on it.

dschwilk commented 10 years ago

OK, I have a full working matching system using Levneshtein automata for the first pass and Jaro-Winkler similarity for the second. See gbif_lookup in d601a0f. It needs to be cleaned up and generalized. And it may need tuning. But it is fast enough 310K gbif names - > 200K tanknames in a couple of hours. But it needs cleaning before merging to master. Example of matches below. Note a couple of false positives such as Juncus alpinus, which should not have been hit as J. alpigenus and J. alpinus are both accepted, but alpigenus was not in the expanded list. Might avoid with slightly higher jw threshold as that case was right near 0.95. Working pretty well though!

                         DLIST                      ELIST
3            Castalia gibertii         Castalia gilbertii
11     Pycnanthemum virginicum   Pycnanthemum virginianum
13      Pharmacosycea brittoni    Pharmacosycea brittonii
15      Basilicum polystachion     Basilicum polystachyon
27    Byttneria carthagenensis     Byttneria carthagensis
58            Juncus alpigenus             Juncus alpinus
59   Juncus alpino-articulatus   Juncus alpinoarticulatus
71          Juncus brevistylus         Juncus brevistilus
105        Juncus macrantherus          Juncus macranthus
107       Juncus microcephalus       Juncus macrocephalus
120           Juncus pyrenaeus          Juncus pyrenaicus
140        Nebelia fragariodes       Nebelia fragarioides
148       Valeriana auriculata         Valeriana auricula
166       Valeriana gracilipes         Valeriana gracilis
168        Valeriana hieronymi       Valeriana hieronymii
197          Valeriana spicata            Valeriana spica
204       Hippophaë rhamnoides       Hippophae rhamnoides
274         Parnassia laxmanni        Parnassia laxmannii
318  Epilobium billardiereanum  Epilobium billardierianum
356      Epilobium himalayense        Epilobium himalense
361       Epilobium hornemanni      Epilobium hornemannii
387     Epilobium nerterioides      Epilobium nerteroides
389   Epilobium novo-mexicanum    Epilobium novomexicanum
390  Epilobium nummularifolium Epilobium nummulariifolium
412          Epilobium rigidum         Epilobium frigidum
454 Sciadophyllum samydifolium Sciodaphyllum samydifolium
481      Callicarpa gracilipes        Callicarpa gracilis
553        Roella amplexicaule       Roella amplexicaulis
559            Ficus acrocarpa           Ficus macrocarpa
567              Ficus ampelos              Ficus ampelas

schwilklab / taxon-name-utils

Use a Trie data structure to improve performance for Levenshtein distances #7