Closed dschwilk closed 10 years ago
And another option is to use a Levenshtein automata. solved for python: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata.
But my quick tests show that using a trie is really fast (even when the Levenshtein distance is coded in python).
Update: using an automata, I'm down to about 0.068s per name in dlist (searching against elist). It might be faster to swap that around but then we need to keep track of used names. at 0.07s a name the full tanknames->gbif lookup will run 6 hours on my machine. That is a big improvement but I'm still working on it.
OK, I have a full working matching system using Levneshtein automata for the first pass and Jaro-Winkler similarity for the second. See gbif_lookup in d601a0f. It needs to be cleaned up and generalized. And it may need tuning. But it is fast enough 310K gbif names - > 200K tanknames in a couple of hours. But it needs cleaning before merging to master. Example of matches below. Note a couple of false positives such as Juncus alpinus, which should not have been hit as J. alpigenus and J. alpinus are both accepted, but alpigenus was not in the expanded list. Might avoid with slightly higher jw threshold as that case was right near 0.95. Working pretty well though!
DLIST ELIST
3 Castalia gibertii Castalia gilbertii
11 Pycnanthemum virginicum Pycnanthemum virginianum
13 Pharmacosycea brittoni Pharmacosycea brittonii
15 Basilicum polystachion Basilicum polystachyon
27 Byttneria carthagenensis Byttneria carthagensis
58 Juncus alpigenus Juncus alpinus
59 Juncus alpino-articulatus Juncus alpinoarticulatus
71 Juncus brevistylus Juncus brevistilus
105 Juncus macrantherus Juncus macranthus
107 Juncus microcephalus Juncus macrocephalus
120 Juncus pyrenaeus Juncus pyrenaicus
140 Nebelia fragariodes Nebelia fragarioides
148 Valeriana auriculata Valeriana auricula
166 Valeriana gracilipes Valeriana gracilis
168 Valeriana hieronymi Valeriana hieronymii
197 Valeriana spicata Valeriana spica
204 Hippophaë rhamnoides Hippophae rhamnoides
274 Parnassia laxmanni Parnassia laxmannii
318 Epilobium billardiereanum Epilobium billardierianum
356 Epilobium himalayense Epilobium himalense
361 Epilobium hornemanni Epilobium hornemannii
387 Epilobium nerterioides Epilobium nerteroides
389 Epilobium novo-mexicanum Epilobium novomexicanum
390 Epilobium nummularifolium Epilobium nummulariifolium
412 Epilobium rigidum Epilobium frigidum
454 Sciadophyllum samydifolium Sciodaphyllum samydifolium
481 Callicarpa gracilipes Callicarpa gracilis
553 Roella amplexicaule Roella amplexicaulis
559 Ficus acrocarpa Ficus macrocarpa
567 Ficus ampelos Ficus ampelas
So computing the distance between a word in badnames and every word in goodnames is slow and there is a lot of repeated effort. I read a few papers on algorithms for storing partial results in such searches and it looks like using a trie data structure makes sense for us. And there is even an implementation to borrow.
I think I'll work on this next.