from schwilk-work: Clean names matching, proof of concept matching script

schwilklab / taxon-name-utils

Code and data for plant name synonym expansion and name matching

MIT License


dschwilk closed 10 years ago

dschwilk commented 10 years ago

Make a fuzzy_match.py module that uses python-Levenshtein. TODO: remove the unneeded fuzzywuzzy import and related code. fuzzywuzzy has some nice higher-level string matching and cleaning operations, but they add far too much overhead for our big lists.
Write a first-pass name matching script that creates a lookup table from the expanded tanktree names list -> GBIF names. See the commit message for TODOs.
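
For readers without the commit at hand, here is a rough sketch of what a first-pass lookup table could look like. It is not the code in fuzzy_match.py: `edit_distance` is a pure-Python stand-in for the distance function that python-Levenshtein provides, and `build_lookup`, its arguments, and the `max_dist` threshold are hypothetical names chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (stdlib-only stand-in)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def build_lookup(expanded_names, gbif_names, max_dist=2):
    """Map each expanded name to its closest GBIF name within max_dist edits.

    Hypothetical helper: the threshold of 2 is an assumption, not a value
    taken from the repository.
    """
    lookup = {}
    if not gbif_names:
        return lookup
    for name in expanded_names:
        best = min(gbif_names, key=lambda g: edit_distance(name, g))
        if edit_distance(name, best) <= max_dist:
            lookup[name] = best
    return lookup
```

This compares every expanded name against every GBIF name, so the cost grows with the product of the two list sizes. That is the reason a fast C implementation of the distance matters for big lists.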

dschwilk commented 10 years ago

Improvements needed:

Remove anything in dlist after it is hit. (Maybe do the fuzzy search in the other direction: for every name in dlist, find the BEST match in elist. Yup, that is probably more efficient.) We could start with exact matches.
How to deal with subsp., var., etc.? Just drop those and assume that the synonym expansion hit the binomial possibilities?
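
A minimal sketch of how these two ideas might fit together, not the fix that actually landed: infraspecific ranks are stripped first, exact matches are taken next, and only the leftovers go through a fuzzy search. It assumes dlist holds the names to be matched and elist the candidate names. `to_binomial`, `match_names`, the rank list in `RANK_RE`, and the `cutoff` value are all hypothetical, and the stdlib `difflib.get_close_matches` (a similarity ratio, not Levenshtein distance) stands in for the fuzzy step.

```python
import difflib
import re

# Rank markers dropped together with everything after them (assumed list).
RANK_RE = re.compile(r"\s+(subsp\.|ssp\.|var\.)\s+\S+.*$")


def to_binomial(name):
    """Drop 'subsp. x', 'var. y', etc., keeping only genus and species."""
    return RANK_RE.sub("", name.strip())


def match_names(dlist, elist, cutoff=0.9):
    """For each name in dlist, find its best match in elist.

    Exact matches are resolved first with a set lookup; only names
    without an exact hit go through the slower fuzzy search.
    """
    elist_set = set(elist)
    matches = {}
    unmatched = []
    for name in dlist:
        binomial = to_binomial(name)
        if binomial in elist_set:
            matches[name] = binomial            # cheap exact hit
        else:
            unmatched.append((name, binomial))  # defer to fuzzy pass
    for name, binomial in unmatched:
        close = difflib.get_close_matches(binomial, elist, n=1, cutoff=cutoff)
        if close:
            matches[name] = close[0]
    return matches
```

Because each dlist name is handled exactly once, nothing has to be removed from dlist after a hit, and the exact-match pass keeps the expensive comparisons to the names that need them.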

Edit: Both points fixed in bb39c6a