schwilklab / taxon-name-utils

Code and data for plant name synonym expansion and name matching
MIT License

TPL vs GBIF: encodings and var. #4

Closed dschwilk closed 10 years ago

dschwilk commented 10 years ago

Note that The Plant List (TPL) does not have names with umlauts or accents. This is not a problem with Beth's (@ejforrestel) scraping; it is just how they have set up the standardized database. Historically, German ö was represented as "oe" in plant names, but ü and ë are often used in databases and do show up in GBIF.

In gbif_occurrence_names.txt, there are both ü and ë, and the file is in UTF-8. No problem, all my code is Unicode-aware, but "Leucothoë bahiensis" will be distance 1.00 from "Leucothoe bahiensis" in TPL.

So, no big deal: we could rely on the fuzzy matching, or we could do special conversions. We are probably OK just fuzzy matching at a threshold distance of 2, as that allows one umlaut and one other misspelling and will still hit, e.g. "Pomatocalpa künstleria" vs "Pomatocalpa kunstleri".
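For reference, here is a minimal sketch of both options. It assumes plain Levenshtein distance (the matcher in this repo may weight edits differently), and `strip_diacritics` and `levenshtein` are hypothetical helper names, not functions from the codebase. Note that stripping diacritics maps ö to "o", not to the historical "oe", so that case would still need fuzzy matching or a separate rule.

```python
import unicodedata


def strip_diacritics(name):
    """Remove combining marks, e.g. 'ë' -> 'e' and 'ü' -> 'u'."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a, b):
    """Plain edit distance: insertions, deletions, substitutions all cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Option 1: rely on fuzzy matching alone.
print(levenshtein("Leucothoë bahiensis", "Leucothoe bahiensis"))        # 1
print(levenshtein("Pomatocalpa künstleria", "Pomatocalpa kunstleri"))   # 2

# Option 2: convert first, which leaves more of the threshold for real typos.
print(levenshtein(strip_diacritics("Pomatocalpa künstleria"),
                  "Pomatocalpa kunstleri"))                             # 1
```

With the conversion applied first, a threshold of 2 would tolerate two genuine misspellings instead of one umlaut plus one misspelling.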

dschwilk commented 10 years ago

Oh, and the second point of this note: GBIF names lack "var."

So we can omit any var. names from the expanded names list (elist) before searching.

All the GBIF names are strict binomials, with two exceptions: "Cyclamen alpinum cv." and "Spiraea intermedia cv."

We can throw away the "cultivar" / cv. designation.
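A rough sketch of that cleanup, assuming both lists are plain Python lists of name strings. `drop_varieties` and `strip_cultivar` are hypothetical names for illustration, and the sample variety name is only there to exercise the filter.

```python
import re

# "var." as a standalone rank marker anywhere in the name.
_VAR = re.compile(r"\bvar\.")
# A trailing " cv." designation at the end of the name.
_CV = re.compile(r"\s+cv\.\s*$")


def drop_varieties(names):
    """Return only the names that carry no 'var.' rank marker."""
    return [n for n in names if not _VAR.search(n)]


def strip_cultivar(name):
    """Remove a trailing 'cv.' so the name becomes a strict binomial."""
    return _CV.sub("", name)


# Illustrative stand-in for the expanded names list.
elist = ["Acer rubrum", "Acer rubrum var. drummondii"]
print(drop_varieties(elist))

# The two non-binomial GBIF names mentioned above.
gbif_names = ["Cyclamen alpinum cv.", "Spiraea intermedia cv."]
print([strip_cultivar(n) for n in gbif_names])
```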