ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

Consider a "clean" name column for matching against #30

Closed cboettig closed 4 years ago

cboettig commented 5 years ago

@karinorman & @sckott

Our current strategy matches requested names directly against the scientificName field, which holds whatever name we get from the original names provider. I'm thinking we should create a new column of 'cleaned' names that would not necessarily be part of the actual schema, but would be easier to match against in joins. The trick is to do this in a way that stays true to the original database and doesn't implicitly introduce hidden assumptions.

For example:

In the lowercase branch, I've added a step which first does a mutate(input = tolower(scientificName)) on the stored database, simply to create a lowercase version of every scientificName (at any rank), so that we can do case-insensitive joins by also lowercasing the input query before running the join. Currently this is done when ids() is called, so it doesn't involve altering the Darwin Core records from data-raw. That adds unnecessary computational overhead, though only a second or so, since mutate with tolower is actually pretty fast here: when the data is in an external database, dplyr translates tolower into the SQL LOWER() function rather than applying the base R function.
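As a rough illustration of the lowercase-column idea, here is a minimal base R sketch. The toy table, the TOY: identifiers, and the use of merge() on an in-memory data frame are assumptions made for illustration; the package itself does this with dplyr against the database.

```r
# Toy stand-in for a provider table; the IDs are made up for this example.
provider <- data.frame(
  taxonID = c("TOY:1", "TOY:2"),
  scientificName = c("Poa annua", "Chondria tenuissima"),
  stringsAsFactors = FALSE
)

# Add a lowercased matching column next to the untouched scientificName.
provider$input <- tolower(provider$scientificName)

# Lowercase the query names the same way before joining.
query <- data.frame(
  input = tolower(c("poa ANNUA", "chondria Tenuissima")),
  stringsAsFactors = FALSE
)

# Left join on the lowercased column; scientificName keeps its original case.
matched <- merge(query, provider, by = "input", all.x = TRUE)
print(matched[, c("input", "scientificName", "taxonID")])
```

The original scientificName is never modified, so the only assumption added to the match is case-insensitivity.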

This makes joins case-insensitive, but there are still other cases where we want a join to succeed and it doesn't. For instance, OTT has the synonym Chondria tenuissima (Withering) C.Agardh, 1817, which means that Chondria tenuissima fails to return a match. However, just hacking off anything after the first two words might implicitly introduce taxonomic assumptions that are not guaranteed (compared to being case-insensitive, which seems like a safer assumption).

The function clean_names() is intended to be applied to input names, and it can optionally do things like binomial-ize names to make this matching easier, but to actually get matches the same cleaning should probably be applied to the names stored in the database as well (as an additional column). Opening this for us to think more about this issue.
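To make the trade-off concrete, here is a sketch of the "first two words" truncation. The helper first_two_words() is hypothetical and is not the implementation of clean_names(); it only shows what is kept and what is silently dropped.

```r
# Hypothetical helper (not taxadb's clean_names()): keep only the first two
# whitespace-separated words of each name.
first_two_words <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")
  vapply(parts, function(p) paste(head(p, 2), collapse = " "), character(1))
}

names_in <- c(
  "Chondria tenuissima (Withering) C.Agardh, 1817",
  "Poa annua ssp. annua Smith 1912"
)

# The authority is dropped from the first name, which is what we want here,
# but the infraspecific epithet is dropped from the second one as well.
print(first_two_words(names_in))
```

Applying the same function to both the query and a stored copy of the provider names would let the two sides meet, at the cost of treating every infraspecific name as its parent species.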

sckott commented 5 years ago

a clean column seems like a good idea in general. lowercasing seems safe. just using first two words will be a problem when matching anything with subspecific ranks, etc. And then there's unicode

cboettig commented 5 years ago

#33 adds mutate_db support for generic operations

Note that in GBIF we already do:

https://github.com/cboettig/taxadb/blob/413e6cf6e388e8015b15e72969100f48422047a4/data-raw/gbif.R#L33

We should really be preserving scientificName there, and parsing it into authority + canonicalName. I wonder if GBIF is doing that manually or by script? @sckott?

sckott commented 5 years ago

it's possible they use https://www.gbif.org/developer/species#parser but I don't know for sure

cboettig commented 5 years ago

@sckott thanks, that does look promising... have you ever tried to track the source code down for that?

sckott commented 5 years ago

no, but it's almost 100% going to be Java since they are a Java shop. this comes back to my attempt at C++ name parsing https://github.com/ropenscilabs/pegax but that is stalled due to it being hard, and the frustrating situation with https://gitlab.com/gogna/gnparser/ being hard to wrap in R because it's built as a server. gnparser is the parser behind taxize::gnr_resolve for http://resolver.globalnames.org/

cboettig commented 5 years ago

yeah, I was mostly hoping to reverse-engineer (some of) the Java or whatever rather than wrap it, for reasons you've pointed out already. e.g. I know globalnames is using a PEG (for at least part of the parsing), but I think we could define a serviceable and minimally invasive algorithm for cleaning, and am just looking for conceptual inspiration on how to do that.

For instance:

sckott commented 5 years ago

hmm, not sure which of those could be done and still have good results.

to clarify, what is the "clean name" you are aiming for? is it:

Poa annua ssp. annua Smith 1912

becomes

Poa annua annua

cboettig commented 5 years ago

@sckott great question, good to have a concrete example to work from.

Operationally, what I want from a "clean name" is a format that I can find an exact match for from a database provider, either as a recognized synonym or accepted name.

So, this means the "clean name" for Poa annua ssp. annua Smith 1912 is only meaningful relative to a given provider; i.e. whether it is in a format that matches the formats we see for scientificName values (synonyms or accepted) in that provider's data.

So there is no one answer. Poa annua ssp. annua Smith 1912 could itself be considered "clean", if any of our data providers used that format. For instance, GBIF seems to use the format <genus> <specificEpithet> <infraspecificEpithet> with no qualifier such as var., ssp., or subsp. between the specific epithet and the infraspecific epithet, so for GBIF you would want the Poa annua annua format. OTT meanwhile uses qualifiers, so "clean" would be Poa annua ssp. annua.
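A provider-specific cleaning step for the GBIF-style format could be sketched as below. The helper drop_rank_marker() and its short list of rank markers are assumptions for illustration and are certainly incomplete; for an OTT-style provider the step would simply be skipped.

```r
# Hypothetical helper: remove an infraspecific rank marker so that a name
# with a qualifier takes the GBIF-style format without one.
# The marker list (ssp, subsp, var) is an assumption and is not exhaustive.
drop_rank_marker <- function(x) {
  x <- gsub("\\s+(ssp|subsp|var)\\.?\\s+", " ", x, perl = TRUE)
  trimws(x)
}

with_markers <- c("Poa annua ssp. annua", "Poa annua var. supina", "Poa annua")
print(drop_rank_marker(with_markers))
```

Note that this does not remove the authority (Smith 1912), which would need its own, harder, parsing step.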

In practice, our copies of OTT and GBIF don't recognize these 'clean' names. Our GBIF (from taxizedb, at least as of this fall) recognizes four subspecies in Poa annua, and considers them all synonyms of Poa annua:

taxadb::synonyms("Poa annua",  "gbif")
acceptedNameUsage | synonym | acceptedNameUsageID | taxonRank
--- | --- | --- | ---
Poa annua | Ochlopoa annua raniglumis | GBIF:2704179 | species
Poa annua | Ochlopoa annua pilantha | GBIF:2704179 | species
Poa annua | Poa annua raniglumis | GBIF:2704179 | species
Poa annua | Ochlopoa raniglumis | GBIF:2704179 | species

It doesn't seem to have a record for Poa annua annua.

OTT recognizes Poa annua itself as an ambiguous synonym, which could refer to either of two accepted species names, Poa supina or Poa infirma, and recognizes 36 synonyms (subspecies, varieties, and species) for these two names:

taxadb::synonyms("Poa annua",  "ott")
input | acceptedNameUsage | synonym | acceptedNameUsageID | taxonRank | sort
--- | --- | --- | --- | --- | ---
poa annua | Poa supina | Poa duriuscula | OTT:595905 | species | 1
poa annua | Poa supina | Poa annua var. supina | OTT:595905 | species | 1
poa annua | Poa supina | Poa annua var. exigua | OTT:595905 | species | 1
poa annua | Poa supina | Poa annua | OTT:595905 | species | 1
poa annua | Poa supina | Poa annua subsp. supina | OTT:595905 | species | 1
poa annua | Poa supina | Poa annua var. varia | OTT:595905 | species | 1
poa annua | Poa supina | Poa supina subsp. foucaudii | OTT:595905 | species | 1
poa annua | Poa supina | Poa annua subsp. varia | OTT:595905 | species | 1
poa annua | Poa supina | Poa supina var. exigua | OTT:595905 | species | 1
poa annua | Poa supina | Poa ustulata | OTT:595905 | species | 1
poa annua | Poa supina | Ochlopoa supina | OTT:595905 | species | 1
poa annua | Poa supina | Poa foucaudii | OTT:595905 | species | 1
poa annua | Poa supina | Poa supina subsp. ustulata | OTT:595905 | species | 1
poa annua | Poa supina | Poa supina var. allobrogensis | OTT:595905 | species | 1
poa annua | Poa supina | Poa rivulorum | OTT:595905 | species | 1
poa annua | Poa supina | Ochlopoa rivulorum | OTT:595905 | species | 1
poa annua | Poa supina | Poa bifida | OTT:595905 | species | 1
poa annua | Poa supina | Poa exigua | OTT:595905 | species | 1
poa annua | Poa infirma | Poa annua | OTT:254228 | species | 1
poa annua | Poa infirma | Poa annua var. exilis | OTT:254228 | species | 1
poa annua | Poa infirma | Colpodium thomsonii | OTT:254228 | species | 1
poa annua | Poa infirma | Poa remotiflora | OTT:254228 | species | 1
poa annua | Poa infirma | Poa annua var. maroccana | OTT:254228 | species | 1
poa annua | Poa infirma | Poa annua var. remotiflora | OTT:254228 | species | 1
poa annua | Poa infirma | Poa annua var. tommasinii | OTT:254228 | species | 1
poa annua | Poa infirma | Poa annua subsp. exilis | OTT:254228 | species | 1
poa annua | Poa infirma | Poa exilis | OTT:254228 | species | 1
poa annua | Poa infirma | Poa maroccana | OTT:254228 | species | 1
poa annua | Poa infirma | Poa inconspicua | OTT:254228 | species | 1
poa annua | Poa infirma | Ochlopoa perinconspicua | OTT:254228 | species | 1
poa annua | Poa infirma | Ochlopoa maroccana | OTT:254228 | species | 1
poa annua | Poa infirma | Ochlopoa infirma | OTT:254228 | species | 1
poa annua | Poa infirma | Poa perinconspicua | OTT:254228 | species | 1
poa annua | Poa infirma | Megastachya infirma | OTT:254228 | species | 1
poa annua | Poa infirma | Catabrosa thomsonii | OTT:254228 | species | 1
poa annua | Poa infirma | Eragrostis infirma | OTT:254228 | species | 1

Neither knows anything about Poa annua ssp. annua Smith 1912 or Poa annua ssp. annua.

COL has 49 possible synonyms:

taxadb::synonyms("Poa annua",  "col")

Global Names Resolver tells me that it does resolve Poa annua var. annua in GBIF as a synonym of Poa annua L. (the L. simply being an authority reference to Linnaeus), which is to say that GNR is effectively just resolving it to the species name as well.

ITIS online (but not our dump) does recognize the Poa annua var annua L. as a known synonym for Poa annua -- apparently Linnaeus and not Smith is getting the credit this time...

All of this is to say that even if we could correctly clean the string to the subspecies or variety name format used by a given provider (e.g. Poa annua var. annua Smith for ITIS, Poa annua annua for GBIF, Poa annua subsp. annua for OTT, etc.), none of those would match, and even if they did match, we would only resolve an accepted ID to the species-level identifier for Poa annua anyway. That is, we would have reached the same conclusion had we simply hacked things down to the binomial name.

As the examples above show, we can still observe that the binomial name can be either ambiguous or include many varieties, and we can also see that the varieties returned don't match the original provided variety name from Smith.

Many applications are done at the species level (or above), e.g. macroecology studies that measure species richness, species traits, etc., so resolving the variety to the species level will be defensible some of the time, and seems to be baked into the use of synonym definitions here...
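One way to keep that species-level fallback from becoming a hidden assumption is to record which step produced each match. The sketch below is only an illustration: match_with_fallback() and the match_level column are hypothetical, and the toy lookup table holds just the Poa annua identifier from the GBIF output above.

```r
# Toy lookup table; the ID is the one shown for Poa annua in the GBIF output.
lookup <- data.frame(
  scientificName = "Poa annua",
  acceptedNameUsageID = "GBIF:2704179",
  stringsAsFactors = FALSE
)

# Hypothetical helper: try a case-insensitive match on the full name first,
# then fall back to the first two words, and report which step matched.
match_with_fallback <- function(name, lookup) {
  key <- tolower(lookup$scientificName)
  full <- tolower(trimws(name))
  binomial <- paste(head(strsplit(full, "\\s+")[[1]], 2), collapse = " ")

  i <- match(full, key)
  level <- "exact"
  if (is.na(i)) {
    i <- match(binomial, key)
    level <- "binomial"
  }
  if (is.na(i)) {
    # No match at either step.
    return(data.frame(input = name, acceptedNameUsageID = NA_character_,
                      match_level = NA_character_, stringsAsFactors = FALSE))
  }
  data.frame(input = name, acceptedNameUsageID = lookup$acceptedNameUsageID[i],
             match_level = level, stringsAsFactors = FALSE)
}

res <- match_with_fallback("Poa annua ssp. annua Smith 1912", lookup)
print(res)
```

A user working at the species level can accept the "binomial" matches, while anyone who cares about varieties can filter them out.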