cboettig commented 5 years ago

@karinorman & @sckott

Our current strategy matches requested names directly against the scientificName field, which is whatever name we get from the original names provider. I'm thinking we should create a new column of 'cleaned' names that would potentially not be part of the actual schema, but might be easier to match against in JOINS. The trick is to do this in a way that is still true to the original database and doesn't implicitly introduce some hidden assumptions.

For example:

In the lowercase branch, I've added a step which first does a mutate(input = tolower(scientificName)) on the stored database, simply to create a lowercase version of all the scientificNames (at any rank), so that we can do case-insensitive joins by also lowercasing the input query before running the join. Currently this is done when ids() is called, so it doesn't involve altering the Darwin Core records from data-raw. That adds unnecessary computational overhead, though only a second or so, since mutate with tolower is actually pretty fast here: when the data is in an external database, dplyr translates this into the SQL lowercasing function rather than applying the base R method.
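A minimal sketch of the lowercase-then-join idea, using base R data frames and merge() as a stand-in for the dplyr mutate and join on the real database. The table, its IDs, and the `requested` column are made up for illustration; only the `input` and `scientificName` column names come from the description above.

```r
# Toy stand-in for the stored names table; the IDs are placeholders, not real records.
db <- data.frame(
  taxonID = c("EX:1", "EX:2"),
  scientificName = c("Poa annua", "Homo sapiens"),
  stringsAsFactors = FALSE
)
# Same idea as mutate(input = tolower(scientificName)): add a lowercase key column.
db$input <- tolower(db$scientificName)

# Lowercase the requested names the same way before joining.
query <- data.frame(requested = c("POA ANNUA", "homo Sapiens"),
                    stringsAsFactors = FALSE)
query$input <- tolower(query$requested)

# Left join on the lowercase key; the original scientificName is left untouched.
matched <- merge(query, db, by = "input", all.x = TRUE)
matched[, c("requested", "scientificName", "taxonID")]
```

Because the key lives in its own column, the provider's original scientificName is still returned exactly as stored.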

This makes joins case-insensitive, but there are still other cases where we want a join to succeed and it doesn't. For instance, OTT has the synonym Chondria tenuissima (Withering) C.Agardh, 1817, which means that Chondria tenuissima fails to return a match. However, just hacking off anything after the first two words might implicitly introduce taxonomic assumptions that are not guaranteed (compared to being case-insensitive, which seems like a safer assumption).

The function clean_names() is intended to be applied to input names, and it can optionally do things like binomial-ize names to make this matching easier, but to actually get matches it should probably be done to species names as well (as an additional column). Opening this for us to think more about this issue.
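To make the trade-off concrete, here is a rough sketch of "keep the first two words". The helper name `binomial_only` is hypothetical and is not the actual clean_names() implementation; the example names are taken from this thread.

```r
# Hypothetical helper: keep only the first two whitespace-separated tokens.
binomial_only <- function(x) {
  vapply(
    strsplit(trimws(x), "\\s+"),
    function(tokens) paste(head(tokens, 2), collapse = " "),
    character(1)
  )
}

binomial_only(c(
  "Chondria tenuissima (Withering) C.Agardh, 1817",
  "Poa annua var. supina",
  "Poa annua var. exilis"
))
```

The first name now reduces to the plain binomial, which is the desired effect, but the two Poa annua varieties collapse to the same string, which is exactly the kind of hidden assumption to be wary of.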

sckott commented 5 years ago

a clean column seems like a good idea in general. lowercasing seems safe. just using first two words will be a problem when matching anything with subspecific ranks, etc. And then there's unicode

cboettig commented 5 years ago

33 adds mutate_db support for generic operations

So unicode I think we can clean reasonably with stringi translation to ascii characters: e.g. stringi::stri_trans_general(names, "latin-ascii")
lowercase is already implemented now, just for joins. Could be pre-computed but is fast already.
Wish we had a regex or similar strategy for dealing with author / authority bits. It would be nice to preserve sub-species names while still getting a more 'canonical' name format to match against.
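A small sketch of the two "safe" steps combined, assuming the stringi package is installed. The wrapper name `to_ascii_lower` is made up for illustration; the stri_trans_general() call is the one mentioned above.

```r
# Hypothetical wrapper: transliterate Latin characters to ASCII, then lowercase.
to_ascii_lower <- function(x) {
  tolower(stringi::stri_trans_general(x, "latin-ascii"))
}

# "\u00eb" is e with a diaeresis, written as an escape to avoid encoding issues.
to_ascii_lower("Iso\u00ebtes lacustris")
```

This only normalizes characters and case; it does nothing about author or authority strings.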

Note that in GBIF we already do:

https://github.com/cboettig/taxadb/blob/413e6cf6e388e8015b15e72969100f48422047a4/data-raw/gbif.R#L33

we should really be preserving scientificName there, and parsing it into authority+canonicalName; I wonder if GBIF is doing that manually or by script? @sckott ?

sckott commented 5 years ago

it's possible they use https://www.gbif.org/developer/species#parser but I don't know for sure

cboettig commented 5 years ago

@sckott thanks, that does look promising... have you ever tried to track the source code down for that?

sckott commented 5 years ago

no, but it's almost 100% going to be Java since they are a Java shop. this comes back to my attempt at C++ name parsing, https://github.com/ropenscilabs/pegax, which is stalled due to it being hard, and the frustrating situation with https://gitlab.com/gogna/gnparser/ being hard to wrap in R because it's built as a server. gnparser is the parsing behind taxize::gnr_resolve for http://resolver.globalnames.org/

cboettig commented 5 years ago

yeah, I was mostly hoping to reverse-engineer (some of) the java or whatever rather than wrap it, for reasons you've pointed out already. e.g. I know globalnames is doing PEG (for at least part of the parsing), but I think we could define a serviceable and minimally invasive algorithm for cleaning, and am just looking for conceptual inspiration on how to do that.

For instance:

can we drop numeric codes (though I think they are common in NCBI names for purposes other than authority publication year)...
can we drop (or translate to spaces and then compact the whitespace) any non-alphanumeric characters?
can we drop text in () , [] , etc?
can we drop or otherwise do something with things that look like abbreviations? ([a-z]\\.\\s)?
...
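As a rough experiment, the rules above can be applied literally with a few regular expressions. The function name `strip_noise` is hypothetical, and the abbreviation pattern is loosened to match capital letters as well, which is an assumption beyond the pattern written above.

```r
# Hypothetical helper applying the candidate rules in order.
strip_noise <- function(x) {
  x <- gsub("\\([^)]*\\)|\\[[^]]*\\]", " ", x, perl = TRUE)  # text in () or []
  x <- gsub("[0-9]+", " ", x, perl = TRUE)                   # numeric codes and years
  x <- gsub("\\b[A-Za-z]\\.", " ", x, perl = TRUE)           # one-letter abbreviations like "C."
  x <- gsub("[^[:alnum:]]+", " ", x, perl = TRUE)            # other punctuation becomes a space
  trimws(gsub("\\s+", " ", x, perl = TRUE))                  # compact whitespace
}

strip_noise("Chondria tenuissima (Withering) C.Agardh, 1817")
strip_noise("Poa annua ssp. annua Smith 1912")
```

On these two inputs the result is "Chondria tenuissima Agardh" and "Poa annua ssp annua Smith": the year and bracketed author go away, but unbracketed author names survive, so these rules alone would not yet produce a matchable name. The digit rule would also mangle NCBI-style names that carry numeric codes.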

sckott commented 5 years ago

hmm, not sure which of those could be done and still have good results.

to clarify, what is the "clean name" you are aiming for? is it:

Poa annua ssp. annua Smith 1912

becomes

Poa annua annua
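One heuristic way to sketch that transformation, which is not a real name parser: keep the genus, then keep only all-lowercase tokens that are not rank markers. The function name `canonical_guess` and the list of rank markers are assumptions for illustration.

```r
# Assumed, incomplete list of rank markers (compared after removing a trailing dot).
rank_markers <- c("ssp", "subsp", "var", "f")

# Hypothetical helper for a single name string.
canonical_guess <- function(x) {
  tokens <- strsplit(trimws(x), "\\s+")[[1]]
  bare <- sub("\\.$", "", tokens)                 # "ssp." becomes "ssp"
  keep <- grepl("^[a-z-]+$", bare) & !(bare %in% rank_markers)
  keep[1] <- TRUE                                 # always keep the genus
  paste(tokens[keep], collapse = " ")
}

canonical_guess("Poa annua ssp. annua Smith 1912")
canonical_guess("Chondria tenuissima (Withering) C.Agardh, 1817")
```

This gives "Poa annua annua" and "Chondria tenuissima" for the two examples, but it would wrongly keep lowercase author particles such as "de" or "ex", so it is at best a starting point.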

cboettig commented 5 years ago

@sckott great question, good to have a concrete example to work from.

Operationally, what I want from a "clean name" is a format that I can find an exact match for from a database provider, either as a recognized synonym or accepted name.

So, this means the "clean name" for Poa annua ssp. annua Smith 1912 is only meaningful relative to a given provider; i.e. is it in a format that matches the formats we see for ScientificNames (synonyms or accepted) in that provider's data.

So there is no one answer. Poa annua ssp. annua Smith 1912 could itself be considered "clean", if any of our data providers used that format. For instance, GBIF seems to use the format <genus> <specificEpithet> <infraspecificEpithet> with no qualifier such as var., ssp., or subsp. between the specificEpithet and the infraspecificEpithet, so for GBIF you would want the Poa annua annua format. OTT meanwhile uses qualifiers, so "clean" would be Poa annua ssp. annua.
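If the name were already split into its parts, the provider-specific target could be sketched as below. The function `format_for_provider` is hypothetical, and the two rules only encode the formats that GBIF and OTT seem to use according to the paragraph above; they are not taken from either provider's documentation.

```r
# Hypothetical formatter: build the string a given provider seems to expect.
format_for_provider <- function(genus, species, infra, provider, marker = "ssp.") {
  switch(provider,
    gbif = paste(genus, species, infra),          # no rank qualifier
    ott  = paste(genus, species, marker, infra),  # qualifier kept
    stop("no format rule for provider: ", provider)
  )
}

format_for_provider("Poa", "annua", "annua", "gbif")
format_for_provider("Poa", "annua", "annua", "ott")
```

The hard part remains getting from the raw string to the parts, and knowing which qualifier a provider uses for a given record.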

In practice, our copies of OTT and GBIF don't recognize these 'clean' names. Our GBIF (from taxizedb, at least as of this fall), recognizes four subspecies in Poa annua, and considers them all synonyms of Poa annua

taxadb::synonyms("Poa annua",  "gbif")

acceptedNameUsage	synonym	acceptedNameUsageID	taxonRank
Poa annua	Ochlopoa annua raniglumis	GBIF:2704179	species
Poa annua	Ochlopoa annua pilantha	GBIF:2704179	species
Poa annua	Poa annua raniglumis	GBIF:2704179	species
Poa annua	Ochlopoa raniglumis	GBIF:2704179	species

It doesn't seem to have a record for Poa annua annua.

OTT recognizes Poa annua itself as an ambiguous synonym, which could refer to either of the accepted species names Poa supina or Poa infirma, and recognizes 36 synonyms (subspecies and species) for these two names:

taxadb::synonyms("Poa annua",  "ott")

input	acceptedNameUsage	synonym	acceptedNameUsageID	taxonRank	sort
poa annua	Poa supina	Poa duriuscula	OTT:595905	species	1
poa annua	Poa supina	Poa annua var. supina	OTT:595905	species	1
poa annua	Poa supina	Poa annua var. exigua	OTT:595905	species	1
poa annua	Poa supina	Poa annua	OTT:595905	species	1
poa annua	Poa supina	Poa annua subsp. supina	OTT:595905	species	1
poa annua	Poa supina	Poa annua var. varia	OTT:595905	species	1
poa annua	Poa supina	Poa supina subsp. foucaudii	OTT:595905	species	1
poa annua	Poa supina	Poa annua subsp. varia	OTT:595905	species	1
poa annua	Poa supina	Poa supina var. exigua	OTT:595905	species	1
poa annua	Poa supina	Poa ustulata	OTT:595905	species	1
poa annua	Poa supina	Ochlopoa supina	OTT:595905	species	1
poa annua	Poa supina	Poa foucaudii	OTT:595905	species	1
poa annua	Poa supina	Poa supina subsp. ustulata	OTT:595905	species	1
poa annua	Poa supina	Poa supina var. allobrogensis	OTT:595905	species	1
poa annua	Poa supina	Poa rivulorum	OTT:595905	species	1
poa annua	Poa supina	Ochlopoa rivulorum	OTT:595905	species	1
poa annua	Poa supina	Poa bifida	OTT:595905	species	1
poa annua	Poa supina	Poa exigua	OTT:595905	species	1
poa annua	Poa infirma	Poa annua	OTT:254228	species	1
poa annua	Poa infirma	Poa annua var. exilis	OTT:254228	species	1
poa annua	Poa infirma	Colpodium thomsonii	OTT:254228	species	1
poa annua	Poa infirma	Poa remotiflora	OTT:254228	species	1
poa annua	Poa infirma	Poa annua var. maroccana	OTT:254228	species	1
poa annua	Poa infirma	Poa annua var. remotiflora	OTT:254228	species	1
poa annua	Poa infirma	Poa annua var. tommasinii	OTT:254228	species	1
poa annua	Poa infirma	Poa annua subsp. exilis	OTT:254228	species	1
poa annua	Poa infirma	Poa exilis	OTT:254228	species	1
poa annua	Poa infirma	Poa maroccana	OTT:254228	species	1
poa annua	Poa infirma	Poa inconspicua	OTT:254228	species	1
poa annua	Poa infirma	Ochlopoa perinconspicua	OTT:254228	species	1
poa annua	Poa infirma	Ochlopoa maroccana	OTT:254228	species	1
poa annua	Poa infirma	Ochlopoa infirma	OTT:254228	species	1
poa annua	Poa infirma	Poa perinconspicua	OTT:254228	species	1
poa annua	Poa infirma	Megastachya infirma	OTT:254228	species	1
poa annua	Poa infirma	Catabrosa thomsonii	OTT:254228	species	1
poa annua	Poa infirma	Eragrostis infirma	OTT:254228	species	1

Neither knows anything about Poa annua ssp. annua Smith 1912 or Poa annua ssp. annua.

COL has 49 possible synonyms

taxadb::synonyms("Poa annua",  "col")

Global Names Resolver tells me that it does resolve Poa annua var. annua in GBIF as a synonym of Poa annua L. (the L. simply being an authority reference to Linnaeus), which is to say that GNR is effectively just resolving it to the species name as well.

ITIS online (but not our dump) does recognize the Poa annua var annua L. as a known synonym for Poa annua -- apparently Linnaeus and not Smith is getting the credit this time...

All of this is to say that even if we could correctly clean the string to the subspecies or variety name format used by a given provider (e.g. Poa annua var. annua Smith for ITIS, Poa annua annua for GBIF, Poa annua subsp. annua for OTT, etc.), none of those would match, and even if they did match, we would only resolve an accepted ID to the species-level identifier for Poa annua anyway. That is, we would have reached the same conclusion had we simply hacked things down to the binomial name.

As the examples above show, we can still observe that the binomial name can be either ambiguous or include many varieties, and we can also see that the varieties returned don't match the original provided variety name from Smith.
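The ambiguity can be checked mechanically by counting distinct accepted IDs per synonym. This sketch uses four rows copied by hand from the OTT output above rather than a live taxadb call.

```r
# Four rows copied from the OTT synonyms table above.
ott <- data.frame(
  synonym = c("Poa annua", "Poa annua",
              "Poa annua var. supina", "Poa annua var. exilis"),
  acceptedNameUsage = c("Poa supina", "Poa infirma",
                        "Poa supina", "Poa infirma"),
  acceptedNameUsageID = c("OTT:595905", "OTT:254228",
                          "OTT:595905", "OTT:254228"),
  stringsAsFactors = FALSE
)

# Number of distinct accepted IDs each synonym points to.
n_accepted <- tapply(ott$acceptedNameUsageID, ott$synonym,
                     function(id) length(unique(id)))

# Synonyms that resolve to more than one accepted name.
ambiguous <- names(n_accepted)[n_accepted > 1]
ambiguous
```

In this subset only the bare binomial is ambiguous, while each variety name resolves to a single accepted ID, so truncating to the binomial can turn an unambiguous name into an ambiguous one.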

Many applications work at the species level (or above), e.g. macro-ecology studies that measure species richness, species traits, etc., so resolving the variety to the species level will be defensible some of the time, and seems to be baked into the use of synonym definitions here...

ropensci / taxadb

Consider a "clean" name column for matching against #30
