Closed cboettig closed 4 years ago
a clean column seems like a good idea in general. lowercasing seems safe. just using first two words will be a problem when matching anything with subspecific ranks, etc. And then there's unicode
- stringi translation to ASCII characters, e.g. `stringi::stri_trans_general(names, "latin-ascii")`
- lowercase is already implemented now, just for joins. It could be pre-computed, but it is fast already.
(Note that in GBIF we already do:
https://github.com/cboettig/taxadb/blob/413e6cf6e388e8015b15e72969100f48422047a4/data-raw/gbif.R#L33
we should really be preserving `scientificName` there, and parsing it into authority + `canonicalName`; I wonder if GBIF is doing that manually or by script? @sckott ?)
it's possible they use https://www.gbif.org/developer/species#parser but I don't know for sure
@sckott thanks, that does look promising... have you ever tried to track the source code down for that?
no, but it's almost 100% going to be Java since they are a Java shop. this comes back to my attempt at C++ name parsing ( https://github.com/ropenscilabs/pegax ), which is stalled due to it being hard, and the frustrating situation with https://gitlab.com/gogna/gnparser/ being hard to wrap in R because it's built as a server. gnparser is the parser behind `taxize::gnr_resolve` for http://resolver.globalnames.org/
yeah, I was mostly hoping to reverse-engineer (some of) the java or whatever rather than wrap it, for reasons you've pointed out already. e.g. I know globalnames is doing PEG (for at least part of the parsing), but I think we could define a serviceable and minimally invasive algorithm for cleaning, and am just looking for conceptual inspiration on how to do that.
For instance, what to do with `()`, `[]`, etc.? With tokens matching `[a-z]\\.\\s`? Hmm, not sure which of those could be done and still have good results.
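One possible reading of the patterns above is to strip them from the name. The sketch below is only an illustration of that reading: `strip_annotations()` is a hypothetical helper (not part of taxadb), and the abbreviation pattern is widened to `[a-z]+\\.\\s` on the assumption that whole tokens like `ssp. ` should go, not just their last letter.

```r
# Hypothetical helper, base R only: removes bracketed text and
# lowercase abbreviations from a name string. Not taxadb's method.
strip_annotations <- function(x) {
  x <- gsub("\\([^)]*\\)", " ", x, perl = TRUE)      # drop "(...)" groups
  x <- gsub("\\[[^]]*\\]", " ", x, perl = TRUE)      # drop "[...]" groups
  x <- gsub("\\b[a-z]+\\.\\s", " ", x, perl = TRUE)  # drop tokens like "ssp. " or "var. "
  x <- gsub("\\s+", " ", x, perl = TRUE)             # collapse repeated whitespace
  trimws(x)
}

strip_annotations("Poa annua ssp. annua [sic] (L.)")
strip_annotations("Chondria tenuissima (Withering) C.Agardh, 1817")
```

The second call still leaves `C.Agardh, 1817` behind, which shows how quickly these simple rules run out: removing an unbracketed authority needs either more assumptions or a real parser.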
to clarify, what is the "clean name" you are aiming for? is it: `Poa annua ssp. annua Smith 1912` becomes `Poa annua annua`?
@sckott great question, good to have a concrete example to work from.
Operationally, what I want from a "clean name" is a format that I can find an exact match for from a database provider, either as a recognized synonym or accepted name.
So, this means the "clean name" for `Poa annua ssp. annua Smith 1912` is only meaningful relative to a given provider; i.e., is it in a format that matches the formats we see for scientificNames (synonyms or accepted) in that provider's data?
So there is no one answer. `Poa annua ssp. annua Smith 1912` could itself be considered "clean", if any of our data providers used that format. For instance, GBIF seems to use the format `<genus> <specificEpithet> <infraspecificEpithet>` with no qualifier such as `var.`, `ssp.`, or `subsp.` between the specificEpithet and the infraspecificEpithet, so for GBIF you would want the `Poa annua annua` format. OTT meanwhile uses qualifiers, so "clean" would be `Poa annua ssp. annua`.
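To make the "clean is relative to a provider" point concrete, here is a minimal sketch. `format_for_provider()` and `rank_qualifiers` are hypothetical names, the qualifier list covers only the three abbreviations mentioned above, and the input is assumed to have had its authority and year removed already.

```r
# Rank qualifiers mentioned in this thread; illustrative, not exhaustive.
rank_qualifiers <- c("ssp.", "subsp.", "var.")

# Hypothetical helper: reformat one name string for a given provider.
# The rules encode only what is described above: GBIF drops the
# qualifier between epithets, OTT keeps it.
format_for_provider <- function(name, provider = c("gbif", "ott")) {
  provider <- match.arg(provider)
  words <- strsplit(trimws(name), "\\s+")[[1]]
  if (provider == "gbif") {
    words <- words[!words %in% rank_qualifiers]
  }
  paste(words, collapse = " ")
}

format_for_provider("Poa annua ssp. annua", "gbif")  # "Poa annua annua"
format_for_provider("Poa annua ssp. annua", "ott")   # "Poa annua ssp. annua"
```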
In practice, our copies of OTT and GBIF don't recognize these 'clean' names. Our GBIF (from taxizedb, at least as of this fall) recognizes four subspecies in `Poa annua`, and considers them all synonyms of `Poa annua`:
`taxadb::synonyms("Poa annua", "gbif")`
acceptedNameUsage | synonym | acceptedNameUsageID | taxonRank |
---|---|---|---|
Poa annua | Ochlopoa annua raniglumis | GBIF:2704179 | species |
Poa annua | Ochlopoa annua pilantha | GBIF:2704179 | species |
Poa annua | Poa annua raniglumis | GBIF:2704179 | species |
Poa annua | Ochlopoa raniglumis | GBIF:2704179 | species |
It doesn't seem to have a record for `Poa annua annua`.
OTT recognizes `Poa annua` itself as an ambiguous synonym, which could refer to either of the accepted species names `Poa supina` or `Poa infirma`, and recognizes 36 synonyms (subspecies and species) for these two names:
`taxadb::synonyms("Poa annua", "ott")`
input | acceptedNameUsage | synonym | acceptedNameUsageID | taxonRank | sort |
---|---|---|---|---|---|
poa annua | Poa supina | Poa duriuscula | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa annua var. supina | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa annua var. exigua | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa annua | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa annua subsp. supina | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa annua var. varia | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa supina subsp. foucaudii | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa annua subsp. varia | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa supina var. exigua | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa ustulata | OTT:595905 | species | 1 |
poa annua | Poa supina | Ochlopoa supina | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa foucaudii | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa supina subsp. ustulata | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa supina var. allobrogensis | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa rivulorum | OTT:595905 | species | 1 |
poa annua | Poa supina | Ochlopoa rivulorum | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa bifida | OTT:595905 | species | 1 |
poa annua | Poa supina | Poa exigua | OTT:595905 | species | 1 |
poa annua | Poa infirma | Poa annua | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa annua var. exilis | OTT:254228 | species | 1 |
poa annua | Poa infirma | Colpodium thomsonii | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa remotiflora | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa annua var. maroccana | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa annua var. remotiflora | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa annua var. tommasinii | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa annua subsp. exilis | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa exilis | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa maroccana | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa inconspicua | OTT:254228 | species | 1 |
poa annua | Poa infirma | Ochlopoa perinconspicua | OTT:254228 | species | 1 |
poa annua | Poa infirma | Ochlopoa maroccana | OTT:254228 | species | 1 |
poa annua | Poa infirma | Ochlopoa infirma | OTT:254228 | species | 1 |
poa annua | Poa infirma | Poa perinconspicua | OTT:254228 | species | 1 |
poa annua | Poa infirma | Megastachya infirma | OTT:254228 | species | 1 |
poa annua | Poa infirma | Catabrosa thomsonii | OTT:254228 | species | 1 |
poa annua | Poa infirma | Eragrostis infirma | OTT:254228 | species | 1 |
Neither knows anything about `Poa annua ssp. annua Smith 1912` or `Poa annua ssp. annua`.
COL has 49 possible synonyms:
`taxadb::synonyms("Poa annua", "col")`
Global Names Resolver tells me that it does resolve `Poa annua var. annua` in GBIF as a synonym of `Poa annua L.` (the `L.` simply being an authority reference to Linnaeus), which is to say that GNR is effectively just resolving it to the species name as well.
ITIS online (but not our dump) does recognize `Poa annua var annua L.` as a known synonym for `Poa annua` -- apparently Linnaeus and not Smith is getting the credit this time...
All of this is to say that even if we could correctly clean the string to the subspecies or variety name format used by a given provider (e.g. `Poa annua var. annua Smith` for ITIS, `Poa annua annua` for GBIF, `Poa annua subsp. annua` for OTT, etc.), none of those would match, and even if they did match, we would only resolve an accepted ID to the species-level identifier for `Poa annua` anyway. That is, we would have reached the same conclusion had we simply hacked things down to the binomial name.
As the examples above show, we can still observe that the binomial name can be either ambiguous or include many varieties, and we can also see that the varieties returned don't match the original provided variety name from Smith.
Many applications work at the species level (or above), e.g. macro-ecology in which you measure species richness, species traits, etc., so resolving the variety to the species level will be defensible some of the time, and seems to be baked into the use of synonym definitions here...
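As a rough illustration of "hacking things down to the binomial", the sketch below keeps only the first two whitespace-separated words of each name. `to_binomial()` is a hypothetical helper written for this example; it assumes the first two words really are genus and specific epithet, which is exactly the taxonomic assumption discussed in this thread.

```r
# Hypothetical helper: keep the first two words of each name.
# Assumes word 1 is the genus and word 2 the specific epithet.
to_binomial <- function(x) {
  words <- strsplit(trimws(x), "\\s+")
  vapply(words, function(w) paste(head(w, 2), collapse = " "), character(1))
}

# The provider-specific variety formats mentioned above all collapse
# to the same species-level string.
variants <- c("Poa annua var. annua Smith",
              "Poa annua annua",
              "Poa annua subsp. annua")
unique(to_binomial(variants))
```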
@karinorman & @sckott
Our current strategy matches requested names directly against the `scientificName` field, which is whatever name we get from the original names provider. I'm thinking we should create a new column of 'cleaned' names that would potentially not be part of the actual schema, but might be easier to match against in JOINS. The trick is to do this in a way that is still true to the original database and doesn't implicitly introduce some hidden assumptions. For example:
- In the `lowercase` branch, I've added a step which first does a `mutate(input = tolower(scientificName))` on the stored database, simply to create a lowercase version of all the scientificNames (at any rank), so that we can do case-insensitive joins by also lowercasing the input query before running the join. Currently this is done when `ids()` is called, so it doesn't involve altering the Darwin Core records from `data-raw`. That adds unnecessary overhead computation, though only a second or so, since `mutate` with `tolower` is actually pretty fast here: when the data is in an external database, `dplyr` translates this into the SQL equivalent of `tolower` rather than applying the base R method.
- This makes joins case-insensitive, but we still have other cases where we will want a join to succeed but it doesn't. For instance, OTT has the synonym `Chondria tenuissima (Withering) C.Agardh, 1817`, which means that `Chondria tenuissima` fails to return a match. However, just hacking off anything after the first two words might implicitly introduce taxonomic assumptions that are not guaranteed (compared to being case-insensitive, which seems like a safer assumption).
- The function `clean_names()` is intended to be applied to input names, and it can optionally do things like binomial-ize names to make this matching easier, but to actually get matches it should probably be done to species names as well (as an additional column). Opening this for us to think more about this issue.
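The lowercase-key idea can be sketched on a toy table. This is not taxadb's code: it uses base R `merge()` on an in-memory data frame instead of `dplyr` against a database, and the `provider` and `query` tables are made up for illustration (only the `scientificName` field and the `input` column name come from the description above).

```r
# Toy stand-in for a provider table; real tables have many more columns.
provider <- data.frame(
  scientificName = c("Poa annua",
                     "Chondria tenuissima (Withering) C.Agardh, 1817"),
  stringsAsFactors = FALSE
)

# Extra lowercase key stored next to the untouched scientificName.
provider$input <- tolower(provider$scientificName)

# Lowercase the query the same way before joining.
query <- data.frame(
  input = tolower(c("POA ANNUA", "Chondria tenuissima")),
  stringsAsFactors = FALSE
)

# Left join on the key; queries without a match get NA.
matches <- merge(query, provider, by = "input", all.x = TRUE)
matches
```

The case difference in `POA ANNUA` no longer blocks the match, while `Chondria tenuissima` still comes back as `NA` because the stored name carries its authority string.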