ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

Hybrids not denoted correctly in Catalogue of Life db #122

Closed jtmiller28 closed 1 month ago

jtmiller28 commented 2 months ago

Hello,

A collaborator brought to my attention that taxadb's Catalogue of Life database returns some odd results for particular plant hybrids.

E.g. Encyclia × nizandensis Pérez-García & Hágsater comes up as Encyclia nizandensis with acceptedNameUsageID: COL:39PKV, while searching Catalogue of Life's web tool correctly shows this name in its hybrid form with the '×'. My current code to retrieve names from col in taxadb is:

```r
library(taxadb)
library(dplyr)
library(data.table)

col <- taxa_tbl("col") %>%
  select(scientificName, taxonRank, acceptedNameUsageID, taxonomicStatus) %>%
  filter(taxonRank == "species")

# Load into memory, convert to data.table for faster operations
col_names <- col %>% collect() # load into memory
col_names <- as.data.table(col_names)
```

Is there a way to retrieve the hybrid variations of these names?

Thanks!

cboettig commented 1 month ago

Thanks for raising this. Note that we need an update of the backend data: the current exports of COL are from 2022 (December 2022 records).

This may or may not explain the discrepancy, though. We have tried to do a little standardization of how scientific names are listed in the scientificName column, especially as we are aimed at database-type operations where predictable string matching can be important. This is difficult because different catalogs follow different conventions, including marking hybrid names with an x in the scientificName, or insisting that the naming authority be listed (usually a 'citation', such as an author last name and possibly a year, with varying syntax regarding parentheses, punctuation, non-UTF-8 characters, etc.).
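To make the string-matching point concrete, here is a minimal base R sketch of the kind of normalization involved. The helper `strip_hybrid_marker()` is hypothetical and written only for illustration; it is not taxadb's actual cleaning code, and it leaves any authorship string untouched.

```r
# Hypothetical helper (NOT part of taxadb): remove hybrid markers so that
# "Encyclia × nizandensis" and "Encyclia nizandensis" compare equal.
strip_hybrid_marker <- function(x) {
  # Replace the multiplication sign (U+00D7) with a space
  x <- gsub("\u00d7", " ", x, fixed = TRUE)
  # Drop a standalone "x" or "X" used as a hybrid sign between name parts
  x <- gsub("[[:space:]][xX][[:space:]]", " ", x)
  # Drop a leading standalone "x " (as in some intergeneric hybrid names)
  x <- gsub("^[xX][[:space:]]+", "", x)
  # Collapse repeated whitespace and trim the ends
  trimws(gsub("[[:space:]]+", " ", x))
}

strip_hybrid_marker(c("Encyclia \u00d7 nizandensis", "Encyclia x nizandensis"))
```

Both inputs reduce to the same plain binomial, which is convenient for joins but loses the information that the name denotes a hybrid.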

Darwin Core continues to evolve on this, and now provides http://rs.tdwg.org/dwc/terms/verbatimIdentification in addition to http://rs.tdwg.org/dwc/terms/scientificName (which is distinct from https://dwc.tdwg.org/list/#dwc_acceptedScientificName, where I think this hybrid notation is not proper). You can see recent changes to Darwin Core regarding these names (more recent than the current taxadb release), e.g. in the 2023 Darwin Core updates, as discussed in https://github.com/tdwg/dwc/issues/392. Anyway, the point is that taxadb faces an obvious challenge here in working across multiple naming providers which follow the dwc standard to varying degrees, while providers' own practices change over time and so do the standards (though often not in lockstep!). But as you know, this is just the reality of taxonomy, which has always been a complex and dynamic thing.

Anyway, apologies for getting behind on the updates; I'll try to prepare a new version soon. Continued advances in technology are making it easier to distribute these larger parquet collections without the 'sharding' employed in the last release, but they also make for a moving target. Meanwhile, you may prefer to just access the COL snapshots directly: https://www.catalogueoflife.org/data/download. (Please note that until somewhat recently, COL acceptedNameUsageID was not standardized across releases, which made the earlier IDs almost meaningless.)

jtmiller28 commented 1 month ago

Hmm, understood. Yeah, hybrids have been a consistent mess in my research so far (usually they are thrown out, since I am skeptical of harmonizing their names and pulling them at scale). Interested to see how this develops with dwc standards down the line...

In any case, thanks for pointing me to the COL snapshot. I'll see if that can remedy the 700 or so strange cases that are causing conflicts for us.
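For setting such cases aside before matching, a rough base R sketch of flagging hybrid-marked names might look like the following. The function `is_hybrid_name()` and the example vector are hypothetical, and the check only recognizes the '×' sign or a standalone 'x'; other hybrid notations would need their own rules.

```r
# Hypothetical helper: TRUE where a name carries a hybrid marker, either the
# multiplication sign (U+00D7) or a standalone "x"/"X" followed by a space.
is_hybrid_name <- function(x) {
  grepl("\u00d7", x, fixed = TRUE) |
    grepl("(^|[[:space:]])[xX][[:space:]]", x)
}

# Illustrative input only, not taken from any COL export
example_names <- c(
  "Encyclia \u00d7 nizandensis",
  "Encyclia nizandensis",
  "Quercus x hispanica"
)

# Split into non-hybrid ("FALSE") and hybrid ("TRUE") groups for separate handling
split(example_names, is_hybrid_name(example_names))
```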

Best, JT