ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

scientific name authors? #100

Open joelnitta opened 2 years ago

joelnitta commented 2 years ago

The data-sources vignette mentions

While DWC encourages the use of authorship citations, these are intentionally omitted in most tables as inconsistency in abbreviations and formatting make names with authors much harder to resolve. When available, this information is provided in the additional optional columns using the corresponding Darwin Core terms.

However, although most of the data sources supported by taxadb do have scientific name authorship data, it does not seem to be provided in all of the taxadb databases. I have been able to verify this for at least ITIS and NCBI. Furthermore, it is not clear which field authorship would appear in if it were available.

Although the vignette cites the presence of author names making name resolution more difficult as the reason not to include them, the opposite is also true. The author of a scientific name can be very important for resolving names, particularly in the case of ambiguous synonyms: names that are synonyms (thus pointing to different names) and have identical genus and specific epithet, but different authors. There is no way to distinguish these without author. And what is worse, code that matches on identical scientific names could lead one to completely different entities.

Would it be possible to add scientificNameAuthorship? That way there would be a standardized way to provide authorship data without polluting scientificName.
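To make the ambiguity concrete, here is a minimal base-R sketch. The names, authors, and taxonIDs are made-up placeholders (not taken from ITIS, NCBI, or any other provider), and the column layout only assumes the Darwin Core terms mentioned above.

```r
# Hypothetical reference table: two synonyms share genus and specific epithet,
# but have different authors and point to different accepted names.
ref <- data.frame(
  taxonID = c("T1", "T2"),
  scientificName = c("Aus bus", "Aus bus"),
  scientificNameAuthorship = c("Smith", "Jones"),
  acceptedNameUsage = c("Aus cus", "Xus bus"),
  stringsAsFactors = FALSE
)

# A query name that comes with its author
query <- data.frame(
  scientificName = "Aus bus",
  scientificNameAuthorship = "Jones",
  stringsAsFactors = FALSE
)

# Matching on scientificName alone returns both candidates
by_name <- merge(query["scientificName"], ref, by = "scientificName")
nrow(by_name)

# Matching on name plus author narrows it down to a single accepted name
by_name_author <- merge(
  query, ref,
  by = c("scientificName", "scientificNameAuthorship")
)
by_name_author$acceptedNameUsage
```

Keeping the author in its own column is what lets the second match work without touching scientificName.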

(related to https://github.com/ropensci/taxadb/issues/11)

cboettig commented 2 years ago

Thanks, yeah, great question. Short answer is yes, we could and probably should do this, just need to figure out the details.

Our recent Catalogue of Life (col) tables now use namePublishedIn for this purpose. (Note that technically COL includes names from ITIS and NCBI, among other sources.) I agree it would make sense to add this to other sources where available.

Arguably we should attempt to report both namePublishedIn and scientificNameAuthorship, though these terms don't appear to be distinguished all that precisely in Darwin Core. Arguably, if only an author name is present, it may not really constitute a valid namePublishedIn record, but given the utter lack of standardization in how these strings are written I suppose it doesn't really matter.

I definitely agree with the case you highlight about ambiguous synonyms. Arguably this is why some databases, like ITIS (maybe NCBI too, I forget), assign taxonIDs to synonyms as well as accepted names, though not all databases do and that practice can be confusing. Two such ambiguous synonyms with identical genus/specific epithet but different authors would also have different taxonIDs. (My hesitancy with author strings has always been that technically 'matching' author strings are sometimes formatted inconsistently, even within a single provider's database, so comparing to the ID seems preferable.) E.g. witness:

library(dplyr)
library(taxadb)
itis <- taxa_tbl("itis")

## all synonyms that share a genus/specificEpithet
itis %>% 
   filter(taxonomicStatus != "accepted", taxonRank == "species", !is.na(specificEpithet)) %>%
   count(genus, specificEpithet) %>% filter(n==2)
# Source:   lazy query [?? x 3]
# Database: duckdb_connection
# Groups:   genus
   genus       specificEpithet     n
   <chr>       <chr>           <dbl>
 1 Dendropicos obsoletus           2
 2 Eoctenes    spasmae             2
 3 Naso        tonganus            2
 4 Siganus     vermiculatus        2
 5 Crenarctus  bicuspidatus        2
 6 Alpheus     sulcatus            2
 7 Etheostoma  serrifer            2
 8 Basilinna   leucotis            2
 9 Lophornis   helenae             2
10 Polyerata   amabilis            2
# … with more rows

Note however that in all cases the taxonID is unique:

> itis %>% filter(taxonomicStatus != "accepted", taxonRank == "species", !is.na(specificEpithet)) %>% count(taxonID, genus, specificEpithet) %>% filter(n==2)
# Source:   lazy query [?? x 4]
# Database: duckdb_connection
# Groups:   taxonID, genus
# … with 4 variables: taxonID <chr>, genus <chr>, specificEpithet <chr>, n <dbl>

I agree we should have the authorship info, but meanwhile a user could at least look up the taxonIDs online :-( so the information isn't truly lost.

joelnitta commented 2 years ago

So -- short term: see if the namePublishedIn column in col helps, and if the taxonID of the synonym helps resolve ambiguity.

I tried downloading col with taxalight but got an error:

> tl_create("col")
(/) Importing chunk 9 to LMDB... elapsed:  1m
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 7599 did not have 18 elements
> packageVersion("taxalight")
[1] ‘0.1.5’

(this might need its own separate issue)

Let me know if you prefer COL's convention or if you prefer the column name scientificNameAuthorship, or some combination thereof (and why!).

As mentioned in DWC, namePublishedIn refers to the "reference for the publication in which the scientificName was originally established". So I think it's pretty clear that namePublishedIn contains the publication, and scientificNameAuthorship is just the author.

Here is an example of using both scientificNameAuthorship and namePublishedIn for Hypodematium taiwanensis Ching ex K. H. Shing

<start digression>

BTW, I have been working on an R package to join names across databases that can handle the annoying inconsistencies of taxonomic authors: taxastand.

It uses DWC as the standard for the taxonomic reference. So it would potentially work very well with taxadb or taxalight to enable matching that takes into account author names (assuming author names become available). I would greatly appreciate it if you could take a look and let me know what you think!

<end digression>

cboettig commented 2 years ago

Thanks, I'll need to check on the taxalight parser; it's stricter than taxadb due to LMDB. Does taxadb::taxa_tbl("col") work for you?

Re scientificNameAuthorship, the Darwin Core examples include an example with year, as well as without year, and the notion of author and year as "authorship" is quite common. Since the combination of an author and a year usually denotes a publication, the distinction seems ambiguous to me. As you'll see in the COL data, all variation of formats exist with their use of namePublishedIn as well, so I don't think it would be simple to parse out authorship vs publication cleanly in an automated pipeline.

Thanks for sharing taxastand, that looks cool. However, I currently believe it is not generally a good idea to join names across the providers. We originally explored something like this in taxadb, e.g. given a list of scientific names, a user might be able to 'resolve' a larger fraction of them to accepted names by looking for matches in not just one but all of the taxadb tables, or looking for synonyms of the name and then trying to resolve those in the other tables, etc. However, this can easily lead to nonsense. Databases declare conflicting statements about certain taxa -- e.g. what some providers consider a synonym, other providers consider the accepted name (or an accepted name of a different species). Experts can disagree about taxonomy, and thus so can databases -- any database of names is essentially its own internally consistent (hopefully) set of "taxonomic concepts" (of course, any one provider database also changes over time as names are adjusted, added and removed). We cannot treat taxonomic databases as assemblages of "facts" about the natural world, where any observer would come up with the same facts and the facts would remain fixed. As you point out, this is essentially a limitation with Darwin Core, which is discussed much better than I can manage in this paper: https://doi.org/10.1093/database/bax100, among many others. As those authors observe, there are ontologies that are richer than Darwin Core which can better express taxon-concepts distinct from taxonIDs, allowing a database to indicate the potentially different usages of the same scientific name by different providers.
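As a rough illustration of how cross-provider resolution can go wrong, here is a small base-R sketch. Both provider tables are invented for the example (they do not reflect any real taxadb table), and the names are placeholders.

```r
# Hypothetical provider A: treats "Aus cus" as a synonym of "Aus bus"
provider_a <- data.frame(
  scientificName = c("Aus bus", "Aus cus"),
  taxonomicStatus = c("accepted", "synonym"),
  acceptedNameUsage = c("Aus bus", "Aus bus"),
  stringsAsFactors = FALSE
)

# Hypothetical provider B: holds the opposite opinion
provider_b <- data.frame(
  scientificName = c("Aus bus", "Aus cus"),
  taxonomicStatus = c("synonym", "accepted"),
  acceptedNameUsage = c("Aus cus", "Aus cus"),
  stringsAsFactors = FALSE
)

# Line up the two opinions for each name
both <- merge(provider_a, provider_b, by = "scientificName",
              suffixes = c("_a", "_b"))

# Names for which the providers disagree about the accepted name
conflicts <- both[both$acceptedNameUsage_a != both$acceptedNameUsage_b, ]
conflicts[, c("scientificName", "acceptedNameUsage_a", "acceptedNameUsage_b")]
```

In this toy case, resolving a name through provider A and then resolving the result through provider B just sends you back to where you started, which is the kind of nonsense that chained lookups across tables can produce.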

Also note that several providers are already built by combining many databases -- COL uses over 100 other databases, including ITIS and NCBI, in determining their taxonomy, as do OTT and GBIF. But this also means they need to consult literature or expert systematists to resolve conflicts that result -- hence you will find that COL does not always agree 1:1 with ITIS or NCBI either (the synthesis process takes a lot of time too, so it usually lags behind the versions of ITIS or NCBI we access directly). In general if someone wants a 'fully integrated' database of names, COL might be a good place to start.

joelnitta commented 2 years ago

I got another error with taxadb:

> taxadb::taxa_tbl("col")
Error in initialize(value, ...) : 
  duckdb_startup_R: Failed to open database: IO Error: Trying to read a database file with version number 25, but we can only read version 27.
The database file was created with an older version of DuckDB.

The storage of DuckDB is not yet stable; newer versions of DuckDB cannot read old database files and vice versa.
The storage will be stabilized when version 1.0 releases.

For now, we recommend that you load the database file in a supported version of DuckDB, and use the EXPORT DATABASE command followed by IMPORT DATABASE on the current version of DuckDB.
> packageVersion("taxadb")
[1] ‘0.1.3’

Re scientificNameAuthorship, the Darwin Core examples include an example with year, as well as without year, and the notion of author and year as "authorship" is quite common.

I believe that mainly stems from the different practices between botanists and other taxonomists. Botanists tend not to use the year, and zoologists tend to include the year.
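As a small sketch of what that difference means for matching, one option is to strip a trailing year before comparing author strings. The helper name strip_year and the sample strings below are only illustrative; this is not how taxadb or taxastand handle authorship, and it does nothing about abbreviations (e.g. "L." vs "Linnaeus").

```r
# Hypothetical helper: drop a trailing four-digit year (and the comma before it)
# from an authorship string, keeping a closing parenthesis if there is one.
strip_year <- function(authorship) {
  out <- sub(",?[[:space:]]*[0-9]{4}(\\)?)[[:space:]]*$", "\\1", authorship)
  trimws(out)
}

authors <- c(
  "Linnaeus, 1758",        # zoological style, with year
  "(Linnaeus, 1758)",      # zoological style, parenthesized
  "L.",                    # botanical style, abbreviated, no year
  "Ching ex K. H. Shing"   # botanical style, no year
)

strip_year(authors)
```

Strings without a year pass through unchanged, so the same helper can be applied to botanical and zoological data alike.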

As you'll see in the COL data, all variation of formats exist with their use of namePublishedIn as well, so I don't think it would be simple to parse out authorship vs publication cleanly in an automated pipeline.

Regardless of how scientificNameAuthorship and namePublishedIn end up getting treated in databases, I would argue that they are two distinct concepts to most taxonomists. For example, if you search for any name in IPNI, you will find it provides author and publication as separate fields. For the practical purposes of matching names, I don't think you need publication, but author is needed as I described previously.

However, I currently believe it is not generally a good idea to join names across the providers.

With taxastand, the idea is not necessarily to join names across providers (that could be an "off-label" usage, but I agree it is not a good idea for all the reasons you mentioned). Rather, it is to standardize names across data sources. This has potentially some of the same pitfalls, but I think in a typical use case there are many fewer names than trying to merge complete taxonomic databases, so they present less of a danger. For example, if I wanted to build a tree with genbank data, then map traits from a trait database onto the tree. There will surely be synonyms that need to be resolved to match the two datasources. taxastand provides a way to do this by matching them both to the same taxonomic standard (which could be e.g. a single database provided by taxadb). taxastand mostly differs from other similar packages because it can handle variation in author names, and allows using a custom standard taxonomy (not just choosing from ITIS, COL, etc).
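Here is a minimal sketch of that workflow in base R, to show the idea rather than the taxastand API (none of its functions are used here). The reference table, names, and the leaf_area trait are all hypothetical.

```r
# Hypothetical reference taxonomy: every name (accepted or synonym)
# maps to one accepted name.
reference <- data.frame(
  scientificName = c("Aus bus", "Aus cus", "Xus dus"),
  acceptedNameUsage = c("Aus bus", "Aus bus", "Xus dus"),
  stringsAsFactors = FALSE
)

# Two data sources that use different names for the same taxon
tree_tips <- data.frame(
  scientificName = c("Aus cus", "Xus dus"),
  stringsAsFactors = FALSE
)
traits <- data.frame(
  scientificName = c("Aus bus", "Xus dus"),
  leaf_area = c(1.2, 3.4),
  stringsAsFactors = FALSE
)

# A direct join on the raw names misses the synonym
direct <- merge(tree_tips, traits, by = "scientificName")
nrow(direct)

# Standardize each source against the same reference, then join on the result
standardize <- function(df, reference) {
  merge(df, reference, by = "scientificName", all.x = TRUE)
}
joined <- merge(
  standardize(tree_tips, reference),
  standardize(traits, reference),
  by = "acceptedNameUsage",
  suffixes = c("_tree", "_trait")
)
joined
```

Because only the names actually present in the two sources are looked up, the reference never has to be merged with another full taxonomic database.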

cboettig commented 2 years ago

thanks. Yes, good point about the botany/zoology divide, and I certainly agree authors and papers aren't the same thing. In an ideal world all the papers would have DOIs, all the authors would have ORCID IDs, and dealing with these references would be less messy! Certainly a package that knows how to parse all the variation in author and publication strings would be great!

For example, if I wanted to build a tree with genbank data, then map traits from a trait database onto the tree. There will surely be synonyms that need to be resolved to match the two datasources. taxastand provides a way to do this by matching them both to the same taxonomic standard

This sounds good to me, but I think it is precisely why ITIS was created, and NCBI, COL, OTT, etc. (i.e. they all recognize various synonyms used in various places and try to standardize them, precisely as you describe). It sounds like taxastand would be another such collection. Not that there aren't good reasons to add new databases, but maybe we're in some version of https://xkcd.com/927/ already?

joelnitta commented 2 years ago

xkcd well taken :)

Sorry if I haven't been clear about the purpose of taxastand, but it isn't to provide yet another "standard". It doesn't provide any data at all---it lets the user provide whatever they want, locally. I made this package starting from my own use-case (like most package creators do, I suppose). None of the existing "standard" databases (NCBI, COL, OTT, etc) worked out-of-box for my purposes, so I needed to create my own (specifically, I modified part of COL). Acknowledging the reality that there will never be "one taxonomic database to rule them all", I think it is useful to allow the user to provide their own data that works for their particular situation.

Also, as I already mentioned, the other major feature of taxastand is matching that takes into account variation in taxonomic author names, a feature that I think is lacking from other taxonomic name resolution packages (it also does fuzzy matching).