ropensci / taxa

taxonomic classes for R
https://docs.ropensci.org/taxa

Represent changes to taxonomy through time? #136

Open sckott opened 6 years ago

sckott commented 6 years ago

question from @noamross

he asks if there is a standard way of representing changes to taxonomy: species being merged, split, added, deleted, etc, over time

@zachary-foster as far as i know we don't have anything like this in taxa, correct? Do you deal much with changes in names through time in taxa you work with?

I don't think the darwin core standard deals with this, and not sure of any data standard like this out there. Of course one could just make something up if nothing exists

noamross commented 6 years ago

Thanks @sckott. Some more info:

We're considering making a package to wrap data from ICTV, which defines viral taxonomy. This is a pretty volatile taxonomic area, and there are annual releases of the database with changes. So if a virus name is found in an old publication, it may not be in the current taxonomy. We'd like to be able to resolve it to its place in the current taxonomy. So a question the package might answer is: "What is the current Genus of the virus known in 2001 as X?" Or, "What are all the previous names of virus(es) that are now known as Y?" I'm wondering if there's a common approach to dealing with this.

For instance, here's an example of the convoluted history of a current species, into which other species have been merged, which has been renamed, etc.: https://talk.ictvonline.org/taxonomy/p/taxonomy-history?taxnode_id=20171861 .

For now, if we pursue this we might just start with allowing users to specify which release of the ICTV data to work with, and use the taxa framework to define relationships.

I'll probably try to get in touch with the person who maintains the ICTV database to see how they handle it. They have some way to represent this in the database that powers the website, even though the data releases themselves sort of break the links.

zachary-foster commented 6 years ago

Hi @noamross, this is an interesting problem. I can imagine a couple of solutions, depending on how flexible things need to be. The easiest I can imagine would be to concatenate all of the master species lists:

https://talk.ictvonline.org/files/master-species-lists/

into a single table with an extra column indicating the version number and using parse_tax_data like so (only one dataset shown):

> obj <- parse_tax_data(raw_data, class_cols = c("Order", "Family", "Genus", "Species"))
> obj
<Taxmap>
  5299 taxa: aab. Bunyavirales, aac. Caudovirales, aad. Herpesvirales ... hvu. Pepper ringspot virus, hvv. Tobacco rattle virus
  5299 edges: NA->aab, NA->aac, NA->aad, NA->aae, NA->aaf, NA->aag ... bik->hvq, bik->hvr, bik->hvs, bil->hvt, bil->hvu, bil->hvv
  1 data sets:
    tax_data:
    # A tibble: 4,404 x 15
      taxon_id  Sort Order  Family  Subfamily Genus  Species    `Type Species?` `Exemplar Accession… `Exemplar Isolate` `Genome Composi…
      <chr>    <dbl> <chr>  <chr>   <chr>     <chr>  <chr>                <dbl> <chr>                <chr>              <chr>           
    1 bim       1.00 Bunya… Feravi… <NA>      Ortho… Ferak ort…            1.00 L:KP710246,M:KP7102… ferak virus C51-C… ssRNA(-)        
    2 bin       2.00 Bunya… Fimovi… <NA>      Emara… Actinidia…            0    RNA1:KT861481,RNA2:… Actinidia chlorot… ssRNA(-)        
    3 bio       3.00 Bunya… Fimovi… <NA>      Emara… European …            1.00 RNA1: AY563040, RNA… Mielke             ssRNA(-)        
    # ... with 4,401 more rows, and 4 more variables: `Last Change` <chr>, `MSL of Last Change` <dbl>, Proposal <chr>, `Taxon History
    #   URL` <chr>
  0 functions:
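
A minimal sketch of the concatenation step, assuming each release has already been read into a data frame and that all releases share the same columns (the real master species lists may need their columns harmonized first). The toy tables, the `accession` column, and the `db_version` column name are placeholders for illustration, not ICTV's actual headers:

```r
# Hypothetical toy releases; real master species lists have many more columns.
msl_2001 <- data.frame(
  Genus = c("GenusX", "GenusX"),
  Species = c("Virus A", "Virus B"),
  accession = c("ACC001", "ACC002"),
  stringsAsFactors = FALSE
)
msl_2018 <- data.frame(
  Genus = c("GenusY", "GenusX"),
  Species = c("Virus A2", "Virus B"),
  accession = c("ACC001", "ACC002"),
  stringsAsFactors = FALSE
)

# Add a column recording which release each row came from.
add_version <- function(tbl, version) {
  tbl$db_version <- version
  tbl
}

# Tag every release, then stack them into one table.
releases <- list("2001" = msl_2001, "2018" = msl_2018)
all_versions <- do.call(rbind, unname(Map(add_version, releases, names(releases))))
```

The combined table could then be passed to parse_tax_data with the same class_cols as above.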

Once you do that, you should have a taxonomy with every taxon that ever existed. To get a specific version, you then just subset it by that extra column (not shown above); something like:

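# Taxon IDs of the rows (species) present in the chosen version
taxa_in_version <- obj$data$tax_data$taxon_id[obj$data$tax_data$db_version == "my_fav_version"]
# supertaxa = TRUE also keeps the higher ranks above those species
filter_taxa(obj, taxon_ids %in% taxa_in_version, supertaxa = TRUE)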

That would subset both the taxonomy and the associated table to just that version. That doesn't explicitly track changes through time, but all the data is there in a manageable form at least and I think it could be used to answer your example questions:

What is the current Genus of the virus known in 2001 as X?

  1. Find virus X in 2001 version
  2. Find accession number for virus X
  3. Look up the same accession numbers in current version
  4. Return genus for virus with accession number in current version
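
The four steps above can be sketched in base R on the combined table. This is only an illustration: the toy data, the `accession` and `db_version` column names, and the `current_genus` helper are all hypothetical, and real accession fields (which can list several segments per virus) would need parsing before an exact match like this works:

```r
# Hypothetical combined table: one row per species per release.
all_versions <- data.frame(
  db_version = c("2001", "2001", "2018", "2018"),
  Genus = c("GenusX", "GenusX", "GenusY", "GenusX"),
  Species = c("Virus A", "Virus B", "Virus A2", "Virus B"),
  accession = c("ACC001", "ACC002", "ACC001", "ACC002"),
  stringsAsFactors = FALSE
)

current_genus <- function(tbl, name, old_version, current_version) {
  # Steps 1-2: find the virus in the old release and take its accession(s).
  old <- tbl[tbl$db_version == old_version & tbl$Species == name, ]
  # Steps 3-4: look those accessions up in the current release.
  cur <- tbl[tbl$db_version == current_version &
               tbl$accession %in% old$accession, ]
  unique(cur$Genus)
}

current_genus(all_versions, "Virus A", "2001", "2018")
```

In the toy data, "Virus A" from 2001 resolves to GenusY in 2018 because both rows share an accession.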

What are all the previous names of virus(es) that are now known as Y?

  1. Look up the current accession numbers for virus Y
  2. Search for them in previous versions
  3. Return classifications for all previous versions (classifications or supertaxa functions)
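
The reverse lookup can be sketched the same way. Again the table, its column names, and the `previous_names` helper are hypothetical, and this returns plain table rows rather than the full classifications you would get from a taxmap object:

```r
# Hypothetical combined table covering three releases.
all_versions <- data.frame(
  db_version = c("2001", "2001", "2010", "2010", "2018", "2018"),
  Genus = c("GenusX", "GenusX", "GenusX", "GenusX", "GenusY", "GenusX"),
  Species = c("Virus A", "Virus B", "Virus A1", "Virus B", "Virus A2", "Virus B"),
  accession = c("ACC001", "ACC002", "ACC001", "ACC002", "ACC001", "ACC002"),
  stringsAsFactors = FALSE
)

previous_names <- function(tbl, name, current_version) {
  # Step 1: accessions of the virus in the current release.
  cur <- tbl[tbl$db_version == current_version & tbl$Species == name, ]
  # Step 2: the same accessions in every other release.
  old <- tbl[tbl$db_version != current_version &
               tbl$accession %in% cur$accession, ]
  # Step 3: report what they were called, oldest release first.
  old[order(old$db_version), c("db_version", "Genus", "Species")]
}

previous_names(all_versions, "Virus A2", "2018")
```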

Given the data I see on their website, I don't see a way to track explicit changes to taxa, especially to non-species taxa, since they don't seem to have the concept of a taxon ID that I have seen. Using accession numbers of the type specimens, you could implicitly track changes to species without too much trouble (like the examples above). You can probably extend this concept to other ranks with the following logic:

  1. Get all accessions for genus X in version N1
  2. Find same accessions in version N2
  3. Find taxa the accessions are associated with.
  4. Are they all still in genus X? Then nothing changed.
  5. If all accessions in genus Y now:
    • If genus X does not exist in N2:
      • If Y does not exist in N1: Then X renamed to Y
      • If Y does exist in N1: Then X merged with Y
    • If genus X exists in N2: species in X reassigned to Y
  6. If some in genus Y and some in genus X: some species in X reassigned to Y
  7. .... etc etc
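
A rough base R sketch of the decision logic above, covering only the cases listed in steps 4 to 6. The table layout, column names, and the `classify_genus_change` helper are hypothetical, and a real implementation would need to handle accessions that disappear or appear between versions:

```r
# Hypothetical two-release table.
tbl <- data.frame(
  db_version = c("N1", "N1", "N1", "N2", "N2", "N2"),
  Genus = c("GenusX", "GenusX", "GenusZ", "GenusY", "GenusY", "GenusZ"),
  accession = c("ACC001", "ACC002", "ACC003", "ACC001", "ACC002", "ACC003"),
  stringsAsFactors = FALSE
)

classify_genus_change <- function(tbl, genus, v1, v2) {
  old <- tbl[tbl$db_version == v1, ]
  new <- tbl[tbl$db_version == v2, ]
  # Steps 1-3: accessions of the genus in v1 and the genera holding them in v2.
  acc <- old$accession[old$Genus == genus]
  now <- unique(new$Genus[new$accession %in% acc])
  if (length(now) == 0) return("no accessions found in second version")
  # Step 4: everything is still in the same genus.
  if (identical(now, genus)) return("unchanged")
  # Step 5: every accession ended up in one other genus.
  if (length(now) == 1) {
    if (!genus %in% new$Genus) {
      if (now %in% old$Genus) return(paste("merged with", now))
      return(paste("renamed to", now))
    }
    return(paste("all species reassigned to", now))
  }
  # Step 6: some accessions stayed, some moved.
  if (genus %in% now) {
    moved_to <- paste(setdiff(now, genus), collapse = ", ")
    return(paste("some species reassigned to", moved_to))
  }
  # Anything else (e.g. a split into several new genera) is left unresolved.
  "split across several genera"
}

classify_genus_change(tbl, "GenusX", "N1", "N2")
classify_genus_change(tbl, "GenusZ", "N1", "N2")
```

With the toy data, GenusX comes back as renamed to GenusY (GenusX is gone in N2 and GenusY did not exist in N1), while GenusZ is unchanged.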

That's how I would approach it with current tools.

It might also be possible to track explicit changes to the taxonomy using a kind of "git style" diff record table, with one change per row and columns for what taxon it was, what it changed to, and which version the change was made in. Then you could reconstruct other versions by applying changes to the current version. That would be a smaller data structure, but a bit complicated, and it would require new tools for inserting new taxa in the taxonomy (which we should have in taxa eventually anyway). However, I don't see this kind of info available for download on the website.
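
To make the diff-table idea concrete, here is a toy sketch that handles renames only; merges, splits, additions, and deletions would need extra columns and logic. The `changes` table layout and the `names_at` helper are made up for illustration and do not correspond to any existing ICTV download or taxa function:

```r
# Hypothetical change log: one rename per row, with the release it happened in.
changes <- data.frame(
  version = c(2010L, 2018L),
  from = c("Virus A", "Virus A1"),
  to = c("Virus A1", "Virus A2"),
  stringsAsFactors = FALSE
)

# Species names in the latest release (2018 in this toy example).
current <- c("Virus A2", "Virus B")

# Reconstruct the names as of `version` by undoing later changes, newest first.
names_at <- function(current, changes, version) {
  undo <- changes[changes$version > version, ]
  undo <- undo[order(undo$version, decreasing = TRUE), ]
  for (i in seq_len(nrow(undo))) {
    current[current == undo$to[i]] <- undo$from[i]
  }
  current
}

names_at(current, changes, 2001L)
names_at(current, changes, 2010L)
```

Rolling back to 2001 undoes both renames, while rolling back to 2010 undoes only the 2018 one.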

In summary, I think this is doable, but there are a lot of ways it could be approached. If there is more interest in taxonomies over time, we should consider making a new class for it.

zachary-foster commented 6 years ago

Oops did not mean to close

noamross commented 6 years ago

Thanks for this helpful think-through, @zachary-foster! It looks like, internally, there is a stable itis_id that describes the same taxon across years, as well as a taxon_id that is unique for every node across years. So this can do the job that would otherwise require looking up the accession numbers. I'm working on pulling all this information out and I'll report back how I do using taxa to make sense of it.