Open sckott opened 6 years ago
Thanks @sckott. Some more info:
We're considering making a package to wrap data from ICTV, which defines viral taxonomy. This is a pretty volatile taxonomic area, and there are annual releases of the database with changes. So if a virus name is found in an old publication, it may not be in the current taxonomy. We'd like to be able to resolve it to its place in the current taxonomy. So a question the package might answer is: "What is the current Genus of virus known in 2001 as X?" Or, "What are all the previous names of virus(es) that are now known as Y?" I'm wondering if there's a common approach to dealing with this.
For instance, here's an example of the convoluted history of a current species, into which other species have been merged, which has been renamed, etc.: https://talk.ictvonline.org/taxonomy/p/taxonomy-history?taxnode_id=20171861
For now, if we pursue this we might just start with allowing users to specify which release of the ICTV data to work with, and use the taxa framework to define relationships.
I'll probably try to get in touch with the person who maintains the ICTV database to see how they handle it. They have some way to represent this in the database that powers the website, even though the data releases themselves sort of break the links.
Hi @noamross, this is an interesting problem. I can imagine a couple of solutions, depending on how flexible things need to be. The easiest I can imagine would be to concatenate all of the master species lists:
https://talk.ictvonline.org/files/master-species-lists/
into a single table with an extra column indicating the version number, and then use `parse_tax_data` like so (only one dataset shown):
> obj <- parse_tax_data(raw_data, class_cols = c("Order", "Family", "Genus", "Species"))
> obj
<Taxmap>
5299 taxa: aab. Bunyavirales, aac. Caudovirales, aad. Herpesvirales ... hvu. Pepper ringspot virus, hvv. Tobacco rattle virus
5299 edges: NA->aab, NA->aac, NA->aad, NA->aae, NA->aaf, NA->aag ... bik->hvq, bik->hvr, bik->hvs, bil->hvt, bil->hvu, bil->hvv
1 data sets:
tax_data:
# A tibble: 4,404 x 15
taxon_id Sort Order Family Subfamily Genus Species `Type Species?` `Exemplar Accession… `Exemplar Isolate` `Genome Composi…
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 bim 1.00 Bunya… Feravi… <NA> Ortho… Ferak ort… 1.00 L:KP710246,M:KP7102… ferak virus C51-C… ssRNA(-)
2 bin 2.00 Bunya… Fimovi… <NA> Emara… Actinidia… 0 RNA1:KT861481,RNA2:… Actinidia chlorot… ssRNA(-)
3 bio 3.00 Bunya… Fimovi… <NA> Emara… European … 1.00 RNA1: AY563040, RNA… Mielke ssRNA(-)
# ... with 4,401 more rows, and 4 more variables: `Last Change` <chr>, `MSL of Last Change` <dbl>, Proposal <chr>, `Taxon History
# URL` <chr>
0 functions:
Once you do that, you should have a taxonomy with every taxon that ever existed. To get a specific version, you then just subset it by that extra column (not shown above); something like:
taxa_in_version <- obj$data$tax_data$taxon_id[obj$data$tax_data$db_version == "my_fav_version"]
filter_taxa(obj, taxon_ids %in% taxa_in_version)
That would subset both the taxonomy and the associated table to just that version. That doesn't explicitly track changes through time, but all the data is there in a manageable form at least and I think it could be used to answer your example questions:
What is the current Genus of virus known in 2001 as X
What are all the previous names of virus(es) that are now known as Y?
`classifications` or `supertaxa` functions)
Given the data I see on their website, I don't see a way to track explicit changes to taxa, especially to non-species taxa, since they don't seem to have the concept of a taxon ID that I have seen. Using accession numbers of the type specimens, you could implicitly track changes to species without too much trouble (like the examples above). You can probably extend this concept to other ranks with the following logic:
That's how I would approach it with current tools.
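To make the accession-number idea a bit more concrete, here is a minimal base R sketch of how the two example questions might be answered from a combined table. Everything in it is an assumption for illustration: the column names (`db_version`, `species`, `genus`, `accession`), the helper functions `current_genus` and `previous_names`, and the data are all made up, and it does not use `taxa` at all. It also assumes the accession string is written identically in every release, which may not hold for the real master species lists.

```r
# Hypothetical combined table: one row per species per release.
# Column names and values are invented for illustration only.
msl <- data.frame(
  db_version = c(2001, 2010, 2017, 2001, 2017),
  species    = c("Virus X", "Virus X2", "Virus Y", "Virus P", "Virus P"),
  genus      = c("Genus A", "Genus A", "Genus B", "Genus C", "Genus C"),
  accession  = c("ACC001", "ACC001", "ACC001", "ACC002", "ACC002"),
  stringsAsFactors = FALSE
)

# "What is the current Genus of the virus known in <version> as <name>?"
current_genus <- function(tbl, name, version) {
  # Accession(s) attached to that name in the old release
  acc <- tbl$accession[tbl$species == name & tbl$db_version == version]
  latest <- max(tbl$db_version)
  # Genus of whatever carries the same accession in the latest release
  unique(tbl$genus[tbl$accession %in% acc & tbl$db_version == latest])
}

# "What are all the previous names of the virus now known as <name>?"
previous_names <- function(tbl, name) {
  latest <- max(tbl$db_version)
  acc <- tbl$accession[tbl$species == name & tbl$db_version == latest]
  # Names that shared the accession in earlier releases
  old <- tbl$species[tbl$accession %in% acc & tbl$db_version != latest]
  setdiff(unique(old), name)
}

current_genus(msl, "Virus X", 2001)
previous_names(msl, "Virus Y")
```

This only follows species that keep their exemplar accession. A species that was merged into another one would likely drop out, since the merged species keeps a single exemplar.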
You might also be able to track explicit changes to the taxonomy using a kind of "git style" diff record table, with one change per row and columns for what taxon it was, what it changed to, and which version the change was made in. Then you could reconstruct other versions by applying changes to the current version. That would be a smaller data structure, but a bit complicated, and it would require new tools for inserting new taxa in the taxonomy (which we should have in `taxa` eventually anyway). However, I don't see this kind of info available for download on the website.
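As a rough sketch of what that diff record table could look like, here is a base R toy that rebuilds the species names of an older release by undoing changes from the current one. The table layout (`version`, `from`, `to`), the function `names_at`, and the data are hypothetical. It handles simple renames only; merges, splits, additions and deletions would need extra columns and logic.

```r
# Species names in the latest (hypothetical) release, here 2017
current <- c("Virus Y", "Virus P")

# One rename per row: what the taxon was, what it became, and in which release
changes <- data.frame(
  version = c(2010, 2017),
  from    = c("Virus X", "Virus X2"),
  to      = c("Virus X2", "Virus Y"),
  stringsAsFactors = FALSE
)

# Reconstruct the names as of `version` by undoing later changes, newest first
names_at <- function(current, changes, version) {
  undo <- changes[changes$version > version, , drop = FALSE]
  undo <- undo[order(undo$version, decreasing = TRUE), , drop = FALSE]
  for (i in seq_len(nrow(undo))) {
    current[current == undo$to[i]] <- undo$from[i]
  }
  current
}

names_at(current, changes, 2001)
names_at(current, changes, 2010)
```

Undoing newest first matters here, because a taxon renamed twice has to be walked back one step at a time.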
In summary, I think this is doable, but there are a lot of ways it could be approached. If there is more interest in taxonomies over time, we should consider making a new class for it.
Oops did not mean to close
Thanks for this helpful think-through, @zachary-foster! It looks like, internally, there is a stable `itis_id` that describes the same taxon across years, as well as a `taxon_id` that is unique for every node across years. So this can do the job that would otherwise require looking up the accession numbers. I'm working on pulling all this information out and I'll report back how I do using `taxa` to make sense of it.
question from @noamross
he asks if there is a standard way of representing changes to taxonomy: species being merged, split, added, deleted, etc, over time
@zachary-foster as far as I know we don't have anything like this in `taxa`, correct? Do you deal much with changes in names through time in taxa you work with? I don't think the Darwin Core standard deals with this, and I'm not sure of any data standard like this out there. Of course one could just make something up if nothing exists.