rhagenson / bio-tool-requests

Requests for Bioinformatics tools people want, but nobody has built
MIT License

Visualize Shifting Taxonomy Over Time #4

Open rhagenson opened 4 years ago

rhagenson commented 4 years ago

Input

Currently thinking the input will be a ragged, CSV-style, line-based format in which the number of fields depends on the keyword, something like:

keyword year ... ...
describe 1795 Microcebus murinus
describe 1795 Microcebus rufus
split 1994 Microcebus murinus Microcebus myoxinus
describe 2008 Microcebus arnholdi
describe 2008 Microcebus margotmarshae
synonym 2006 Microcebus mamiratra Microcebus lokobensis

The alternative "square" form I see uses an arrow syntax, which is very non-standard:

describe: 1795->Microcebus murinus
describe: 1795->Microcebus rufus
split: 1994->Microcebus murinus->Microcebus myoxinus
describe: 2008->Microcebus arnholdi
describe: 2008->Microcebus margotmarshae
synonym: 2006->Microcebus mamiratra->Microcebus lokobensis

Descriptions are unary, while splits and synonymizations are binary, so a format that blurs the two is nice, but I think the more standard format is the better option as it will integrate with other tools more easily.
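As a rough illustration of the first format, here is a minimal parsing sketch in Python. It rests on several assumptions that are not settled above: fields are whitespace-separated, every name is exactly two words (genus plus epithet), the header row starts with the word keyword, and splits and synonyms take two or more names. The names `Event`, `parse_line`, and `parse_events` are hypothetical.

```python
from dataclasses import dataclass

KEYWORDS = {"describe", "split", "synonym"}


@dataclass(frozen=True)
class Event:
    keyword: str
    year: int
    names: tuple  # one binomial for describe, two or more otherwise


def parse_line(line):
    """Parse one whitespace-separated line into an Event."""
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError(f"too few fields: {line!r}")
    keyword, year_text, *rest = tokens
    if keyword not in KEYWORDS:
        raise ValueError(f"unknown keyword: {keyword!r}")
    if not year_text.isdigit():
        raise ValueError(f"year is not a number: {year_text!r}")
    # Assumption: every name is a two-word binomial.
    if not rest or len(rest) % 2 != 0:
        raise ValueError(f"names must be two words each: {line!r}")
    names = tuple(" ".join(rest[i:i + 2]) for i in range(0, len(rest), 2))
    if keyword == "describe" and len(names) != 1:
        raise ValueError("describe takes exactly one name")
    if keyword != "describe" and len(names) < 2:
        raise ValueError(f"{keyword} takes at least two names")
    return Event(keyword, int(year_text), names)


def parse_events(text):
    """Parse a whole input, skipping blank lines and the header row."""
    events = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("keyword"):
            continue
        events.append(parse_line(line))
    return events
```

Because the number of fields varies by keyword, the sketch validates the name count per keyword rather than expecting a fixed column count.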

Output

Initial idea is to build a Sankey diagram:

[Image: Sankey example]

This would break the complete history into the years of events, with time moving rightward on the diagram. For visual immediacy, a two-way split would be drawn as 50/50, a three-way split as 33/33/33, and so on; a synonym is the reverse (50/50 merging into 1). A future addition could be estimated population sizes at each time point, but the numbers would then need to be normalized across years to keep the diagram visually consistent: a population decrease is possible, but the diagram should not look like a sudden population crash or surge, as this is not the plot for that purpose. The starting block would be labeled with just the Genus, and the first year block would hold the first described species.
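The equal-share assumption can be sketched as follows. This is only an illustration in Python: the function names and the (source, target, weight) link tuples are hypothetical, and exact fractions are used so that a three-way split sums back to the parent's weight.

```python
from fractions import Fraction


def split_links(parent, children, weight=Fraction(1)):
    """Divide the parent's flow equally among the resulting names."""
    share = weight / len(children)
    return [(parent, child, share) for child in children]


def synonym_links(source_weights, target):
    """Merge each source's flow into the target; return links and total."""
    links = [(source, target, w) for source, w in source_weights.items()]
    total = sum(source_weights.values(), Fraction(0))
    return links, total


# A two-way split: the parent name persists alongside the new species.
two_way = split_links("Microcebus murinus",
                      ["Microcebus murinus", "Microcebus myoxinus"])
```

Here `two_way` holds two links of weight 1/2 each, which is the 50/50 case described above.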

Rejection/Termination Conditions

The only ones I can think of currently are rejecting a split and a synonym that occur in the same year and involve the same species, and rejecting a describe event that occurs in the same year as a split or synonym involving the described species.
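A minimal sketch of these two checks in Python, assuming events are already parsed into (keyword, year, names) tuples; the function name and the wording of the messages are hypothetical.

```python
from collections import defaultdict


def find_conflicts(events):
    """Return (year, species, reason) for each rejection condition hit."""
    # Map each (year, species) pair to the keywords that mention it.
    seen = defaultdict(set)
    for keyword, year, names in events:
        for name in names:
            seen[(year, name)].add(keyword)

    conflicts = []
    for (year, name), keywords in sorted(seen.items()):
        if {"split", "synonym"} <= keywords:
            conflicts.append(
                (year, name, "split and synonym in the same year"))
        if "describe" in keywords and keywords & {"split", "synonym"}:
            conflicts.append(
                (year, name, "describe in the same year as a split or synonym"))
    return conflicts
```

Returning a list rather than raising on the first hit lets the tool report every problem in the input at once before rejecting it.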

Similar tools

None that I could find, hence the tool request.

Description

It shows the taxonomy within a Genus over time. It does not show complete taxonomy at a higher level over time, nor does it suggest new taxonomy. That said, I could see some benefit to incorporating a phylogenetic species concept designation for splits and synonyms to expose inconsistencies, e.g. a split called at 10% genetic difference alongside a synonym called at 12% genetic difference. With the numbers reversed these would be fine, but as given the cutoff for how diverged a species must be to warrant description is unclear.

Research Purpose

It visualizes an element of how we discuss species in an easily digested, information-dense manner. We could see quite readily how many species were recognized in a particular year, when new species were described, etc.

rhagenson commented 4 years ago

Subspecies should be handled differently from species, both visually and methodologically. A "split" into subspecies is not the same as a split into species; however, we must still allow promotion from subspecies to species. This makes split-synonym or synonym-split unbalanced in some cases, so split and synonym would no longer be functional inverses.

rhagenson commented 4 years ago

Subspecies should be handled differently from species, both visually and methodologically.

If I assume (hopefully safely) that genus-species-subspecies is always exactly three words separated by whitespace, then I could add this functionality without any unique requirements on the input.
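A small sketch of how that word-count rule might look in Python. It adds one assumption beyond the three-word rule: each name starts with a capitalized genus and continues with lowercase epithets, which is what lets two- and three-word names be mixed on one whitespace-separated line. The function names are hypothetical, and the trinomial below is only a placeholder for illustration.

```python
def group_names(tokens):
    """Group tokens into names, starting a new name at each capitalized word."""
    names = []
    for token in tokens:
        if token[0].isupper():
            names.append([token])
        elif names:
            names[-1].append(token)
        else:
            raise ValueError("a name must start with a capitalized genus")
    return [" ".join(parts) for parts in names]


def rank_of(name):
    """Classify a name by word count: two words species, three subspecies."""
    count = len(name.split())
    if count == 2:
        return "species"
    if count == 3:
        return "subspecies"
    raise ValueError(f"expected two or three words: {name!r}")


# Placeholder trinomial, used only to show the grouping.
names = group_names("Microcebus murinus Microcebus murinus murinus".split())
ranks = [rank_of(name) for name in names]
```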

rhagenson commented 4 years ago

The input should require the source for the split/synonym/describe. I see two ways of doing this:

  1. Replace year with source so 2006 becomes Zaramody et al. (2006) -- this option would require parsing out the year
  2. Add a new column source -- this option allows the year of action and source publication year to be different

Either way, parsing citations is going to drastically complicate the parsing logic, so my current vote is to add the new column source and just paste it wholesale into the final product. I see the source being a tooltip when hovering over a line, or a citation list broken down by year.
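To illustrate the second option, here is a sketch in Python that carries the source through untouched. The layout is an assumption, not something decided above: fields are tab-separated (so a citation can contain spaces) and source is the third column, ahead of the names. The function name is hypothetical.

```python
import csv
import io


def read_events_with_source(text):
    """Read tab-separated events; the source string is kept verbatim."""
    events = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        keyword, year, source, *names = row
        events.append({
            "keyword": keyword,
            "year": int(year),   # year of the taxonomic action
            "source": source,    # pasted as-is, never parsed
            "names": names,
        })
    return events


sample = ("synonym\t2006\tZaramody et al. (2006)\t"
          "Microcebus mamiratra\tMicrocebus lokobensis\n")
rows = read_events_with_source(sample)
```

Keeping year and source as separate columns means the citation year is never parsed, so the two are free to differ.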

rhagenson commented 4 years ago

A stretch feature that would sort of "future proof" this tool, while also drastically increasing its usefulness, would be to embed the input as invisible data in the output. I do not currently know how many potential output formats, if any, would support this, but I would like to have effectively a loop where a previous plot can be used alongside new input and the union of the two generates a new plot. This would let an embedded output figure remain useful for downstream work even when the input data itself is not published.
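One format that could support this is SVG, which has a `metadata` element that is not rendered. The sketch below, in Python, is only an illustration of the round trip under that assumption; the function names and the `taxonomy-input` id are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize without an ns0: prefix


def embed_input(svg_text, input_text):
    """Return the SVG with the raw input stored in a metadata element."""
    root = ET.fromstring(svg_text)
    meta = ET.SubElement(root, f"{{{SVG_NS}}}metadata")
    meta.set("id", "taxonomy-input")  # hypothetical marker id
    meta.text = input_text
    return ET.tostring(root, encoding="unicode")


def extract_input(svg_text):
    """Recover the embedded input, or None if the plot carries none."""
    root = ET.fromstring(svg_text)
    for meta in root.iter(f"{{{SVG_NS}}}metadata"):
        if meta.get("id") == "taxonomy-input":
            return meta.text or ""
    return None


def merge_inputs(old, new):
    """Union of two inputs by line, keeping first-seen order."""
    merged = []
    for line in (old + "\n" + new).splitlines():
        if line.strip() and line not in merged:
            merged.append(line)
    return "\n".join(merged)
```

A new plot would then be generated from `merge_inputs(extract_input(previous_svg), new_input)`.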

rhagenson commented 4 years ago

A possible route of implementation is R + D3, using RStudio's r2d3 package.

rhagenson commented 4 years ago

Multiple keyword calls should produce an error as a species should only have one official description/split/synonym year. Multiple sources for the same event should not really occur, but it might be that multiple references exist which could be used as the citation for an event. I think there is a definite value-add to error out rather than make a semi-arbitrary decision for the user which reference to use in the end.
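A minimal sketch of this check in Python, assuming events are (keyword, year, names) tuples with names stored as a tuple. Treating the keyword plus the full set of names as the identity of an event is one interpretation of the rule, and the function name is hypothetical.

```python
def check_unique_events(events):
    """Raise ValueError if the same event is given more than once."""
    first_seen = {}
    for keyword, year, names in events:
        key = (keyword, names)
        if key in first_seen:
            # Error out instead of choosing a year or reference for the user.
            raise ValueError(
                f"{keyword} of {', '.join(names)} listed more than once "
                f"({first_seen[key]} and {year})"
            )
        first_seen[key] = year
```

Raising on the first repeat matches the preference above for an error over a semi-arbitrary choice.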

ThatLionLady commented 4 years ago

This might be helpful. It has taxonomic information as well as graphics that could be used for bringing in information.

http://phylopic.org/

rhagenson commented 4 years ago

This might be helpful. It has taxonomic information as well as graphics that could be used for bringing in information.

http://phylopic.org/

PhyloPic appears to have good, freely available silhouette images for use in phylogenetic trees, which might be useful. The taxonomic information seems to be current rather than historical, so it does not capture the history needed here. To supplement the taxonomic history within a genus, I am investigating the use of the Integrated Taxonomic Information System (ITIS).

rhagenson commented 4 years ago

The Encyclopedia of Life, which is taxonomically backed by the Integrated Taxonomic Information System, is another potential resource, as it aggregates data from multiple sources.