ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML

more metadata use cases #21

Closed cboettig closed 6 years ago

cboettig commented 11 years ago

Many R-based tools need ultrametric / time-calibrated phylogenies, and R also provides several tools to produce them. A good use case for metadata reading and writing might be to work out what metadata we would add if we read in an uncalibrated phylogeny, use a given function (and potentially a given parameter choice) in a given software package to perform the time-calibration, and then write out the time-calibrated tree. For instance, we might annotate:

rvosa commented 11 years ago

It would be good to have annotations on the nodes that were 'anchored', e.g. what the basis of the date was (fossil?), whether it's a lower or upper limit or a range.
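
One way to picture such an annotation is as a small record per anchored node. The following is only a sketch in Python, not anything RNeXML provides: the class name `NodeCalibration` and its field names are hypothetical, chosen to mirror the three things listed above (basis of the date, lower limit, upper limit or range).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeCalibration:
    """Hypothetical record for a node whose age was 'anchored'."""
    node_id: str
    basis: str                        # e.g. "fossil"; free text in this sketch
    min_age: Optional[float] = None   # lower limit on the node age, if any
    max_age: Optional[float] = None   # upper limit on the node age, if any

    def __post_init__(self):
        # A calibration with no bound at all carries no information.
        if self.min_age is None and self.max_age is None:
            raise ValueError("a calibration needs at least one bound")
        if (self.min_age is not None and self.max_age is not None
                and self.min_age > self.max_age):
            raise ValueError("min_age must not exceed max_age")

    @property
    def bound_type(self) -> str:
        """Return 'lower', 'upper' or 'range' depending on which bounds are set."""
        if self.min_age is not None and self.max_age is not None:
            return "range"
        return "lower" if self.min_age is not None else "upper"
```

Each field would then map onto one metadata annotation on the node; which vocabulary to use for the predicates is left open here.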

cboettig commented 11 years ago

Extending the to-do list based on semantic objectives proposed in Kseniia's project description, with some comments from me on implementation:

  • to convey links from trees to associated publications; I think I will just map R's citation class object into PRISM metadata, as done in TreeBASE. Any preference for PRISM over Shotton's SPAR ontologies for this?

  • to convey links from terminal nodes (less importantly, internal nodes) to taxonomic identifiers (and other forms of alternative labeling); Native to NeXML already. We should just add a function that uses Scott's taxize package to get TSN identifiers for species names (e.g. when extracted from an ape::phylo$tip.label) and adds the identifier to the otu node metadata.

  • to convey reconciliation results (duplication, speciation, lateral transfer); Um, not so sure. Can someone point me to examples of NeXML files that have such annotations?

  • to convey compound branch features such as lengths with uncertainties (a la DateLife), or multiple types of support values (bootstrap + posteriors). It seems like this would be most useful if we provided functions that could also operate on this data. For now, this data would be read into R and could be displayed, but as it is not part of the ape or phylo4 classes, no function could do anything with it. Ideally I imagine providing a function that could "draw a tree" from the distribution implied by the branch uncertainty, providing an easy way for R programmers to integrate over this uncertainty using only existing tools.

Also still need to figure out the best way to write annotations to branches. Currently this requires knowledge of the S4 structure.

hlapp commented 11 years ago

Re: citation metadata, you might want to consider the BIBO vocabulary.

rvosa commented 11 years ago

Hi Carl,

  • to convey links from trees to associated publications; I think I will just map R's citation class object into PRISM metadata, as done in TreeBASE. Any preference for PRISM over Shotton's SPAR ontologies for this?

I don't care.

  • to convey links from terminal nodes (less importantly, internal nodes) to taxonomic identifiers (and other forms of alternative labeling); Native to NeXML already. We should just add a function that uses Scott's taxize package to get TSN identifiers for species names (e.g. when extracted from an ape::phylo$tip.label) and adds the identifier to the otu node metadata.

Sounds great. I was a little worried at first when you said "terminal nodes" but you clarify later that you mean metadata attached to the otu element, which is probably the better place for taxonomic identifiers.
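
To make "metadata attached to the otu element" concrete, here is a minimal sketch using Python's standard library. The NeXML namespace and the `nex:LiteralMeta` type come from the NeXML schema; the choice of `dc:identifier` as the predicate, the `TSN:` content format, and the function name `otu_with_identifier` are assumptions made for illustration, and the TSN value in the usage line is a placeholder rather than a real identifier.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("nex", NEX)
ET.register_namespace("xsi", XSI)


def otu_with_identifier(otu_id, label, tsn):
    """Build an <otu> element carrying one literal <meta> child.

    The predicate (dc:identifier) and the "TSN:<n>" content format are
    assumptions for this sketch, not a convention fixed by NeXML.
    """
    otu = ET.Element(f"{{{NEX}}}otu", {"id": otu_id, "label": label})
    ET.SubElement(otu, f"{{{NEX}}}meta", {
        "id": f"{otu_id}_tsn",
        f"{{{XSI}}}type": "nex:LiteralMeta",
        "property": "dc:identifier",
        "datatype": "xsd:string",
        "content": f"TSN:{tsn}",
    })
    return otu


# Placeholder TSN; a real value would come from a taxonomic lookup.
otu = otu_with_identifier("ou1", "Homo sapiens", "0000")
print(ET.tostring(otu, encoding="unicode"))
```

In a complete document the `dc` and `xsd` prefixes would also have to be declared on the root element, which this fragment does not do.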

  • to convey reconciliation results (duplication, speciation, lateral transfer); Um, not so sure. Can someone point me to examples of NeXML files that have such annotations?

Gene duplication and speciation events are usually mapped onto trees using phyloxml or nhx (i.e. Chris Zmasek has developed this). In Bio::Phylo I've added the option of reading and writing phyloxml and translating it to nexml. The way I dealt with the events annotations was to make the terms as they are used in phyloxml into semantic annotations whose namespace is "http://www.phyloxml.org/1.10/terms#". I don't know if it's urgent to replicate this functionality in R, though.
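
As a rough illustration of that approach, the sketch below (Python, standard library only) turns an event count on a node into the attributes of a semantic annotation whose predicate lives in the namespace quoted above. The prefix `pxml`, the function names, and the assumption that the predicate local names match the phyloxml element names `duplications`, `speciations` and `losses` are all illustrative; the exact terms Bio::Phylo emits may differ.

```python
PHYLOXML_TERMS = "http://www.phyloxml.org/1.10/terms#"
PREFIXES = {"pxml": PHYLOXML_TERMS}   # the prefix name is an assumption

# Assumed to mirror phyloxml's event element names.
EVENT_TERMS = {"duplications", "speciations", "losses"}


def event_annotation(node_id, event, count=1):
    """Return the attributes of a literal annotation for one event count."""
    if event not in EVENT_TERMS:
        raise ValueError(f"unknown event term: {event!r}")
    if count < 0:
        raise ValueError("count must be non-negative")
    return {
        "about": f"#{node_id}",
        "property": f"pxml:{event}",
        "datatype": "xsd:integer",
        "content": str(count),
    }


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a prefix:local name into a full URI using the prefix table."""
    prefix, _, local = curie.partition(":")
    return prefixes[prefix] + local
```

Expanding the predicate of such an annotation with `expand_curie` yields a URI inside the phyloxml terms namespace, which is what makes the annotation resolvable independently of the prefix chosen.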

  • to convey compound branch features such as lengths with uncertainties (a la DateLife), or multiple types of support values (bootstrap + posteriors). It seems like this would be most useful if we provided functions that could also operate on this data. For now, this data would be read into R and could be displayed, but as it is not part of the ape or phylo4 classes, no function could do anything with it. Ideally I imagine providing a function that could "draw a tree" from the distribution implied by the branch uncertainty, providing an easy way for R programmers to integrate over this uncertainty using only existing tools.

My guess is that this might be the most important feature that R users might take out of this. They're going to want to do numerical things so if NeXML can offer them branch lengths (with intervals) and support values so they can easily rip through them across a large tree or a set of trees I think that would be great.

Secondly, by "draw a tree" I suppose you mean to simulate one (or one million) within the interval that is specified in the annotation (prettier still if that annotation also specifies what the underlying distribution is, I guess).
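
A minimal sketch of that kind of simulation, in Python with only the standard library: each edge carries a (low, high) interval and one "tree" is a set of branch lengths drawn within those intervals. The uniform distribution is an assumption made here because no underlying distribution is specified; the function names and the dictionary representation of a tree are likewise hypothetical.

```python
import random


def sample_branch_lengths(intervals, rng):
    """Draw one branch length per edge, uniformly within its interval.

    `intervals` maps an edge id to a (low, high) tuple. Uniform sampling
    is an assumption; an annotation naming the distribution would
    replace it.
    """
    lengths = {}
    for edge, (low, high) in intervals.items():
        if low > high:
            raise ValueError(f"edge {edge!r}: low exceeds high")
        lengths[edge] = rng.uniform(low, high)
    return lengths


def sample_trees(intervals, n, seed=None):
    """Return n independent draws of the branch lengths."""
    rng = random.Random(seed)
    return [sample_branch_lengths(intervals, rng) for _ in range(n)]


# Two edges with illustrative intervals; 1000 simulated sets of lengths.
intervals = {"e1": (0.5, 1.5), "e2": (2.0, 2.5)}
draws = sample_trees(intervals, 1000, seed=42)
mean_e1 = sum(d["e1"] for d in draws) / len(draws)
```

Any statistic computed per draw and then summarised across draws integrates over the branch-length uncertainty, which is the use described above.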

With my Monday morning eyes I first thought you were talking about visualization, which would also be excellent. Is the current industry standard to somehow convince FigTree to show node bars which you then poke at in Illustrator? Anyway, visualization of NeXML annotations would be great too, though kind of a separate story altogether.

Also still need to figure out the best way to write annotations to branches. Currently requires knowledge of the S4 structure.

I have no good tips here. Other than for the Java API I haven't implemented edge objects with annotations attached to them.

Rutger