Interpreting comparative data in nexml files

cboettig commented 10 years ago

I'm intrigued by @rvosa's suggestion of showing how RNeXML can be used to document both the character data and phylogenetic data used in comparative phylogenetics, which accounts for many of the R package use cases.

However, there may be a bit of a cultural "stereotype" to overcome in pitching this use case. From my own interactions I'm under the impression that most researchers assume that any character data in a NeXML file is the data that was coded for and used in the phylogenetic inference of the tree below it. I am afraid researchers might be hesitant to write comparative trait data to a nexml file for fear of making it look like their beautiful tree was inferred from some tiny character data set, instead of some big file of sequence data.

I'll ask some fellow practitioners about this, but perhaps there is something we can add metadata-wise to indicate that the phylogeny was/wasn't inferred from the character data provided?

Any thoughts on this?

rvosa commented 10 years ago

@hlapp, isn't this a job for the MIAPA ontology?

hlapp commented 10 years ago

Metadata are generally about positively stating or asserting facts, not the absence of them. We developed a provenance documentation recommendation at the 2nd Phylotastic Hackathon, using W3C's PROV and MIAPA. You could of course use OWL to assert that some instance that is not of type cdao:Tree prov:wasDerivedFrom the trait matrix. But it'd probably be more powerful to assert instead what exactly was derived from it. See issue #26.

bomeara commented 10 years ago

I think it's ok to just have the comparative data with the tree with no special need to note that the tree came from a different dataset. For example, one popular sample dataset in R is the geospiza one from Geiger: it has a tree, and various bird measurements, but I don't think anyone expects that the bird tree came from the included data.

cboettig commented 10 years ago

@bomeara Thanks, that's good to hear! Of course, in the geiger case the data isn't being read in as a nexus file, so there isn't the same assumption. Have you seen anyone read in or write out comparative trait data in nexus format? Does your group tend to store character data in xlsx/csv formats, or nexus, or something else?

I'd like to make the case that in comparative phylogenetics we should start publishing the relevant trait data along with the phylogenies in a single nexml file, as it would facilitate reproducibility, metadata annotation, and data exchange across different platforms and software. I don't think many comparative methods people are using nexus files for their trait data at the moment though (perhaps/hopefully I'm wrong), so I'm wondering if this will seem confusing to people.

@hlapp excellent point about documenting where the tree did come from. Perhaps I can pare that down into some simple user commands for common cases, even if it captures only the general notion (e.g. used "MrBayes" vs "simulated bd tree in R") and not the whole provenance.
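
As a rough sketch of what such a simple annotation could look like with RNeXML's existing `meta()` and `add_meta()` functions: the example below attaches a free-text note to the trees block saying how the tree was produced. The toy tree, the wording of the note, and the choice of `dc:description` as the property are all assumptions made for illustration; this is a stand-in for proper PROV/MIAPA terms, not a recommendation from the thread.

```r
library(RNeXML)
library(ape)

# Toy tree, made up for illustration only
phy <- read.tree(text = "((A:1,B:1):1,C:2);")
nex <- add_trees(phy)

# Free-text provenance note; dc:description is an assumed stand-in
# for a dedicated PROV/MIAPA term
tree_source <- meta(property = "dc:description",
                    content = "Tree inferred with MrBayes from sequence data, not from the trait matrix in this file")

# Attach the note to the first trees block rather than to the whole document
nex <- add_meta(tree_source, nex, level = "trees")
```

This captures only the general notion of where the tree came from; a machine-readable provenance record would need the ontology terms discussed above.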

hlapp commented 10 years ago

@cboettig Could you perhaps also file an issue on the MIAPA ontology tracker (referring back to this issue) about needing a term indicating that a matrix is trait data for comparative analysis?

bomeara commented 10 years ago

Note that DNA data could be comparative trait data. For example, I could make a tree from the usual phylogenetic markers and then use it to reconstruct a venom gene sequence down the tree. I'd try to deal with this as simply as possible: metadata that a tree is made from COI, 28S, and ef1a and that the comparative traits are venom genes.

As far as comparative data formats, I think xls may be most common (sigh), followed by csv and nexus (perhaps Mesquite-flavored nexus: title, multiple taxa blocks, etc).

cboettig commented 10 years ago

@bomeara Thanks! Perhaps that low adoption is in part due to problems with nexus parsers for character data in R? (e.g. problems I've run into parsing morphobank nexus files as mentioned in #42).

At least that is something we could overcome by having both tree and character data in NeXML. For instance, a user with comparative trait data could serialize that data for easy exchange and archiving with this RNeXML package as it stands:

library(RNeXML)
library(geiger)
data(geospiza)

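nexml <- add_trees(geospiza$phy)
# pass the existing object so the traits are added to it instead of replacing the tree
# (add_characters() is the name exported by released versions of RNeXML)
nexml <- add_characters(geospiza$dat, nexml)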
write.nexml(nexml, "geospiza.xml")

This generates a NeXML file, geospiza.xml. It keeps the traits and tree together in a single file (to which more annotations/metadata could easily be added) that one could deposit on Dryad etc.
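
For completeness, here is a minimal round-trip sketch showing that a recipient can pull both pieces back out of one file. It uses a tiny made-up tree and trait table (not the geospiza data) so that it depends only on RNeXML and ape, and it writes to a temporary file; the object names are illustrative assumptions.

```r
library(RNeXML)
library(ape)

# Toy inputs, made up for illustration (not real measurements)
phy <- read.tree(text = "((A:1,B:1):1,C:2);")
traits <- data.frame(size = c(1.2, 3.4, 5.6),
                     mass = c(10, 20, 30),
                     row.names = c("A", "B", "C"))

# Put the tree and the trait table into one nexml object and write it out
nex <- add_trees(phy)
nex <- add_characters(traits, nex)
path <- file.path(tempdir(), "toy.xml")
nexml_write(nex, file = path)

# Read the file back and extract the tree and the traits separately
nex_in <- nexml_read(path)
phy_in <- get_trees(nex_in)          # an ape phylo object when the file holds one tree
traits_in <- get_characters(nex_in)  # a data.frame with one row per taxon
```

The same `get_trees()` and `get_characters()` calls should apply to a file such as geospiza.xml, so downstream comparative-methods code can keep working with ordinary `phylo` and `data.frame` objects.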

cboettig commented 10 years ago

Think we're good here. Still need implementation for the rest of MIAPA ontology, #46

ropensci / RNeXML

Interpreting comparative data in nexml files #44