character matrix strategies

cboettig commented 11 years ago

NeXML, like NEXUS, can contain a lot of character matrix (e.g. sequence) data. The current approach, like the read.nexus functions, simply ignores it.

We will want to be able to read and manipulate R objects without having to carry around the weight of the character data. This can probably be controlled through the read_nexml top-level API, e.g. nexml_read("file.nexml", type = "phylo") vs. type = "character_matrix".
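A minimal base R sketch of what such a type switch could look like. Everything here is hypothetical: `read_as` and the toy `doc` list only stand in for a parsed NeXML file and are not part of any existing API.

```r
# Hypothetical sketch: `doc` is a toy stand-in for a parsed NeXML file,
# and `read_as` is an illustrative name, not an existing function.
doc <- list(
  trees      = list(tree1 = "((sp1,sp2),sp3);"),
  characters = list(m1 = matrix(c("A", "C",
                                  "G", "T",
                                  "A", "C"),
                                nrow = 3, byrow = TRUE,
                                dimnames = list(c("sp1", "sp2", "sp3"), NULL)))
)

read_as <- function(doc, type = c("phylo", "character_matrix")) {
  # match.arg() rejects anything other than the listed choices
  type <- match.arg(type)
  switch(type,
         phylo            = doc$trees,       # leave the character data behind
         character_matrix = doc$characters)  # leave the trees behind
}

trees_only <- read_as(doc, type = "phylo")
chars_only <- read_as(doc, type = "character_matrix")
```

The point of the design is that a caller who only wants the tree never has to hold the (possibly large) matrix in the returned object.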

We need to figure out what R object we want to coerce character matrices to, if any. Not familiar with many R functions that use sequence data, so learning what functions exist and what formats they expect would be a first step. Meanwhile, we will presumably just read it into our S4 object equivalent. Methods can always be added later.

Comparative methods, which dominate the phylogenetics R tools, have the notion of character data as well, but usually as phenotypic data that is not meant to be informative of the tree inference, has no notion of alignment, etc. It would seem strange to represent this data in the NeXML in the same way. @rvosa What is the best way to go about this? Presumably this is related to the phenoscape project, but I haven't looked at that. Advice / strategies welcome.

hlapp commented 11 years ago

On Aug 12, 2013, at 2:15 PM, Carl Boettiger wrote:

> nexml_read("file.nexml", type = "phylo") vs. type = "character_matrix".

Or simply "matrix"?

> We need to figure out what R object we want to coerce character matrices to, if any.

There are packages for molecular matrices. The best way to obtain tips for the most suitable ones is probably to ask on r-sig-phylo. I have seen several mentioned there in the past.

For other matrices, use either an R matrix (in which case the characters all need to be of the same type, either numeric or text) or an R data frame. You could also ask on r-sig-phylo about good ways to handle that; there are lots of comparative trait analysis packages.
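To illustrate the matrix versus data frame trade-off, here is a small base R sketch. The taxa and trait values are made up for illustration.

```r
# Toy trait data for three taxa (values are invented for illustration).
body_mass <- c(sp1 = 1.2, sp2 = 3.4, sp3 = 2.1)                          # continuous
diet      <- c(sp1 = "herbivore", sp2 = "carnivore", sp3 = "herbivore")  # discrete

# A matrix has a single type, so mixing the two coerces the numbers to text.
as_matrix <- cbind(body_mass, diet)

# A data frame keeps one type per column.
as_df <- data.frame(body_mass, diet, stringsAsFactors = FALSE)
```

Here `as_matrix` is a character matrix, while `as_df$body_mass` stays numeric, which is usually what comparative trait functions expect.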

> Comparative methods, which dominate the phylogenetics R tools, have the notion of character data as well, but usually as phenotypic data that is not meant to be informative of the tree inference, has no notion of alignment, etc.

Not sure what you mean here - surely trees are inferred from non-molecular character data?

rvosa commented 11 years ago

I'm guessing you mean that people might do comparative analyses in R such that you already have a tree (which was based on, say, a molecular marker) and you then have some other, smaller, set of data for which you want to study correlation between traits.

To my mind it would be fine to use NeXML for both those data: the data from which the tree was inferred, the tree itself and data that you want to analyze comparatively can all live happily in the same document.

I guess for that second data set you might have more to say about it because it's only a couple of columns of continuous or multistate data that you have collected for that study (hence your reference to phenoscape) - but that's fine too, just annotate the data as you see fit:

if you have something to say about a specific observation/measurement, annotate that matrix cell
if you have something to say about a character (e.g. what do I mean by "intercheliceral sclerite"), annotate the char element
if you have something to say about how you organized a character in states (e.g. what do I mean by "small", "medium" or "large"), annotate the state element(s)

cboettig commented 10 years ago

So taking another look at character matrix data. I don't have a clear use-case in mind so the translation is less obvious to me.

I see that we could easily map the <matrix> element into an R matrix, rows into rows, cells into cells, but I'm not quite sure on the particulars. For instance, how do we handle all the different cell types (AbstractCell vs DnaCell, etc.)?

Also not sure if we would then de-reference (or whatever you call it) the states and otus, such that we have, say, species names as the row names and state names as the column names or whatever, rather than their reference ids.

Again, since I'm not as familiar with this data structure or the R workflow it might be destined for, I'm not so clear on how to implement it.
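A rough base R sketch of that de-referencing step, assuming the ids and labels have already been pulled out of the XML. All ids, labels and values below are invented for illustration.

```r
# Hypothetical ids and labels, mimicking id-referenced NeXML content.
otu_labels   <- c(t1 = "Species A", t2 = "Species B")
char_labels  <- c(c1 = "log body mass", c2 = "diet")
state_labels <- c(s0 = "herbivore", s1 = "omnivore")

# Raw matrix as it might come out of the XML: everything keyed by id.
raw <- matrix(c("4.2", "s1",
                "3.9", "s0"),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("t1", "t2"), c("c1", "c2")))

# Swap ids for labels using named-vector lookups.
deref <- raw
rownames(deref) <- unname(otu_labels[rownames(raw)])
colnames(deref) <- unname(char_labels[colnames(raw)])
# Replace state ids by state labels in the discrete column only.
deref[, "diet"] <- unname(state_labels[raw[, "c2"]])
```

Keeping `raw` around alongside `deref` would make it possible to map labels back to ids when writing a file out again.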

rvosa commented 10 years ago

Can we come up with an interesting use case perhaps? Like a comparative analysis of a continuous and a discrete character? That would kind of guide the issue of managing the references between taxa and multiple character state matrices.

cboettig commented 10 years ago

Yeah, that sounds like an ideal use case for this context.

Would be great to have a simple example with a continuous trait and a discrete trait for each otu in the tree. Can you point me to an example nexml as a starting place? (Otherwise I can probably construct one -- just have to wrap my head around the nexml logic for character elements, e.g. what nodes I need and how the inheritance works.)

This inheritance tree is probably a good starting point for me: http://nexml.org/nexml/html/doc/schema-1/characters/continuous/inheritance/

Not sure how discrete characters fit into that inheritance structure. I guess I write the code against just the abstract classes anyway, as with the trees?

rvosa commented 10 years ago

I sent a pull request that adds an additional example file.

cboettig commented 10 years ago

@rvosa I'm a bit foggy on the inheritance rules for some of the characters schema, in particular I cannot figure out where <cell> nodes are defined. From where do they get their "char" and "state" attributes?

So far I've implemented nodes as follows; please highlight any mistakes:

characters, inherits "TaxaLinked" and contains one "format" node followed by one "matrix" node.
format, inherits "Annotated", contains a <states> node and 1 or more char nodes.
char, inherits "IDTagged". EDIT: provides attribute "states". Or does it inherit that from someone else?
matrix, inherits "Annotated", contains 1 or more "row" nodes.
row, inherits "TaxonLinked", contains 0 or more cell nodes, 0 or more seq nodes (?).
states, inherits "IDTagged", contains one or more state nodes.
state, inherits "IDTagged", includes the additional attribute "symbol".
uncertain_state_set, inherits "state", contains two or more member nodes (??).
polymorphic_state_set, inherits "state", contains two or more member nodes (??).
member, no idea. Guess it inherits the attribute "state" from whoever "cell" inherits "state" from?
cell, no idea. Perhaps it inherits base and provides the attributes "char" and "state"?
seq, no idea. Appears not to have attributes? Contains a text string?

EDIT: I think these are all the nodes I see in the example file. What additional definitions will we need to get started?
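For concreteness, part of the node list above could be sketched as S4 classes like this. The class and slot names simply mirror my guesses in that list; they are assumptions for illustration, not confirmed schema definitions.

```r
library(methods)

# Hypothetical classes mirroring the guessed inheritance in the list above.
setClass("IDTagged", slots = c(id = "character"))
setClass("state", contains = "IDTagged", slots = c(symbol = "character"))
setClass("member", slots = c(state = "character"))
setClass("uncertain_state_set", contains = "state",
         slots = c(member = "list"))

# A method written once against the parent class ...
setGeneric("node_id", function(x) standardGeneric("node_id"))
setMethod("node_id", "IDTagged", function(x) x@id)

# ... is inherited by every class further down the chain.
s  <- new("state", id = "s1", symbol = "1")
us <- new("uncertain_state_set", id = "s3", symbol = "?",
          member = list(new("member", state = "s1"),
                        new("member", state = "s2")))
node_id(us)  # dispatches via state -> IDTagged
```

If the real inheritance turns out different, only the `contains` arguments need to change; methods written against the parent classes stay as they are.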

cboettig commented 10 years ago

Overall this issue has now been parceled out and addressed in #36, #37, #38, #39, #42, #44.

I'm still not quite sure I got the inheritance map worked out perfectly (see comments above), but we seem to be able to handle valid character matrices for the most part now. Since the general strategy is in place through the add_characters and get_characters functions, I think we can close this one. Details will be followed up in additional issues.

ropensci / RNeXML

character matrix strategies #12