How would we express a simmap tree in NeXML?

cboettig commented 10 years ago

Tree "paintings" indicating different evolutionary regimes/modes along different parts of a phylogeny are a particularly common use case for comparative phylogenetics in R.

The current 'standard' seems to be simmap's representation, e.g. see Liam's description: http://blog.phytools.org/2010/12/reading-simmap-trees-into-r.html

Presumably this would be done with meta annotations on each edge indicating the state and length of time in that state along each branch?

Any ideas or existing examples?

rvosa commented 10 years ago

I was not aware of Yet Another Newick Hack, but there you have it. I guess you could either do this as multiple annotations on the "edge" element, or by introducing an unbranched internal node for each stretch of mapped character state.

cboettig commented 10 years ago

Note: this discussion continues on the nexml-discuss mailing list, where it might reach a broader audience.

Sounds like there's a good case for not modifying the topology. Meanwhile, here are notes to myself on how this might be done, with some outstanding questions to resolve, based on the discussion on the mailing list.

Perhaps an edge that changed from state 1 to state 2 might be annotated as:

<states>
  <state id="s1" label="description of state"/>
  <state id="s2" label="description of alternative state"/>
</states>
...
<edge id="e1" source="n1" target="n2" about="#e1" length="6.4">
  <meta property="x:order" content="1" xsi:type="nex:LiteralMeta" id="m1" about="#m1">
    <meta property="x:length" content="3.4" xsi:type="nex:LiteralMeta" id="m2"/>
    <meta property="x:hasState" content="s1" xsi:type="nex:LiteralMeta" id="m3"/>
  </meta>
  <meta property="x:order" content="2" xsi:type="nex:LiteralMeta" id="m4" about="#m4">
    <meta property="x:length" content="3.0" xsi:type="nex:LiteralMeta" id="m5"/>
    <meta property="x:hasState" content="s2" xsi:type="nex:LiteralMeta" id="m6"/>
  </meta>
</edge>

[ ] Obviously I have just made up the properties x:. How would I go about establishing a formal namespace for such properties?
[ ] Clearly several alternative annotations could be proposed here. For instance, one could instead declare the start time, stop time, and state for each section.
[ ] Also, not sure if my nesting of meta elements is appropriate. Perhaps they should all be wrapped in a meta declaring something like hasStochasticCharacterMapping.
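
The two annotation styles mentioned above (a length per section versus a start and stop per section) carry the same information. A minimal Python sketch of the conversion, assuming the sections are listed root to tip and that positions are measured from the start of the edge; the function name is hypothetical and not part of any NeXML tooling:

```python
from itertools import accumulate


def segments_to_intervals(segments):
    """Convert (state, length) pairs, ordered root to tip, into
    (state, start, stop) triples measured from the start of the edge."""
    # Running total of the lengths gives the stop position of each section.
    stops = list(accumulate(length for _, length in segments))
    # Each section starts where the previous one stopped.
    starts = [0.0] + stops[:-1]
    return [
        (state, start, stop)
        for (state, _), start, stop in zip(segments, starts, stops)
    ]


# The edge e1 from the example above: 3.4 in state s1, then 3.0 in state s2.
intervals = segments_to_intervals([("s1", 3.4), ("s2", 3.0)])
```

The last stop value should match the edge length (6.4 here) up to floating-point rounding, which gives a cheap consistency check for either representation.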

cboettig commented 10 years ago

Cool, I really like @mtholder's suggestion on the mailing list; it more clearly reflects the logic of NeXML elements and helps me think about how (a typical user) would (most sensibly) extend nexml (as compared to hacking the newick format yet again).

Perhaps this use-case might be a nice one to illustrate in the manuscript as an example of how an R user might go about defining a meaningful extension to nexml?

cboettig commented 10 years ago

<characters id="m1">
  <format>
    <states id="ss1">
      <state id="s1"/>
      <state id="s2"/>
    </states>
    <char id="cr1" states="ss1" label="reef-dwelling"/>
  </format>
</characters>
...
<tree>
  ...
  <edge id="e1" source="n1" target="n2">
    <meta>
      <simmap:reconstructions>
        <simmap:reconstruction character="cr1">
          <simmap:stateChange id="sc1" length="0.4" state="s2"/>
          <simmap:stateChange id="sc2" length="0.5" state="s1"/>
        </simmap:reconstruction>
      </simmap:reconstructions>
    </meta>
  </edge>
  ...
</tree>

A few minor changes from Mark's (@mtholder) suggestion. Mark annotates the node state, but it seems strange to do this in a meta element that is a child of an edge. I've also dropped the attribute edge = e1 from the simmap:stateChange node, since this annotation is already a child of the edge element e1 -- perhaps I should keep it anyway? (When phenoscape annotates a state element with a meta, it doesn't appear to explicitly reference the state id.)

One limitation is that this format doesn't explicitly state the order in which the changes occur. The order is of course implicit in the ordering of the stateChange elements, but I believe that's not quite consistent with NeXML design principles (e.g. that data should be explicit, not encoded in structure)? Happy for more feedback on this.

cboettig commented 10 years ago

Given @hlapp and @rvosa's comments in issue #23, we probably want to consider using a meta based format to define the simmap representation.

I think this is the straightforward translation of the XML-based description above into RDFa-style meta elements:

    <edge id="e1" source="n1" target="n2" length="0.9">
      <meta property="simmap:reconstructions" id="m1">
        <meta property="simmap:reconstruction" id="m2">
          <meta property="nex:char" content="cr1" id="m3">
            <meta property="simmap:stateChange" id="m4">
              <meta property="nex:length" content="0.4"/>
              <meta property="nex:state" content="s2"/>
            </meta>
            <meta property="simmap:stateChange">
              <meta property="nex:length" content="0.5"/>
              <meta property="nex:state" content="s1"/>
            </meta>
          </meta>
        </meta>
      </meta>
    </edge>

(would have ids and xsi-types on all meta elements)

Note that I've claimed that state, length and char properties are defined in the nexml namespace, but probably that's not kosher? How should that be done properly?

Naturally this extension would have to come with a definition of new terms. If I understand correctly, while ideally that would be an OWL ontology, it would be permissible just to have a plain text definition like:

simmap definitions

simmap:reconstruction : A mapping of a character state onto an edge of the phylogeny. The state may change along the length of the edge, as indicated by the stateChange child element.
simmap:stateChange : An element indicating the character state given by the reconstruction and the duration (length) the edge was in this state. stateChange elements are given sequentially in the order of the state changes from root to tip. (Note that stochastic character mapping is not well-defined for an unrooted tree.) The sum of all lengths in a reconstruction of an edge should equal the length of the edge itself.
....

(add more text for additional attributes, explanatory diagram)
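
The constraint in the stateChange definition (durations on an edge sum to the edge length) is easy to check mechanically. A minimal Python sketch using the standard library's ElementTree, assuming the property names from the snippet above and a simplified edge with the character reference and namespace declarations omitted; the function name and the length_property parameter are hypothetical:

```python
import math
import xml.etree.ElementTree as ET

# Simplified edge: ids, xsi:type and namespace declarations are omitted.
EDGE_XML = """
<edge id="e1" source="n1" target="n2" length="0.9">
  <meta property="simmap:reconstructions">
    <meta property="simmap:reconstruction">
      <meta property="simmap:stateChange">
        <meta property="nex:length" content="0.4"/>
        <meta property="nex:state" content="s2"/>
      </meta>
      <meta property="simmap:stateChange">
        <meta property="nex:length" content="0.5"/>
        <meta property="nex:state" content="s1"/>
      </meta>
    </meta>
  </meta>
</edge>
"""


def reconstruction_lengths_match(edge, length_property="nex:length"):
    """Return True if, in every reconstruction on the edge, the
    stateChange durations sum to the edge length."""
    edge_length = float(edge.get("length"))
    for rec in edge.iter("meta"):
        if rec.get("property") != "simmap:reconstruction":
            continue
        # Search all descendants so the nesting depth of stateChange
        # inside the reconstruction does not matter.
        changes = [
            m for m in rec.iter("meta")
            if m.get("property") == "simmap:stateChange"
        ]
        total = sum(
            float(child.get("content"))
            for change in changes
            for child in change.findall("meta")
            if child.get("property") == length_property
        )
        # Compare with a tolerance, since the values are decimal text.
        if not math.isclose(total, edge_length, abs_tol=1e-9):
            return False
    return True


ok = reconstruction_lengths_match(ET.fromstring(EDGE_XML))
```

The length property is a parameter because the term used for the duration is still an open question in this thread.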

Is it poor form that I use the order of meta elements to indicate the order of the state changes?

cboettig commented 10 years ago

@hlapp @rvosa Can a meta element have both child nodes and a content value (e.g. my meta property="nex:char" element in the RDFa version above)? If not, I'm not sure how to do this without re-listing the character id every time I list the state id.

hlapp commented 10 years ago

Can a meta element have both child nodes and a content value (e.g. my meta property="nex:char" element in the RDFa version above)?

The schema doesn't seem to prohibit it, but the documentation says no:

Metadata annotations in which the object is a literal value. If the @content attribute is used, then the element should contain no children.

I'm not following yet why you have to have this. Can you perhaps give an example, such as what you think you'd be forced to do but don't want to?

hlapp commented 10 years ago

Is it poor form that I use the order of meta elements to indicate the order of the state changes?

Yes. Wouldn't it be possible to add a seq or ordering or other property to indicate order? Or use the same mechanism that NeXML uses for ordering characters, but now that I think about it I'm not sure how it does that.

rvosa commented 10 years ago

Characters are sort of ordered. They have id attributes, which datum cells then reference - so in that context they are actually unordered in the sense that the location of the char element among its siblings is meaningless. But, if there are no datum cells (i.e. with compact seq elements) the convention is that the order in which tokens appear in the seq corresponds with the order in which char elements are defined. Note that char elements can also, optionally, have an integer attribute to specify codon position. Maybe you can take this as precedent for an integer property to store order?
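
A minimal Python sketch of what an explicit integer order property would buy a reader of the file: the stateChange annotations can appear in any document order and still be recovered root to tip. The property name simmap:order is an assumption here, and the reconstruction is shown without ids or namespace declarations:

```python
import xml.etree.ElementTree as ET

# The stateChange elements are deliberately listed out of order.
RECONSTRUCTION_XML = """
<meta property="simmap:reconstruction">
  <meta property="simmap:stateChange">
    <meta property="simmap:order" content="2"/>
    <meta property="nex:length" content="0.5"/>
    <meta property="nex:state" content="s1"/>
  </meta>
  <meta property="simmap:stateChange">
    <meta property="simmap:order" content="1"/>
    <meta property="nex:length" content="0.4"/>
    <meta property="nex:state" content="s2"/>
  </meta>
</meta>
"""


def child_content(element, prop):
    """Return the content of the direct child meta with the given property."""
    for child in element.findall("meta"):
        if child.get("property") == prop:
            return child.get("content")
    raise KeyError(prop)


def ordered_state_changes(reconstruction, order_property="simmap:order"):
    """Return (state, length) pairs sorted by the explicit order property,
    ignoring the position of the elements in the document."""
    changes = [
        m for m in reconstruction.findall("meta")
        if m.get("property") == "simmap:stateChange"
    ]
    changes.sort(key=lambda m: int(child_content(m, order_property)))
    return [
        (child_content(m, "nex:state"), float(child_content(m, "nex:length")))
        for m in changes
    ]


ordered = ordered_state_changes(ET.fromstring(RECONSTRUCTION_XML))
# ordered == [("s2", 0.4), ("s1", 0.5)]
```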

cboettig commented 10 years ago

@hlapp @rvosa Thanks for the feedback. Yeah, I would like this example to be as solid as possible if we're to use it as an exemplar how-to. Here's an example of the current version:

    <edge id="e1" source="n1" target="n2" length="0.9">
      <meta property="simmap:reconstructions" id="m1">
        <meta property="simmap:reconstruction" id="m2">
          <meta property="nex:char" content="cr1"/>
          <meta property="simmap:stateChange" id="m4">
            <meta property="simmap:order" content="1"/>
            <meta property="nex:length" content="0.4"/>
            <meta property="nex:state" content="s2"/>
          </meta>
          <meta property="simmap:stateChange">
            <meta property="simmap:order" content="2"/>
            <meta property="nex:length" content="0.5"/>
            <meta property="nex:state" content="s1"/>
          </meta>
        </meta>
      </meta>
    </edge>

(namespace definitions, id and about tags would be added automatically too, just omitted above).

I've added the property simmap:order to indicate the ordering of the state changes explicitly (rather than relying on the ordering of the elements). I've also moved the nex:char property to be sister rather than parent to simmap:stateChange. My thinking is that nex:char is annotating simmap:reconstruction, stating that this particular reconstruction is a reconstruction of the given character. I think this fixes most of my concerns. I have also added code that converts this to/from the simmap format used by the phytools R package.
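
For readers unfamiliar with the simmap side of that conversion, here is a rough Python sketch of the per-branch string, assuming the SIMMAP v1.0-style notation described in the phytools blog post linked at the top of this thread, in which each branch length is replaced by {state,duration:state,duration}. This is not the RNeXML or phytools code; both function names are hypothetical, and whether the pairs read root to tip or tip to root depends on the SIMMAP version and reader settings, so check that before relying on the order:

```python
def format_branch(segments):
    """Format ordered (state, duration) pairs as a SIMMAP-style
    branch annotation such as {s2,0.4:s1,0.5}."""
    return "{" + ":".join(f"{state},{duration}" for state, duration in segments) + "}"


def parse_branch(text):
    """Parse a SIMMAP-style branch annotation back into (state, duration) pairs."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError(f"not a braced branch annotation: {text!r}")
    segments = []
    for part in body[1:-1].split(":"):
        state, duration = part.split(",")
        segments.append((state.strip(), float(duration)))
    return segments


# The edge e1 from the example above, in simmap:order.
branch = format_branch([("s2", 0.4), ("s1", 0.5)])
# branch == "{s2,0.4:s1,0.5}"
```

Parsing the string back with parse_branch returns the same pairs, which is the round trip a to/from converter needs for each edge.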

@hlapp @rvosa one outstanding concern I have is if I'm okay using nex:length, nex:char, and nex:state as I do above, rather than defining new terms for these explicitly in the simmap context. I'm not sure if they are semantically identical concepts or not, e.g. simmap:length is the length of time an edge spends in a particular state, while nex:length is the length of an <edge>.

rvosa commented 10 years ago

Would it be an idea to try to generate the output using the API for nested semantic annotation and just look at what that looks like? I am also (as you are) doubtful whether it is a good idea to re-use the nex:* names for ever-so-slightly different concepts.


hlapp commented 10 years ago

one outstanding concern I have is if I'm okay using nex:length, nex:char, and nex:state as I do above, rather than defining new terms for these explicitly in the simmap context.

I think that's a bad idea. Not only, as you see, is the semantic match unclear, but there is also no nex vocabulary. It's a schema, and XML Schema per se doesn't have semantics.

cboettig commented 10 years ago

Okay, I've implemented my go at writing a simmap extension to NeXML along the lines we describe in this thread as an illustration of how RNeXML users can use the package to construct such extensions (rather than continuing to hack Newick formats as illustrated at the top of this thread). Could really use some critique from @rvosa and @hlapp on my stab at this, particularly with regards to defining a simmap namespace. I'm hoping to create an example to be something other users could reasonably do themselves without expertise in RDFa or XML, but also to be a good model case that doesn't cut corners.

You can see my attempt at explaining this implementation in this section of the manuscript: https://github.com/ropensci/RNeXML/blob/devel/inst/doc/pubs/manuscript.md#extending-the-nexml-standard-through-metadata-annotation

Obviously in addition to refining the implementation, it would be good to improve the explanation as well. (Overall not sure how much of that I will have space for in the manuscript body and what will be left to a supplement, vignette, and/or blog post, but for now not worrying about space.) @sckott would be great to get your feedback on this as well from the practical R perspective more than the valid nexml perspective.

rvosa commented 10 years ago

I had a look at it and I think it's pretty good. I find the syntax (line 730...) palatable enough, in any case. As regards telling people to create something at a URL that the namespace points to and defining their predicates there: that's nice advice (though technically nothing will break if they don't do that). Doing it in plain text is probably the best we can expect, people certainly aren't going to fire up protege to define a couple of predicates in their own research.


cboettig commented 10 years ago

@rvosa Cool, thanks for the feedback. We'll probably need to keep working on the manuscript discussion of this as we get down the road. As it sounds like we have at least some acceptable basics for a simmap extension, I think I'll close this issue for now, but feel free to re-open.

ropensci / RNeXML

How would we express a simmap tree in NeXML? #48
