ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML

phyloXML: Separate R package or not? #165

Closed gvegayon closed 5 years ago

gvegayon commented 6 years ago

Following @hlapp's suggestion in #55, here is a separate issue for discussing whether a separate package would be a good idea for providing support for the phyloXML format.

Here are @hlapp's kickoff questions:

  1. What would it take, or what should it mean, to be API signature-compatible with the RNeXML package? I.e., clearly not every function/method signature would need to (or even could) be the same, but are there core functions users would use to obtain data for downstream use (such as APE matrices or tree objects for downstream comparative analysis) that could be the same for both RNeXML and rphyloXML?
  2. Probably somewhat dependent on the answer to the above, how much sense does it make to have phyloXML support directly within this (the RNeXML) package, rather than as a separate one? I.e., how much scope creep would having it as part of RNeXML introduce, versus how much code redundancy would separate packages cause?

As I mentioned earlier, I already started working on something, but it is only a few lines of code, so I have no problem with moving or rewriting it here (for example).

cc @rvosa @cboettig

gvegayon commented 6 years ago

I have a couple of basic questions:

  1. What's the most popular way to manage tree annotations in R? I'm thinking of something that, for example, uses a matrix or a list.
  2. Does RNeXML support cycles? I understand that [NeXML does](https://github.com/nexml/nexml/wiki/NeXML-Manual#cyclical-graph-network), but I'm not sure how useful or worthwhile that is when writing the parsers.

cboettig commented 6 years ago

@gvegayon A few quick thoughts:

In general I'm all for small packages doing one thing well, so a separate package makes sense to me.

The question of what RNeXML supports depends on what serialization you have in mind. RNeXML supports everything you can do in NeXML (e.g. cycles, meta annotations), but whether or not these can map into popular phylogenetics object formats used in R (e.g. ape tree objects) is really a question about ape trees, not RNeXML. Unfortunately, most of the native R formats are not tightly defined by a schema specification, so 'supporting annotations' really just means that different developers modify those structures in different ways that may or may not be compatible with anything else (just like stand-alone programs do with the NEXUS format).

Since phyloXML also has a well-defined schema, it seems like it would be useful to just have an XSLT transform between phyloXML and NeXML that we could then use from any language. Does something like that exist?

Mapping into the R objects is tricky. RNeXML tries to implement such a mapping to the common R structures at the time of writing, but this is potentially fragile. Still, presumably if you could map phyloXML into NeXML, then you could use RNeXML to map into those formats.

We've recently begun work on a mapping from NeXML into JSON-LD. One nice thing about this approach is that JSON corresponds very nicely with R list objects: everything is just nested key-value pairs, which makes it both a lot easier to add annotations and potentially easier for developers to work with in R than the clunky S4 representation used in RNeXML. This is very much in an early/experimental phase, see: https://github.com/cboettig/nexld.

For instance, if you have a tree as a list object, you can add an annotation by just tacking it onto the list: `c(tree, list("prov:wasDerivedFrom" = "https://doi.org/doi_for_nexus_tree"))`. This mimics the way R developers tack on additional properties when they need them, but can simultaneously be serialized into valid JSON-LD and then framed into valid NeXML.
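
A minimal base-R sketch of that idea follows. The field names `id` and `newick` are assumptions made up for illustration; they are not the structure used by RNeXML or nexld, and the serialization to JSON-LD is left out.

```r
# A tree held as a plain list. The field names here are illustrative only.
tree <- list(
  id = "tree1",
  newick = "((A,B),C);"
)

# Tack a provenance annotation onto the list, as described above.
annotated <- c(
  tree,
  list("prov:wasDerivedFrom" = "https://doi.org/doi_for_nexus_tree")
)

# The result is still an ordinary list: the original fields plus one new key.
names(annotated)
annotated[["prov:wasDerivedFrom"]]
```

Because the annotation is just another named element, existing code that only reads `id` or `newick` keeps working unchanged.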

gvegayon commented 6 years ago

> In general I'm all for small packages doing one thing well, so a separate package makes sense to me.

I agree. Still, functions such as phyloxml_to_nexml and vice versa would be nice to have :smile: which is why an XSLT transform seems to be a great idea! Although, let me confess that this is the first time I have read about it.

> Unfortunately, most of the native R formats are not tightly defined by a schema specification, so 'supporting annotations' really just means that different developers modify those structures in different ways that may or may not be compatible with anything else (just like stand-alone programs do with the NEXUS format).

I thought so. I'm asking because I recently started a project in which I need to handle both annotations and tree structure (aphylo). It is hard for me to see why there's no standard for this. It seems to me that extending ape::phylo to handle annotations as well would be a good idea. I think RNeXML would benefit greatly from something like that!

> Since phyloXML also has a well-defined schema, it seems like it would be useful to just have an XSLT transform between phyloXML and NeXML that we could then use from any language. Does something like that exist?

An XSLT transform can be a separate project itself, right? One problem though (again, having read about XSLT transforms for the first time) is that phyloXML is similar to Newick in that it defines structures using nested objects. Would that be a problem?

> We've recently begun work on a mapping from NeXML into JSON-LD

That looks great! I do see the benefits of having S4 classes for complex objects like nexml, but I prefer S3 classes overall. Working with the latter does become a bit tricky as the object becomes more complex, but what I usually do is to have constructor and validation functions... but I guess that you've already thought about it!
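
As a rough sketch of the constructor-plus-validator pattern for S3 classes: the class name `annotated_tree` and its fields are hypothetical and are not taken from aphylo, ape, or RNeXML.

```r
# Low-level constructor: only assembles the object (hypothetical class name).
new_annotated_tree <- function(edges, annotations) {
  structure(
    list(edges = edges, annotations = annotations),
    class = "annotated_tree"
  )
}

# Validator: checks the invariants the rest of the code would rely on.
validate_annotated_tree <- function(x) {
  if (!is.matrix(x$edges) || ncol(x$edges) != 2L) {
    stop("`edges` must be a two-column matrix (parent, child)", call. = FALSE)
  }
  if (!is.list(x$annotations)) {
    stop("`annotations` must be a list", call. = FALSE)
  }
  x
}

# User-facing helper that combines both steps.
annotated_tree <- function(edges, annotations = list()) {
  validate_annotated_tree(new_annotated_tree(edges, annotations))
}

tr <- annotated_tree(
  edges = rbind(c(4L, 5L), c(5L, 1L), c(5L, 2L), c(4L, 3L)),
  annotations = list(source = "example")
)
class(tr)
```

Keeping the validator separate means it can be re-run after any function modifies the object, which is where S3 objects tend to drift out of shape.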

cboettig commented 6 years ago

> An XSLT transform can be a separate project itself, right?

Right, though an R package could use said stylesheet to do the mapping. (e.g. RNeXML uses an existing XSLT stylesheet to turn NeXML into RDF). @rvosa I don't suppose you have an XSLT stylesheet for phyloXML?

Re the nesting, I dunno, I think in principle that is fine, though it may make coding the XSLT sheet a pain and a half. Note that this is one of the nice features of the JSON-LD approach: you can transform between these different representations (nested, like phyloXML, or flat reference list, like NeXML) using JSON-LD frames. That's the great thing about separating the semantic meaning of the data from the structure.

rvosa commented 6 years ago

> An XSLT transform can be a separate project itself, right?

> Right, though an R package could use said stylesheet to do the mapping. (e.g. RNeXML uses an existing XSLT stylesheet to turn NeXML into RDF). @rvosa I don't suppose you have an XSLT stylesheet for phyloXML?

Nope.

> Re the nesting, I dunno, I think in principle that is fine, though it may make coding the XSLT sheet a pain and a half.

I think it would be a huge pain. You'd have to generate identifiers to link edges to nodes when generating NeXML, and conversely you'd have to query NeXML using identifiers to generate PhyloXML.
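
To make the identifier step concrete, here is a small base-R sketch (not XSLT) that flattens a nested, clade-style tree into node and edge tables. The list layout and the `n1`, `n2`, ... identifier scheme are assumptions for illustration, not real PhyloXML or NeXML output.

```r
# A nested tree in the spirit of phyloXML's <clade> elements (toy structure).
nested <- list(
  name = "root",
  children = list(
    list(name = "AB", children = list(
      list(name = "A", children = list()),
      list(name = "B", children = list())
    )),
    list(name = "C", children = list())
  )
)

# Walk the tree depth-first, minting an identifier for every node and
# recording one (source, target) edge per parent-child pair.
flatten_clades <- function(root) {
  nodes <- list()
  edges <- list()
  counter <- 0L
  visit <- function(clade, parent_id) {
    counter <<- counter + 1L
    id <- paste0("n", counter)
    nodes[[length(nodes) + 1L]] <<- data.frame(
      id = id, label = clade$name, stringsAsFactors = FALSE
    )
    if (!is.null(parent_id)) {
      edges[[length(edges) + 1L]] <<- data.frame(
        source = parent_id, target = id, stringsAsFactors = FALSE
      )
    }
    for (child in clade$children) visit(child, id)
    invisible(NULL)
  }
  visit(root, NULL)
  list(nodes = do.call(rbind, nodes), edges = do.call(rbind, edges))
}

flat <- flatten_clades(nested)
flat$nodes
flat$edges
```

Going the other way means looking up each node's children by identifier in the edge table, which is the querying step mentioned above.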

Also, the PhyloXML annotation predicates don't have a namespace, so you'd have to invent that as well. This is fixable (for example, one might use the phyloxml.org URI for this), but I'd hate to have to do this in XSLT.

gvegayon commented 6 years ago

Hmm, it seems to me that perhaps the easiest way to map between the two is to write coercion functions using basic R structures. We'll need to keep some sort of dictionary to map the two classes, but I think this is the easiest (or at least, fastest) way to approach this. What do you think?
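
Something like the following is what I have in mind, as a very rough sketch. The dictionary entries and the helper `rename_fields` are hypothetical; a real mapping between the two schemas would need to be agreed on and would cover far more than two fields.

```r
# Hypothetical dictionary from phyloXML-style field names (left) to
# NeXML-style names (right). Illustrative only, not an agreed mapping.
phyloxml_to_nexml_names <- c(
  name          = "label",
  branch_length = "length"
)

# The reverse dictionary, for coercing in the other direction.
nexml_to_phyloxml_names <- setNames(
  names(phyloxml_to_nexml_names),
  phyloxml_to_nexml_names
)

# Rename the fields of a plain list according to a dictionary, leaving
# fields that are not in the dictionary untouched.
rename_fields <- function(x, dictionary) {
  known <- names(x) %in% names(dictionary)
  names(x)[known] <- unname(dictionary[names(x)[known]])
  x
}

clade <- list(name = "A", branch_length = 0.5, confidence = 0.9)
as_nexml <- rename_fields(clade, phyloxml_to_nexml_names)
names(as_nexml)

# Round trip back to the original field names.
back <- rename_fields(as_nexml, nexml_to_phyloxml_names)
```

This only handles field names; the nested-versus-flat structure still has to be converted separately.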

cboettig commented 6 years ago

sounds good to me

cboettig commented 5 years ago

Closing old issue as it sounds like consensus is that this would be outside the scope of RNeXML.

Just a complete aside, but switching between a nested convention like PhyloXML and an unnested convention like NeXML is super simple if we represent the XML as JSON-LD and write a JSON-LD frame. Examples of this are in https://github.com/cboettig/nexld, still a work in progress (not trying to solve the PhyloXML issue in particular, but just because re-nesting and unnesting are often much easier / faster than resolving reference nodes separately).