nexmlTree class when as(obj, "phylo") is a list

ropensci / RNeXML

Implementing semantically rich NeXML I/O in R

https://docs.ropensci.org/RNeXML

Other

13 stars 9 forks source link

nexmlTree class when as(obj, "phylo") is a list #27

Closed cboettig closed 10 years ago

cboettig commented 11 years ago

nexml can contain a single <tree> node inside the <trees> node, multiple <tree> nodes in a single <trees> node, or even multiple <trees> nodes. The first two cases map naturally onto a "phylo" object and a "multiPhylo" object, defined as classes in ape. In order to preserve the associations, we map the third case (multiple <trees> nodes) to a list of multiPhylo objects, which isn't something immediately reconizable as a phylogenetic tree class....

Consequently, we make this mapping automatically when asked to coerce a nexml file into a phylo, even though this means technically not returning the requested class (ask for phylo and get multiPhylo). We should probably handle that differently.

For instance, this current creates a problem when coercing a nexml file with multiple <trees> nodes into a nexmlTree class (because this includes a conversion to phylo. Because nexmlTree is the defaul read-in format now, this can cause reading in of such nexml files to fail.

Obviously it would be nice for a user to convert nexml to the ape classes without having to know how many trees are in the nexml file. Need to figure out the best way to do this.

rvosa commented 11 years ago

So the interface is necessarily that phylo/multiPhylo is returned, or could it, for consistency's sake, be a vector with zero or more multiPhylo's, always?

On Wed, Oct 23, 2013 at 2:18 AM, Carl Boettiger notifications@github.comwrote:

nexml can contain a single node inside the node, multiple
nodes in a single node, or even multiple nodes. The first two cases map naturally onto a "phylo" object and a "multiPhylo" object, defined as classes in ape. In order to preserve the associations, we map the third case (multiple nodes) to a list of multiPhylo objects, which isn't something immediately reconizable as a phylogenetic tree class.... Consequently, we make this mapping automatically when asked to coerce a nexml file into a phylo, even though this means technically not returning the requested class (ask for phylo and get multiPhylo). We should probably handle that differently. For instance, this current creates a problem when coercing a nexml file with multiple nodes into a nexmlTree class (because this includes a conversion to phylo. Because nexmlTree is the defaul read-in format now, this can cause reading in of such nexml files to fail. Obviously it would be nice for a user to convert nexml to the ape classes without having to know how many trees are in the nexml file. Need to figure out the best way to do this. — Reply to this email directly or view it on GitHubhttps://github.com/ropensci/RNeXML/issues/27 .

Dr. Rutger A. Vos Bioinformaticist Naturalis Biodiversity Center Visiting address: Office A109, Einsteinweg 2, 2333 CC, Leiden, the Netherlands Mailing address: Postbus 9517, 2300 RA, Leiden, the Netherlands http://rutgervos.blogspot.com

cboettig commented 11 years ago

@rvosa Right, we could always return a list vector with one or more multiPhylos, but this makes the interface quite cumbersome. Instead of being able to do things like:

tree <- read.nexml("S100.xml")
plot(tree)
bd.time(tree, 0,1)

You would have to explicitly subset twice (once to escape the list of trees, and again to get the phylo from the multiPhylo:

tree <- read.nexml("S100.xml")
plot(tree[[1]][[1]])
bd.time(tree[[1]][[1]], 0,1)

which will definitely trip up a user who knows that there is only one tree in the nexml. Partly this is a problem with ape, since multiPhylo doesn't inherit the methods of phylo, making the second set of [[ necessary (and making it is a relatively useless class). Adding the extra layer of a list of multiphylos makes the problem worse.

I don't like having functions whose return type changes based on internals of the data input, e.g. currently read.nexml("S100.xml", "phylo") might return a list, a multiPhylo, or a phylo object depending the how the nexml is structured, which is clearly bad programming.

Not sure what the right solution is here.

In contrast to ape, the R base class for bibliographic entries, bibentry, can be either a single entry or a list of multiple bibentry objects, such that the associated methods work in the same way whether the object has one or multiple entries:

a = citation("ouch") # contains citations to 2 papers
format(a, "text")      # formats both 
format(a[[1]], "text") # same call formats just one

cboettig commented 11 years ago

Ah, so part of the problem is in order to get our nexmlTree class to act like the S3 phylo class, we explicitly define it as a phylo, not a multiPhylo. This is kinda a fundamental problem introduced by trying to have an object behave both as a phylo, representing at most 1 tree, and a nexml object, representing 0 or more trees. Might have to rethink this class a bit...

rvosa commented 10 years ago

I don't know if this is helpful or not, but when I was "designing" Bio::Phylo I made it do what I thought was the obvious thing based on the input data. E.g. if an input file is newick format, return a tree block. If phylip, a matrix. And so on.

It turned out that this was a terrible idea, especially because Bio::Phylo's reading interface accepts a fairly wide variety of data formats that can have various permutations of zero or more trees and tree blocks, character matrices and/or taxa.

So later on I added two features: 1) a "project" object that is a container for all the contents in a file, so I always get the same container back provided I pass in an extra argument that asks for this; 2) special 'parse_matrix' and 'parse_tree' functions that explicitly return the one matrix or tree that is expected to be contained in the input file.

It turns out that I use the option to read in the contents as a "project" probably 75% of the time, and the parse_matrix and parse_tree functions the other 25%. This is probably because the "project" object (also roughly equivalent to a NeXML infoset) has a special method "get_items" that you can pass in class identifiers (named constants) so you can say project.get_items(TREE) and get a flat list of all the trees, regardless whether they were from multiple tree blocks in a nexus file, a single ToLWeb tree, or whatever.

This was a fairly functional solution. My original design I nearly never use. Let this be a warning ;-)

cboettig commented 10 years ago

@rvosa definitely helpful case to think about. I guess the analogy is that we really don't want a functional that returns a "phylo" sometimes (e.g. when there is one tree) and a "multiphylo" other times, depending on how many trees are in the given input file. Your experience seems to suggest to me that our nexml_read function should always return one of our S4 nexml objects, which corresponds directly to the schema definition, and then provide methods such as:

nex <- nexml_read("nexml_data.xml")
get_items(nex, "tree") # return a flat list of phylo objects (a single multiphylo), 
get_items(nex, "trees") # return a list of multiphylo objects (keeps multiple `<trees>` elements separate)
get_tree(nex) # an alias for `get_items(nex, "tree")`
get_items(nex, "metadata")   # get top-level metadata
get_items(nex, "metadata", "otu")   # metadata that is child element to any otu nodes

What do you think of such an interface?

It is slightly annoying that we could no longer do something like:

nex <- nexml_read("nexml_data.xml")
plot(nex) # or some other method defined only for ape::phylo

I think the fact that the phylobase::phylo4 object could not be used in functions that took an ape::phylo object as an argument was a barrier to it's adoption. Then again, the advantage of RNeXML is the external XML format, which phylobase didn't have. Internally, I suspect most users will continue to use ape trees so I suppose it is okay if the workflow is always nexml_read followed by a coercion into a phylo format...

cboettig commented 10 years ago

To take this one step further: perhaps get_tree(nex) should take an optional argument as to what format should be returned? I suspect 99% of users will be trying to get an ape tree out of the thing, but no reason why we couldn't support returning the S4 tree structure (metadata and all), or other existing formats such as ouch, as options...

cboettig commented 10 years ago

Okay, for the moment I'm depricating the nexmlTree class that acts a phylo object and a nexml object, because it doesn't also act as a multiPhylo object (and basically tries to be too clever and is therefore confusing).

We now have a get_item based interface (with aliases such as get_tree as well). There are also coercion methods, e.g. as(nexml, "multiphylo"), so a user can write whatever they find most intuitive. I settled for options for tree, trees, and flat_trees, since I wasn't comfortable discarding the heirarchical trees structure without an explicit request to do so. get_trees returns a multiphylo of multiphylos, always. get_item(nexml, "flat_trees") returns a flattened multiphylo.

rvosa commented 10 years ago

Yup, this sounds nice and consistent.