General purpose accessor functions for nexml object inspection

hlapp commented 5 years ago

As has come up within RPhenoscape (see phenoscape/rphenoscape#64), inspecting the nexml object for some basic properties, such as the number of taxa, number of characters, etc., can be difficult for novice (and arguably even for more advanced) users.

I added documentation to the RPhenoscape vignette (phenoscape/rphenoscape#67), but am wondering whether it wouldn't be broadly useful to have a set of basic methods, such as the following:

ntaxa(nex, block = 1)
ncharacters(nex, block = 1)
nmatrices(nex, block = 1)
nstates(nex, character)
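
A minimal sketch of how such accessors might look. It runs against a hypothetical nested list standing in for the nexml object, so the field names (otus, characters, char, states) are assumptions rather than the actual RNeXML slot layout. In this sketch nmatrices() counts across blocks and so takes no block argument, and nstates() gains a block argument.

```r
# Hypothetical stand-in for a parsed nexml object: a plain nested list.
# The real RNeXML class is S4 and its layout may differ; this is only a sketch.
nex <- list(
  otus = list(list(otu = list("sp1", "sp2", "sp3"))),
  characters = list(list(
    char = list("habitat", "log_size"),
    states = list(habitat = c("reef", "open_water"), log_size = character(0))
  ))
)

# Each accessor only takes the length of a list, so nothing is parsed or copied.
ntaxa       <- function(nex, block = 1) length(nex$otus[[block]]$otu)
ncharacters <- function(nex, block = 1) length(nex$characters[[block]]$char)
nmatrices   <- function(nex) length(nex$characters)
nstates     <- function(nex, character, block = 1) {
  length(nex$characters[[block]]$states[[character]])
}

ntaxa(nex)
ncharacters(nex)
nmatrices(nex)
nstates(nex, "habitat")
```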

hlapp commented 5 years ago

To be clear, I'd be willing to give this a start if greenlighted.

cboettig commented 5 years ago

This is a really good question.

To me, it is indicative of a deeper issue that perhaps we don't have the most intuitive data structures or we haven't communicated them well. In general I don't think creating custom methods to return things like the number of elements is really ideal, since it requires the user to learn our idiosyncratic way of asking about something rather than leveraging existing knowledge about standard R objects.

I think these might be natural things to put into a summary() method for the nexml class. (Some of them are already computed for the print() method but we don't have a summary method). Then you could do something like:

s <- summary(nexml)
s$ntaxa

or maybe a named vector of lengths is better, e.g. s$n["taxa"], s$n["matrices"]?
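
The named-vector variant could be sketched roughly as follows. The mock_nexml class and its components are hypothetical placeholders (the real nexml class is S4 and no summary() method exists for it yet), so this only illustrates the shape of the return value.

```r
# Hypothetical mock with an S3 class; the real nexml class is S4.
nex <- structure(
  list(
    taxa = list("sp1", "sp2", "sp3"),
    matrices = list("m1"),
    trees = list()
  ),
  class = "mock_nexml"
)

summary.mock_nexml <- function(object, ...) {
  # lengths() returns one count per top-level component, keeping the names.
  list(n = lengths(unclass(object)))
}

s <- summary(nex)
s$n["taxa"]
s$n["matrices"]
```

Keeping all counts in one named vector means new components show up in the summary without adding new list elements or new functions.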

Regarding the data structure, I think this is partly because anything that uses the S4 slot notation gets confusing to users. My intuition is that the natural way to count taxa, characters, etc, would be to do a get_* method first and then inspect the object:

nexml <- read.nexml(system.file("examples", "comp_analysis.xml", package="RNeXML"))
char <- get_characters(nexml)
taxa <- get_taxa(nexml)

Because these are just data frames I think it will be obvious what to do (e.g. dim()), though maybe it will be confusing that ncharacters is the number of columns returned by get_characters(), while ntaxa is the number of rows from get_taxa()? This also makes it natural to do things like count discrete states, e.g.

library(dplyr)
char %>% count(`reef-dwelling`)

thoughts?

hlapp commented 5 years ago

My intuition is that the natural way to count taxa, characters, etc, would be to do a get_* method first and then inspect the object:

This works only if the object meets expectations (of get_characters() etc.). For example, if there are zero taxa or zero characters, these functions will error out. Also, they take unnecessary time if, for example, all I wanted to ask is how many characters a nexml object has, which can, and thus should, be answered in milliseconds.

hlapp commented 5 years ago

As for the summary() way of doing things, that's not too bad but still kind of roundabout from a user's perspective, because my brain has to convert a question of how many characters into what is the summary object followed by what does the $ncharacters element of the summary object say.

For example, to obtain the number of rows and columns in a matrix m, one doesn't say

s <- summary(m)
s$nrows

Instead, it's simply nrow(m).

hlapp commented 5 years ago

Or maybe just give nrow() and ncol() methods for the nexml class?
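
In base R, nrow() and ncol() are thin wrappers around dim(), so one way to sketch this idea is to supply a dim() method. The mock_nexml class and its taxa and chars fields below are hypothetical, and the real nexml class is S4, where a method would be registered with setMethod() instead.

```r
# Hypothetical mock class; field names are assumptions, not RNeXML's layout.
nex <- structure(
  list(taxa = c("sp1", "sp2", "sp3"), chars = c("habitat", "log_size")),
  class = "mock_nexml"
)

# Rows are taxa and columns are characters, mirroring a character matrix.
dim.mock_nexml <- function(x) c(length(x$taxa), length(x$chars))

# nrow() and ncol() pick up the method because they call dim() internally.
nrow(nex)
ncol(nex)
```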

cboettig commented 5 years ago

Excellent point, I think that throwing an error under those circumstances is a bug. I think get_ methods should just return an empty data.frame if there's zero elements. What do you think?
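
A rough sketch of the zero-element behaviour I have in mind. get_taxa_sketch() is a hypothetical helper over a plain list of labels, not the actual get_taxa() implementation, and the otu and label column names are assumptions.

```r
# Hypothetical helper over a plain list of taxon labels (not the RNeXML internals).
get_taxa_sketch <- function(labels) {
  # With zero labels, seq_along() and as.character() both yield length-0 vectors,
  # so the result is a 0-row data frame with the usual columns instead of an error.
  data.frame(
    otu = sprintf("ot%d", seq_along(labels)),
    label = as.character(labels),
    stringsAsFactors = FALSE
  )
}

empty <- get_taxa_sketch(list())
nrow(empty)
names(empty)
nrow(get_taxa_sketch(list("sp1", "sp2")))
```

Downstream code can then call nrow() or dim() on the result without a special case for the empty object.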

Regarding performance, yeah, I'm with you on that. An internal method like summary should definitely just do what is fastest, but a user should do whatever is most obvious. In general we're a long way from optimized performance.

Yeah, summary isn't ideal, though it builds on an existing method, just like your approach leverages the length() function. Base R already defines the functions nrow() and ncol(). I was debating whether defining something like a generic n() might also work:

taxa <- get_taxa(nexml)
n(taxa)

but I'm not sure that's ideal (it also assumes a call to get_*, though I suppose you could do n(nexml, "taxa")). I'm divided on the whole question between additional arguments or additional functions...

hlapp commented 5 years ago

a user should do whatever is most obvious

All I can say is that our users ask the question "how many characters are in this nexml object?" They're not asking about a summary, or about the character matrix in the hope that it might allow them to ask how many characters there are.

I guess another way of saying the same thing is I'm wondering whether we're making this more complicated than it is. The users I interact with are asking a simple question. The question could be answered trivially and in milliseconds if they knew the structure of the nexml object. Why not provide a function that answers their question, does it in the fastest way possible, is trivial to implement robustly, and makes their code robust to possible future changes of the nexml object structure?

cboettig commented 5 years ago

Yeah, I hear you. My intuition is that if you want to know how many characters, you get_characters() and look at it. In my experience when a package adds a new function for each task a user might want to do, you end up with loads of functions and the user simply doesn't know they all exist, so I'd prefer a compact namespace with some well-behaved functions.

The main strikes against this approach are that (a) get_characters() returns an error if there are no characters, and maybe (b) get_characters() is slower than it probably needs to be. It's good that this issue brought those problems to light, and it would probably be better to solve them than to work around them?

hlapp commented 5 years ago

(a) is indeed a problem, but it could arguably also be fixed by declaring that a non-empty character matrix is a precondition for successful return. Most code using the result will have to do something else anyway for an empty matrix, so it doesn't seem an onerous precondition to me, if the way to learn about the number of characters weren't to first obtain the matrix.

(b) is arguably worth attention, but there is no way on earth that returning a large character matrix is ever going to be only negligibly slower than obtaining the length of the list of the respective elements in the nexml object.

That your intuition for learning about the number of characters is to get the character matrix and then count its columns is perhaps influenced by your view of the nexml object as an assembly of different types of data (characters, otus, phylogenies, blocks of each of these, etc.). From the point of view of most RPhenoscape use cases, it is quite reasonable to equate a nexml object with a character matrix (because that's all they contain), and then it becomes most intuitive to ask the nexml object what the number of characters is.

cboettig commented 5 years ago

That your intuition for learning about the number of characters is to get the character matrix and then count its columns is perhaps influenced by your view of the nexml object as an assembly of different types of data

100% this. It does get me thinking though: if RPhenoscape is all about the character matrix, maybe making the user work around the full nexml object makes the codebase more cumbersome than it needs to be? E.g. would it be possible for RPhenoscape to "hide" the nexml complexity from the user, so that even though phenoscape data comes to the user as nexml, RPhenoscape basically presents it to the user as matrix data? I think that might make the RPhenoscape interface more intuitive for these kinds of operations. It sounds like these matrices can be pretty big, so my instinct is that RNeXML is pretty slow at parsing them.

I've recently been exploring workflows for R packages that get data like this and stream it into a data.frame, or if it's too big for memory, into a fast SQL database like MonetDBLite or duckdb over a DBI connection. Users can then interact with it using standard dplyr verbs. After a one-time import step this is fast and scales well.

In the case of RNeXML, I'm still exploring doing this via RDF instead: NeXML is translated into JSON-LD, so I can treat the whole thing as RDF. From R, I can dump that into an RDF store in memory or into a Virtuoso instance (I recently wrote an R interface for that, https://cran.r-project.org/web/packages/virtuoso/index.html). Getting a character matrix back from Virtuoso is then just a simple SPARQL query. This RDF route is probably less practical than the SQL approach, but I'm still wondering if it has any real-world use cases, particularly given that NeXML can be pretty semantically rich.

ropensci / RNeXML

General purpose accessor functions for nexml object inspection #232