ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML
Other
13 stars 9 forks source link

Comments from the Associate Editor (posted with permission from MEE & editor) #120

Closed cboettig closed 9 years ago

cboettig commented 9 years ago

I have received reviews from two reviewers; both are overall very positive. I share their enthusiasm and think that this has the potential to be an important and widely used package The concerns that the reviewers express are very similar, however, and I would urge the authors to consider them carefully when making their revisions.

The article reads as if it is pitched to someone who has mostly bought into concepts of strictness that XML encourages, knows what an ontology is and is excited about things like RDF. While I'm sure these things are all potentially very exciting things that can shape how people work, I honestly don't think that they are on the radar of most biologists. It seems that metadata quality is enhanced by the formality and standardisation on one hand, and by the number of people using it on the other. You can have the most beautiful annotations in the world but if it's difficult to users to use then they won't use it and the advantage of standardisation are lost. Getting people excited to use them is probably more important here than speaking to the small minority who are already converts.

A bit more of a focus on how the file format and R package can help users, rather than statements that you can get more things into it would be useful. The section on metadata (p8 - 13) is a prime example of this. It's clear that this is something that is important to the authors, but it's framed starting from a set of arbitrary looking commands, rather than from a problem that is not currently solvable.

Similarly flagging which bits are for package developers (see below) might be helpful. The strength of the ape package is that people have built other packages on top of it (~100 packages on CRAN). Features pitched at package developers would necessarily be more technical in nature.

While the horse has well and truly bolted, I feel that S4 is the wrong technology for this sort of thing. S4 classes will be a barrier for people to contribute to the package (personally I wish they could be removed from the language). Nothing can be done about this, but that ape remains the primary way people interact with trees in R speaks to the benefits of ease of use and extension over formality for the R user and developer base at large (in contrast with ape, I see 4 declared dependencies on phylobase on CRAN).

Similarly, I wish the abstractions involved with XML were not so leaky. Does the average user need to understand about Dublin Core and XML namespaces to be able to append metadata to their phylogenies? Or do the authors imagine other user-friendly tools being built on top of RNeXML, with this package being some sort of solid low-level commands? If one needs to deal with XML to read/write files that will be a serious barrier to use; most biologists should not need to care how their files are being loaded/saved.

Despite the focus on formal validity (ostensibly the benefit of XML), it's somehow disturbing to me that there are additional checks not captured by the schema (l. 150). I didn't get a sense of what these additional checks are.

The package as explained in the title is for reading/writing XML files of a particular format. It's not clear why this takes >2700 lines of code (excluding comments). Perhaps this is the necessary verbosity of dealing with S4 and XML, but I wonder if the authors are underselling some of the functionality of the package. If the latter is true, that is their call of course.

l. 183 "a time-stamp of when a tree was produced": is this always a good idea? I presume "produced" here is when the tree is written to file. So reading and immediately writing a tree would update the timestamp? This would mean that it's impossible to verify the results of an analysis are unchanged by comparing file hashes even if the contents are substantially unchanged.

l. 196: for extending the syntax - will this not lead to downstream incompatibilities as with the NEXUS format? This section will be quite opaque to most readers. "The RNeXML interface described above for built-in metadata annotations perform this cast automatically behind the scenes, including mapping metadata attributes to terms in requisite common vocabularies (such as Dublin Core for "title", "creator", etc.) or ontologies".

l. 272: I'm honestly not sure what this is meant to express. I can't imagine that typing something like this is going to inspire people to get excited about using the package. Presumably nobody would actually write such verbose commands, but package authors (e.g. Revell if using RNeXML storing data for reconstructions) would write functions that generate calls like this?

cboettig commented 9 years ago

(Our replies to the associate editor, also posted with permission now)


A bit more of a focus on how the file format and R package can help users, rather than statements that you can get more things into it would be useful. The section on metadata (p8 - 13) is a prime example of this. It's clear that this is something that is important to the authors, but it's framed starting from a set of arbitrary looking commands, rather than from a problem that is not currently solvable.

This section has been substantially revised to be more accessible and problem focused. We have tried to strike a balance between documenting the technical aspects of the file format and R package while also addressing the bigger picture of appealing to users.

Similarly flagging which bits are for package developers (see below) might be helpful. The strength of the ape package is that people have built other packages on top of it (~100 packages on CRAN). Features pitched at package developers would necessarily be more technical in nature.

We have at several places highlighted sections that should be considered more advanced. We have tried to present a continuum of complexity rather than strictly discriminating between users and developers; just as the R language itself blurs this distinction much more than previous, easy-to-use GUI driven software. (No doubt ape owes some of its success in accumulating reverse dependencies to the very fact that it did not much separate the user from the developer).

We bear in mind too that some readers may come from a biodiversity informatics background instead of a background of using ape, and for whom a transparent expression of formal semantics concepts that will be foreign to the typical R user will be useful signposts. Having co-authors without any prior experience in R has helped us substantially in this regard.

While the horse has well and truly bolted, I feel that S4 is the wrong technology for this sort of thing. S4 classes will be a barrier for people to contribute to the package (personally I wish they could be removed from the language).

We have designed RNeXML so that users can take advantage of its functionality and even develop extensions without knowledge of S4 using the function API. We also note that the XML R package is closely integrated with S4, automating many things that would otherwise be very tedious and error prone. A developer seeking to do things that cannot be done already be done with the user-facing RNeXML functions would have to confront the conversion of R structures to XML as we did, and thus would also benefit more than they lose by being able to take advantage of the S4-specific tools in the XML package.

Nothing can be done about this, but that ape remains the primary way people interact with trees in R speaks to the benefits of ease of use and extension over formality for the R user and developer base at large (in contrast with ape, I see 4 declared dependencies on phylobase on CRAN).

We agree and are not trying to displace the ape structure. The RNeXML package works seamlessly to both generate ape::phylo trees from NeXML files and serialize ape::phylo trees as NeXML.

Similarly, I wish the abstractions involved with XML were not so leaky. Does the average user need to understand about Dublin Core and XML namespaces to be able to append metadata to their phylogenies? Or do the authors imagine other user-friendly tools being built on top of RNeXML, with this package being some sort of solid low-level commands? If one needs to deal with XML to read/write files that will be a serious barrier to use; most biologists should not need to care how their files are being loaded/saved.

Excellent points, much of this jargon has now been removed (e.g. 'Dublin Core,' and see further comments below). Where a concept is essential (such as 'namespace') we have explained the definition and the purpose more carefully. We present examples of reading and writing that require no familiarity with XML. As noted above, we still introduce these topics in the more advanced sections, as they may be of interest to the biodiversity informatics community or the R user seeking to take advantage of those concepts.

Despite the focus on formal validity (ostensibly the benefit of XML), it's somehow disturbing to me that there are additional checks not captured by the schema (l. 150). I didn't get a sense of what these additional checks are.

These are checks constraints on nested references to identifiers (as detailed in the citation). An example of such a constraint is as follows: "a node in this trees block may only reference a taxon inside the taxa block referenced by the trees block that contains the node". You simply can't say that in XML schema (it's hard enough to say it in English :-)), so you have to check for this kind of referential integrity in the application. It is perhaps not clear why this is disturbing; for example, it is not possible to define such constraints in relational databases such as SQL, where similar checks are often done at the application level, as we do here.

The package as explained in the title is for reading/writing XML files of a particular format. It's not clear why this takes >2700 lines of code (excluding comments). Perhaps this is the necessary verbosity of dealing with S4 and XML, but I wonder if the authors are underselling some of the functionality of the package. If the latter is true, that is their call of course.

Additional functionality is covered in the four vignettes, covering topics of more special interest. Specifying formal class structures and methods does require a bit of verbage.

l. 183 "a time-stamp of when a tree was produced": is this always a good idea? I presume "produced" here is when the tree is written to file. So reading and immediately writing a tree would update the timestamp? This would mean that it's impossible to verify the results of an analysis are unchanged by comparing file hashes even if the contents are substantially unchanged.

This is the default behavior and can be configured. We have edited the text to address this now.

l. 196: for extending the syntax - will this not lead to downstream incompatibilities as with the NEXUS format?

Correct, as we state later in this section, the result is still valid NeXML and can still be parsed by any of the many parsers available for NeXML (available in ruby, python, perl, java, and now R).

This section will be quite opaque to most readers. "The RNeXML interface described above for built-in metadata annotations perform this cast automatically behind the scenes, including mapping metadata attributes to terms in requisite common vocabularies (such as Dublin Core for "title", "creator", etc.) or ontologies".

We have revised the introduction of this section accordingly:

"The RNeXML interface described above for built-in metadata allows users to create precise and semantically rich annotations without confronting any of the complexity of namespaces and ontologies. Nevertheless, advanced users may desire the explicit control over these semantic tools to take full advantage of the flexibility and extensibility of the NeXML specification [@Vos2012; @Parr2011]. In this section we detail how to accomplish these more complex uses in RNeXML."

As we have tried to highlight in the revisions, you are quite correct that this section will probably be opaque to most readers who are not already familiar with the biodiversity informatics literature. Nonetheless we wanted to illustrate for more advanced users or readers familiar with that literature how it could be applied in RNeXML. We do leave the most complex examples, such as SPARQL queries that illustrate this machine reasoning in action, to the supplementary materials.

l. 272: I'm honestly not sure what this is meant to express. I can't imagine that typing something like this is going to inspire people to get excited about using the package. Presumably nobody would actually write such verbose commands, but package authors (e.g. Revell if using RNeXML storing data for reconstructions) would write functions that generate calls like this?

This example has been removed from the text. It represented a simmap tree shown in Revell 2012. Correct, no one would write such code by hand, as we had already stated. Rather than present this example and then introduce the helper functions that illustrate how a developer would implement this, we now jump directly to discussing the nexml_to_simmap and simmap_to_nexml, which convert between this format and Revell's extended ape::phylo simmap format. This section has also been shortened as more details are already available in the simmap vignette.