ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML

Comments from Reviewer 2 #121

Closed cboettig closed 9 years ago

cboettig commented 9 years ago

Overall, I think this package is a fantastic resource. This is a greatly needed tool in biodiversity sciences and you have done an excellent job with the package, the manuscript, and the supplementary materials. I look forward to seeing this published and hope that it catches on in the community. I have a couple of criticisms that I would like to see addressed before I can recommend this for publication.

Major Comments on manuscript

I have two (related) critiques of this manuscript. The goal of the RNeXML project is to facilitate the use of NeXML standards in phylogenetics and comparative methods. While the package certainly goes a long way towards this goal, I feel that the manuscript could be better in this regard.

  1. I think you need a better hook in the opening (and the abstract). Why should your average empirical biologist care whether their files are in NeXML format, and why is it worth bothering with (i.e., why not just deposit .tre and .csv files)? I can imagine a few "killer apps", such as meta-analyses (both formal and informal) and populating databases (such that it makes it easier to load trees into OpenTree), but these aren't really described. While the manuscript describes why NeXML is superior to NEXUS, I don't feel you provide sufficient examples or motivation as to why exactly this is important. Perhaps this is not your role here -- you are not developing the NeXML standards in this paper -- but if the goal is facilitating adoption among empiricists, I think you need to be better marketers.
  2. The manuscript is jargon-laced and in some places, unnecessarily so. Even though I am a programming dork (or at least more so than the average biologist), I did not understand what you were referring to in some places. I think this is another barrier that will hinder adoption -- I am afraid that an empiricist will take one look at this and conclude it is not for them. Here is a list of some examples (there are others) where I think the terminology is confusing; in some places, I think the jargon could be cut without loss, while in others I think some explanation in plain language would go a long way.

line 10: forward-compatible

line 17: normative

line 60: validated

line 64: computable semantics

line 67: axiomated ontologies

line 202: subject-predicate-object triples

line 205: Dublin core

line 210: vocabularies

line 240: SKOS vocabulary

line 248: SPARQL

Minor comments on manuscript

line 12: you aim to provide, not the software

line 83: where is the branch length information stored (in topology or metadata)?

line 44: an extra sentence here would be useful explaining why the needs of interoperability are greater now than they have been in the past

line 141: I would appreciate it if you would please use Pennell et al. 2014 Bioinformatics as the geiger citation

line 274: PDF weirdness

Comments on code

  1. In your example of writing nexml files, you use the geospiza dataset from geiger. In this dataset, the tip labels in the tree and the dataset do not perfectly match
```r
data(geospiza)
geiger::treedata(geospiza$phy, geospiza$dat)
```

Is this a feature or a bug of nexml? Either way, I think it is worth pointing out.

  1. I tried running the function nexml_validate while offline and received an error message (even though I had the XML package installed and loaded)
```
nexml_validate("geospiza.xml")
Error in function (type, msg, asError = TRUE) :
  Could not resolve host: www.nexml.org
```
  1. I love the idea of being able to programmatically archive data (in figshare or wherever else) but am wondering whether there is a potential problem in that people may inadvertently archive the same data over and over again. If, for example, one sets up the workflow to be completely reproducible, will the function add a new version of the dataset every time the script is run? Are there safeguards built in for this?

But as I said above, overall I am very enthusiastic about this project. Great work. Matt Pennell

rvosa commented 9 years ago

Are we still in the process of formulating a response to this? Sorry, I've been traveling, not sure if I need to provide input here.

cboettig commented 9 years ago

nope, we're all set & waiting on the editor's decision. I only just got permission to publicly post the replies we made to reviewer 2 and the associate editor, so I can have those up soon. Sharing the reviews that the reviewers agreed to share has been a very new thing for MEE, so this process has been slightly bureaucratic.

On Wed, May 20, 2015 at 12:34 PM Rutger Vos notifications@github.com wrote:

Are we still in the process of formulating a response to this? Sorry, I've been traveling, not sure if I need to provide input here.

— Reply to this email directly or view it on GitHub https://github.com/ropensci/RNeXML/issues/121#issuecomment-104008196.

rvosa commented 9 years ago

Ah, ok - I thought that's where we were in the process. Just making sure.

cboettig commented 9 years ago

I have two (related) critiques of this manuscript. The goal of the RNeXML project is to facilitate the use of NeXML standards in phylogenetics and comparative methods. While the package certainly goes a long way towards this goal, I feel that the manuscript could be better in this regard.

  1. I think you need a better hook in the opening (and the abstract). Why should your average empirical biologist care whether their files are in NeXML format, and why is it worth bothering with (i.e., why not just deposit .tre and .csv files)? I can imagine a few "killer apps", such as meta-analyses (both formal and informal) and populating databases (such that it makes it easier to load trees into OpenTree), but these aren't really described. While the manuscript describes why NeXML is superior to NEXUS, I don't feel you provide sufficient examples or motivation as to why exactly this is important. Perhaps this is not your role here -- you are not developing the NeXML standards in this paper -- but if the goal is facilitating adoption among empiricists, I think you need to be better marketers.

This is an excellent point also raised by the Associate Editor. We have tried to detail some of this potential in the introduction, but ultimately feel that this task requires a review of the biodiversity informatics literature that is beyond the scope (or word count!) of this applications note, and instead must point the readers to some of the excellent and accessible reviews of these concepts and possibilities elsewhere, e.g., the Parr et al. paper in TREE.

  1. The manuscript is jargon-laced and in some places, unnecessarily so. Even though I am a programming dork (or at least more so than the average biologist), I did not understand what you were referring to in some places. I think this is another barrier that will hinder adoption -- I am afraid that an empiricist will take one look at this and conclude it is not for them. Here is a list of some examples (there are others) where I think the terminology is confusing; in some places, I think the jargon could be cut without loss, while in others I think some explanation in plain language would go a long way.

Thanks, the list below is very helpful. In addition, we have cut back the jargon considerably throughout the introduction and in some of the most dense sections such as the advanced metadata use.

line 10: forward-compatible

Replaced with 'extensible'

line 17: normative

Dropped. ("XML schema" is sufficient).

line 60: validated

We've clarified the definition that follows to read:

"i.e., it can be verified whether a file precisely follows this grammar, and it is therefore predictable whether the file can be read (parsed) without errors by software that uses the NeXML grammar (e.g., RNeXML)"

line 64: computable semantics

Now defined in the sentence following as:

"it is designed for expressing metadata such that machines can understand their meaning and make inferences from it. For example, OTUs in a tree or character matrix for frog species can be linked to concepts in a formally defined hierarchy of taxonomic concepts such as the Vertebrate Taxonomy Ontology [@Midford_2014], which enables a machine to infer that a query for amphibia is to include the frog data in what is returned. (For a broader discussion of the value of such capabilities for evolutionary and biodiversity science we refer the reader to @Parr2011.)"

line 67: axiomated ontologies

removed. The term "ontology" is re-introduced above, now with a definition, though the distinction between vocabulary and ontology is glossed over, and details are left to the citation to provide.

line 202: subject-predicate-object triples

removed

line 205: Dublin core

removed

line 210: vocabularies

defined in introduction, see comment re line 64.

line 240: SKOS vocabulary

defined and linked

line 248: SPARQL

defined and linked

Minor comments on manuscript

line 12: you aim to provide, not the software

fixed

line 83: where is the branch length information stored (in topology or metadata)?

Along with the topology. Thanks, this has been made explicit now.

line 44: an extra sentence here would be useful explaining why the needs of interoperability are greater now than they have been in the past

Less diversity in tools meant there was less to inter-operate between. MrBayes decided it needed something that couldn't be represented in the original NEXUS, so it defined a new convention for it. So did PAUP, and so on, and so a MrBayes NEXUS file isn't the same as a PAUP NEXUS file. While word count constraints don't permit a proper treatment, the Vos et al. paper cited there, which introduces the NeXML format, provides a thorough discussion of this point.

line 141: I would appreciate it if you would please use Pennell et al. 2014 Bioinformatics as the geiger citation

Done, thanks. (You may want to update the citation information in the geiger package as well; see citation("geiger").)

line 274: PDF weirdness

Fixed.

Comments on code

  1. In your example of writing nexml files, you use the geospiza dataset from geiger. In this dataset, the tip labels in the tree and the dataset do not perfectly match
```r
data(geospiza)
geiger::treedata(geospiza$phy, geospiza$dat)
```

Is this a feature or a bug of nexml? Either way, I think it is worth pointing out.

Yes, this is intentional and we have added a short clarification to the text:

"Note that the NeXML format is well-suited for incomplete data: for instance, here it does not assume the character matrix has data for every tip in the tree."

The function simply generates a tree object with all available data. As you observe, a user can of course use geiger::treedata if they need to drop incomplete taxa from either the tree or traits list, but that is not the role of NeXML, which seeks to represent whatever data is available. Note also that because NeXML is easily extended, you could in fact dump many character matrices into it, each perhaps missing certain taxa, and then use RNeXML to help extract as complete a character matrix as possible from the assembly.
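As a sketch of this multi-matrix workflow (hedged: exact argument names for `add_trees()`/`add_characters()` are assumptions based on RNeXML's documented helpers, and the file name is illustrative; requires the RNeXML and geiger packages):

```r
library(RNeXML)
library(geiger)
data(geospiza)

# One NeXML object can carry the tree plus several character blocks,
# each potentially covering a different subset of the taxa.
nex <- add_trees(geospiza$phy)
nex <- add_characters(geospiza$dat[, 1:2], nexml = nex)
nex <- add_characters(geospiza$dat[, 3:5], nexml = nex)
nexml_write(nex, file = "geospiza_multi.xml")

# Reading back, get_characters() assembles as complete a trait matrix
# as the stored blocks allow; taxa absent from a block come back as NA.
traits <- get_characters(nexml_read("geospiza_multi.xml"))
```

A user who wants only the fully overlapping subset can then drop the NA rows, or pass the result through geiger::treedata as above.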

  1. I tried running the function nexml_validate while offline and received an error message (even though I had the XML package installed and loaded)
```
nexml_validate("geospiza.xml")
Error in function (type, msg, asError = TRUE) :
  Could not resolve host: www.nexml.org
```

We have since patched the R package so that it will use a fallback method to validate NeXML if the online validator is not available. This function will now issue a warning if it cannot connect to the online validator before using the fallback method.
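The fallback behavior can be approximated as follows (a sketch of the idea, not the package's actual internals; the local schema path and helper name are hypothetical, and local validation uses the XML package's schema functions):

```r
library(XML)

# Try the online validator first; if the host is unreachable, warn and
# validate locally against a cached copy of the NeXML schema.
validate_with_fallback <- function(file, schema = "nexml.xsd") {
  result <- tryCatch(RNeXML::nexml_validate(file),
                     error = function(e) NULL)
  if (is.null(result)) {
    warning("Online validator unreachable; falling back to local schema")
    ok <- xmlSchemaValidate(xmlSchemaParse(schema), xmlParse(file))
    return(ok$status == 0)  # status 0 indicates the document validated
  }
  result
}
```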

  1. I love the idea of being able to programmatically archive data (in figshare or wherever else) but am wondering whether there is a potential problem in that people may inadvertently archive the same data over and over again. If, for example, one sets up the workflow to be completely reproducible, will the function add a new version of the dataset every time the script is run? Are there safeguards built in for this?

Good point -- we have updated this example to publish only a private (draft) version to figshare. This reserves the identifier and facilitates collaboration, while a user will still need to log in online and flip the switch on figshare to make the data public. These drafts can also be deleted by the user either via the online interface or the rfigshare R package.
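A minimal sketch of this draft-only workflow with the rfigshare package (hedged: the title/description here are illustrative, and the exact `fs_create()`/`fs_upload()` argument names are assumptions from rfigshare's documentation; making the record public remains a deliberate manual step):

```r
library(rfigshare)

fs_auth()  # authenticate once per session

# Creates a *private draft* article, reserving an identifier without
# publishing anything; re-running the script makes new drafts, not
# new public versions.
id <- fs_create(title = "Geospiza data in NeXML",
                description = "Tree and trait data, NeXML format",
                type = "dataset")

fs_upload(id, "geospiza.xml")  # attach the NeXML file to the draft

# The draft stays private until the user flips the switch on the
# figshare website; unwanted drafts can be deleted there or via rfigshare.
```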