ropensci / software-review

rOpenSci Software Peer Review.
291 stars 104 forks source link

genbankr #47

Closed gmbecker closed 8 years ago

gmbecker commented 8 years ago
Package: genbankr
Version: 1.1.3
Title: Parsing GenBank files into semantically useful objects
Description: Reads Genbank files.
Authors@R: as.person(c(
      "Gabriel Becker <becker.gabriel@gene.com> [aut, cre]",
      "Michael Lawrence <lawrence.michael@gene.com> [aut]"))
Copyright: Genentech, Inc.
Depends:
    methods
Imports:
    BiocGenerics,
    IRanges,
    GenomicRanges(>= 1.23.24),
    GenomicFeatures,
    Biostrings,
    VariantAnnotation,
    rtracklayer,
    S4Vectors,
    GenomeInfoDb,
    Biobase
Suggests:
    RUnit,
    rentrez,
    knitr
Maintainer: Gabriel Becker <becker.gabriel@gene.com>
VignetteBuilder: knitr
License: Artistic-2.0
RoxygenNote: 5.0.1.9000
biocViews: Infrastructure, DataImport
NeedsCompilation: no

This package will not go on CRAN, as it is and will continue to be published as a Bioconductor package (with ROpenSci co-branding, pending success of this submission).

sckott commented 8 years ago

Thanks for the submission @gmbecker

sckott commented 8 years ago

Seeking reviewers

sckott commented 8 years ago

Reviewers: @dwinter Due date: 2016-06-15

sckott commented 8 years ago

@dwinter Due date: 2016-06-15 - hey there, it's been 21 days, please get your review in soon, thanks :smiley_cat: (ropensci-bot)

dwinter commented 8 years ago

Hi @sckott / bot / @gmbecker -- sorry about the delay, I've been working on this but it will take a little longer. Will aim for this week though.

sckott commented 8 years ago

thanks!

dwinter commented 8 years ago

This is a really nice package, which fills an important gap in the tools available to work on sequence data in R. Genbank is a very large sequence database, and genbank-formatted records provide rich information about the sequences it contains. This package provides tools to fetch and parse genbank records, and a set of functions to extract information from the parsed records. Importantly, the paclage integrates well with Bioconductor allowing the information within sequence files to be represented as flexible and useful objects like those provided by GenomicRanges.

Clashes between ropensci and Bioconductor styles

One of the difficulties in reviewing the package is that genbankr is part of the Bioconductor project, and therefore meets different coding standards and practices than those suggested for ropensci. Obviously it's up to the ropensci leaders and @gmbecker to decide how this can be negotiated, I will just list those I find here.

Readme

This is perhaps another example of the different between each project. In ropensci the github repo is a "landing page" for the package, whereas as Bioconductor presents much of the same information here. Even so, the current README for genbankr source code could be improved by adding

I have to say I also find the the installation instructions a little daunting. Perhaps start by emphasizing that almost all users should use the current Bioconductor release, which can be installed easily with biocLite. Can the right version of biocLite be accessed via library(BiocInstaller)? If so, these seems preferable to telling users to source a URL to install packages.

Documentation

I found package-level function to be very good, both clear and complete. Though perhaps not technically documentation, the print function for genbank records is also well thought-out, providing the user wih a clear indication of what each record contains.

The package vignette is also clear and does a good job of demonstrating basic usages and integrations for the package. I think it would also be very helpful to show how the features in a genbank records can be used in an analysis.

For instance, an example demonstrating the best way to use features in a given file and getSeq to retriece only the CDS from an mRNA sequence would document a possible use-case and demonstrate how integration with GenomicRanges can be helpful.

For the example of reading data into and R sessions, I think it's better to store the data in inst/exdata and use system.file to specify the path in vignette. That way a user can copy and paste the example and see it work. (Hadley's book has a section on this)

I appreciate the vignette section of the "limitations" fo the package. The lack of standards as to exactly what goes into a genbank file means such limitations are inevitable, but this clear statement about how those problems are dealt with is helpful.

The code

I have very little to comment on the code itself -- it's well written, clear and appropriately modular. I have not stress tested the package on very large files, but performance on typically-sized files has been good. I don't think there are an missing functions -- the package does a good job of "doing job well", so I like the focus of the current packages is good.

@aahowel3 and I have been using the package in research this week, and we found a couple of records that break the parsing function. These are filed under issue gmbecker/genbankr#1

Everything else

It would be good to have a code of conduct -- even if you envisage mostly working on the project yourself it's useful to have an indication that that would-be collaborators can expect to be treated well.

As mentioned above, it would be useful to have a clear indication of where bugs and/or questions about the package should be sent.

I think the the VignetteBuilder entry in the DESCRIPTION should be changed to rmarkdown, and knitr should be replaced by rmarkdown in the Suggests section too. With these changes I can compile the vignette, run R CMD check and have all tests pass.

Finally, the reviewer recommendations ask for an estimate of the time taken to review the package. I guess I spend about 2hrs using the package in research (i.e. doing stuff I would otherwise be working on, probably less efficiently :smile:), another 2 closely reviewing the code, design and documentation of the package and one hour to prepare this comment.

gmbecker commented 8 years ago

@dwinter

Thank you for the the thorough review. I have addressed and closed the issues you kindly filed in my genbankr repository. Please find my initial responses to the other points of your review below:

Style

Unfortunately, in many cases my use of camelCase are not cases of me adopting the Bioconductor style, but rather my package providing methods to existing Bioconductor generics (e.g., getSeq, cdsBy, etc). As such I'm unable to change the function naming style of my package. In the interest of full disclosure, I would have chosen to use camelCase rather than snake_case anyway, as it is my preference, but that is moot as consistency is paramount and my hands are tied regarding the names of numerous methods my package provides.

Readme

The BiocInstaller package is not itself available on CRAN. The mechanism for initially installing it is the sourcing of the file you mention. Once that is installed, you are correct that library(BiocInstaller) is sufficient, and the sourcing the file is no longer needed. I attempted to communicate that via the readme but I can revisit that if it was unclear.

With respect to badges, these are avaiable at genbankr's Bioconductor splashpage here: https://bioconductor.org/packages/release/bioc/html/genbankr.html. I figured that was sufficient. It looks like I might be able to jury rig the Bioc badges to show up on the github page. I am somewhat loath to do that, but I can look into it if it is a sticking point.

I will add some mention of the Bioconductor help site to the Readme, I agree that that isn't super clear now.

Documentation

I have changed the vignette so that it uses system file to retrieve the example genbank file (this is in devel only, not backported to the current release).

I don't really have the applied subject material expertise to determine what a compelling, more realistic example would be, so I have not added one now. I will talk to people I know are using it and try to work one in for the next release. I will add the mini-applications the reviewer mentions as well for that release.

Code

Thank you for your kind words. As mentioned, I have fixed the two bugs you identified in your issue. Please bring any future problems you find to my attention as well, and the turn-around time should be shorter when I'm not in the middle of a week of conferences.

Everything else

I would have thought that once genbankr is accepted into ROpenSci, the ROpenSci code of conduct would apply. Is that not the case? A package having a code of conduct strikes me as a bit odd (it's not really a community...) but I am of course for all the things that codes of conduct protect and enforce, and would do so within my package's development environment regardless.

@sckott (or @noamross ?) Please let me know if these responses are acceptable, and if not where more work on my part is required.

sckott commented 8 years ago
gmbecker commented 8 years ago

I have modified the README.md file to contain a link at the top to the official Bioconductor splashpage for genbankr.

I've also added CONDUCT.md.

PRs to this github repository should actually be fine, so I haven't made any changes there.

Please let me know if there's anything else outstanding that I need to address.

sckott commented 8 years ago

as mentioned above:

Can those be added to readme, or some reason why not?

p.s. @dwinter - thanks for the review hrs estimate

gmbecker commented 8 years ago

Hmm, I missed this message somehow. My bad.

I'd prefer any example usage code that has any meat to it live somewhere where it's compiled (and thus automatically tested), e.g. the vignette or examples in the documentation. I suppose I can add a non-run example call to readGenBank to the readme. Going beyond that seems like it is approaching a vignette that never gets built.

I'll add the link to bioc support, I had meant to do that.

gmbecker commented 8 years ago

Please see the latest commit of README.md, which now includes very basic usage, a link to the vignette for more complete usage examples, and a section on getting help and filing bugs and feature requests.

sckott commented 8 years ago

@gmbecker Looks good to me

approved