ropensci / software-review

rOpenSci Software Peer Review.

drake (R package) #156

Closed wlandau-lilly closed 6 years ago

wlandau-lilly commented 6 years ago

Summary

The drake package is an R-focused pipeline toolkit. It reproducibly brings results up to date and automatically arranges computations into successive parallelizable stages. It has a Tidyverse-friendly front-end, powerful interactive visuals, and a vast arsenal of multicore and distributed computing backends.
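
To make "successive parallelizable stages" concrete, here is a minimal base R sketch. It is not drake's implementation; the `deps` list and the `build_stages()` helper are hypothetical names introduced only for illustration. Targets that land in the same stage do not depend on one another, so in principle they could be built in parallel.

```r
# Hypothetical dependency list: each target names the targets it needs.
deps <- list(
  small = character(0),
  large = character(0),
  regression1_small = "small",
  regression1_large = "large",
  report = c("regression1_small", "regression1_large")
)

# Group targets into stages: a target is ready once all of its
# dependencies were built in an earlier stage.
build_stages <- function(deps) {
  stages <- list()
  done <- character(0)
  while (length(done) < length(deps)) {
    pending <- setdiff(names(deps), done)
    ready <- pending[vapply(
      pending,
      function(target) all(deps[[target]] %in% done),
      logical(1)
    )]
    if (length(ready) == 0) {
      stop("circular dependency among: ", paste(pending, collapse = ", "))
    }
    stages[[length(stages) + 1]] <- ready
    done <- c(done, ready)
  }
  stages
}

stages <- build_stages(deps)
print(stages)
```

With this input the sketch yields three stages: the two simulated datasets, then the two regressions, then the report.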

Package: drake
Title: Data Frames in R for Make
Version: 4.4.1.9000
Authors@R: c(
  person(
    family = "Landau",
    given = c("William", "Michael"),
    email = "will.landau@lilly.com",
    role = c("aut", "cre")),
  person(
    family = "Axthelm",
    given = "Alex",
    email = "aaxthelm@che.IN.gov",
    role = "ctb"),
  person(
    family = "Clarkberg",
    given = "Jasper",
    email = "jasper@clarkberg.org",
    role = "ctb"),
  person(
    family = "Eli Lilly and Company",
    role = "cph"))
Description: A solution for reproducible code and 
  high-performance computing.
License: GPL-3
Depends:
  R (>= 3.2.0)
Imports:
  codetools,
  crayon,
  eply,
  evaluate,
  digest,
  formatR,
  future,
  grDevices,
  igraph,
  knitr,
  lubridate,
  magrittr,
  parallel,
  plyr,
  R.utils,
  rprojroot,
  stats,
  storr (>= 1.1.0),
  stringi,
  stringr,
  testthat,
  utils,
  visNetwork,
  withr
Suggests: 
  abind,
  DBI,
  future.batchtools,
  MASS,
  methods,
  RSQLite,
  rmarkdown,
  tibble
VignetteBuilder: knitr
URL: https://github.com/wlandau-lilly/drake
BugReports: https://github.com/wlandau-lilly/drake/issues
RoxygenNote: 6.0.1

Similar work

Remake

Drake overlaps with its direct predecessor, remake. In fact, drake owes its core ideas to remake and @richfitz, and explicit acknowledgements are in the documentation. However, drake surpasses remake in several important ways, including but not limited to the following.

  1. High-performance computing. Remake has no native parallel computing support. Drake, on the other hand, has a vast arsenal of parallel computing options, from local multicore computing to serious distributed computing. Thanks to future, future.batchtools, and batchtools, it is straightforward to configure a drake project for most popular job schedulers, such as SLURM, TORQUE, and the Sun/Univa Grid Engine, as well as systems contained in Docker images.
  2. A friendly interface. In remake, the user must manually write a YAML configuration file to arrange the steps of a workflow. In drake, this configuration is based on data frames that built-in wildcard templating functionality easily generates at scale.
  3. Thorough documentation. Drake contains eight vignettes, a comprehensive README, examples in the help files of user-side functions, and accessible example code that users can write with drake::example_drake().
  4. Active maintenance. Drake is actively developed and maintained, and issues are usually solved promptly.
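
As a rough illustration of point 2, the sketch below expands a one-row template into a larger plan by substituting a wildcard. It uses base R only and does not call drake's own templating functions; `expand_wildcard()`, the `DATASET` wildcard, and the `reg()` command are hypothetical names chosen for this example. The only thing taken from drake is the idea that a plan is a data frame with `target` and `command` columns.

```r
# One-row template: DATASET is a placeholder to be substituted.
template <- data.frame(
  target = "regression_DATASET",
  command = "reg(DATASET)",
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the wildcard with each value in turn
# and stack the resulting rows into one plan.
expand_wildcard <- function(plan, wildcard, values) {
  rows <- lapply(values, function(value) {
    data.frame(
      target = gsub(wildcard, value, plan$target, fixed = TRUE),
      command = gsub(wildcard, value, plan$command, fixed = TRUE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

plan <- expand_wildcard(template, "DATASET", c("small", "large"))
print(plan)
```

The same call scales to any number of values, which is the advantage over writing each step by hand in a configuration file.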

Factual's drake

Factual's drake is similar in concept, but the development effort is completely unrelated to the R package of the same name.

Other pipeline toolkits

There are many other successful pipeline toolkits, and the drake package distinguishes itself with its R-focused approach, Tidyverse-friendly interface, and parallel computing flexibility.

Requirements

Confirm each of the following by checking the box. This package:

Publication options

I plan to submit to JOSS in the future, but the manuscript is not currently ready.

Detail

sckott commented 6 years ago

Thanks for your submission @wlandau-lilly ! Editors are discussing now

wlandau-lilly commented 6 years ago

Thanks, @sckott.

Edit: I think @richfitz would be an excellent reviewer due to the similarity of remake, but I understand if a potential conflict of interest precludes his participation.

wlandau-lilly commented 6 years ago

I just ran goodpractice::gp() on wlandau-lilly/drake@ee475f103f514905d3ed9c3a5dd7b2cacbc46021:

It is good practice to

  x avoid the attach() and detach() functions, they are
    fragile and code that uses them will probably break sooner than
    later.

    tests/testthat/test-Makefile.R:165:3
    tests/testthat/test-Makefile.R:167:3
    tests/testthat/test-Makefile.R:192:3
    tests/testthat/test-Makefile.R:194:3

  x avoid calling setwd(), it changes the global environment.
    If you need it, consider using on.exit() to restore the working
    directory.

    tests/testthat/test-cache.R:43:3
    tests/testthat/test-cache.R:153:3
    tests/testthat/test-cache.R:216:3
    tests/testthat/test-cache.R:225:3
    tests/testthat/test-cache.R:241:3

I can explain these idiosyncrasies.

Calls to detach() in tests/testthat/test-Makefile.R

Drake promises to load the user's packages, which is especially important for distributed computing across multiple nodes on a cluster. To test, I occasionally need to call detach() to remove packages from search(). Unfortunately, unloadNamespace() does not have the desired effect.
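
For readers unfamiliar with the distinction, this small base R sketch shows what detach() does to search(). It is not taken from drake's test suite; `splines` is used only as a convenient stand-in for a user's package because it ships with R and is not attached by default.

```r
# Attach a package that ships with R but is not on the default search path.
library(splines)
attached_before <- "package:splines" %in% search()

# detach() removes the package from the search path...
detach("package:splines", character.only = TRUE)
attached_after <- "package:splines" %in% search()

# ...but, without unload = TRUE, its namespace stays loaded.
still_loaded <- "splines" %in% loadedNamespaces()

print(c(before = attached_before, after = attached_after, loaded = still_loaded))
```

A test that checks whether packages get re-attached can therefore start from a state where the package is absent from search() even though its namespace is still in memory.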

Calls to setwd() in tests/testthat/test-cache.R

By default, drake searches through parent directories to find the current drake project's storr cache. To test, I need to change directories. But rest assured: every test is wrapped in a call to test_with_dir(), which uses withr::with_dir() to ensure that the original working directory is restored. Nested calls to withr::with_dir() give me errors.
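
The general pattern can be sketched in base R as follows. This is not drake's test_with_dir(), which the comment above says relies on withr::with_dir(); `with_scratch_dir()` is a hypothetical helper that uses on.exit(), as goodpractice suggests, to guarantee that the working directory is restored even if the test code fails.

```r
# Hypothetical helper: run `code` inside a fresh scratch directory and
# always restore the caller's working directory afterwards.
with_scratch_dir <- function(code) {
  old <- getwd()
  scratch <- tempfile("scratch_")
  dir.create(scratch)
  on.exit({
    setwd(old)
    unlink(scratch, recursive = TRUE)
  }, add = TRUE)
  setwd(scratch)
  force(code)  # `code` is a lazy argument, so it runs only after setwd()
}

before <- getwd()
inside <- with_scratch_dir(getwd())
after <- getwd()
print(identical(before, after))
```

Calls to setwd() inside the wrapped code are then harmless to the rest of the session, because the on.exit() handler resets the directory when the helper returns.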

maelle commented 6 years ago

Editor checks:


Editor comments

Thanks for your submission @wlandau-lilly. Running goodpractice::gp() is actually my role but now we have your comments (and I get the same flags) so all is good. :wink:

devtools::spell_check() identified:

I'm now looking for reviewers.


Reviewers: @jules32 @benmarwick @gothub

Due date: 2017-01-04

maelle commented 6 years ago

@wlandau-lilly I forgot to mention you can now add this review badge to the README

[![](https://badges.ropensci.org/156_status.svg)](https://github.com/ropensci/onboarding/issues/156)

wlandau-lilly commented 6 years ago

Thanks, @maelle! I appreciate your forgiveness regarding goodpractice::gp(), and I just fixed the spelling mistake in wlandau-lilly/drake@5c9388a1a7873277332de26a3f8dc0de5bd94104. I will add the badge soon, and I am excited for the review process!

maelle commented 6 years ago

@wlandau-lilly, good news, the reviewers are now assigned!

@jules32 and @benmarwick, thanks a lot for agreeing to review this package! 😸 Your reviews are due on 2017-12-04.

wlandau-lilly commented 6 years ago

Yes, thank you @jules32 and @benmarwick! I look forward to your feedback.

maelle commented 6 years ago

@jules32 and @benmarwick, friendly reminder that your reviews are due on 2017-12-04 😉

wlandau-lilly commented 6 years ago

@jules32 and @benmarwick, could we touch base about timing? Drake is large and developing fast, so I do understand that reviews may be more difficult than is typical.

maelle commented 6 years ago

I forgot to update the thread when @jules32 contacted me to say she'd get the review in before Dec the 11th, sorry.

@benmarwick any update?

Thanks to both reviewers and thanks @wlandau-lilly for your patience. :-)

jules32 commented 6 years ago

Hi @wlandau-lilly et al,

I am going to get started on Thursday and this weekend since I was out of the office last week. Looking forward to getting to know this package!

benmarwick commented 6 years ago

Thanks for the reminders, here's my review:

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

p = partial x = complete

For packages co-submitting to JOSS
  • [p] The package has an obvious research application according to JOSS's definition
    BM: I see that you do not plan to submit to JOSS at the moment, so this is just an incidental comment: It is easy to imagine research applications for drake; it is a very solid contribution to an active area of workflow and provenance tracking tools. However, the research application would be more obvious if the readme referred to some actual real-world uses of the package. For example, a list of domain-specific research project repos where drake is used (e.g. by biologists, economists, whatever), or a list of publications reporting results that were generated or enabled using drake. Currently it looks like drake has great potential, but hasn't actually been used in any real-world applications. Perhaps it has, but it's not clear from the pkg docs. Examples of use would help potential users better understand how drake can help them.

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 2


wlandau-lilly commented 6 years ago

@benmarwick, your advice is exactly what I needed! Even before the inception of drake, I have had a strong ambition to challenge the R community's current conventions around reproducible data analysis and research. I want to make drake more relatable, understandable, usable, and widespread, and it will be gratifying to elaborate on the practical niche. Thank you for steering me back on course.

I have already started working on the changes you requested. I expect to have ample time next week, but I will be on vacation from December 16 through January 3, and I will be totally off the grid and unreachable from December 25 through January 3. I will eagerly resume work on January 4.

wlandau-lilly commented 6 years ago

Response to @benmarwick's review

It was gratifying to work on this response, and the changes to drake were straightforward. For the past year, I have been struggling to find the best way to talk about drake. Your feedback allowed me to make substantial progress.

It could be more clear what drake does that no other package already does, or why a user should use drake rather than make/remake/code chunks in Rmd/etc.

I agree. Please see the new "Similar work" section of the README (as well as the end of the new application.Rmd vignette). I now compare drake to Make, remake, and knitr.

The four points in 'similar work' section of the ropensci submission should also be in the readme.

Done. I added these points and expanded on them in the README.

That said, some of these are a matter of style, e.g. YAML vs data frames for config, rather than outright novel function. Are there any projects using drake that a user can inspect to see real-world applications? Publications that can be cited?

I have tried to search for real-world examples of drake in the wild, but I have not had success so far. I think it may be too early to see publicly-released projects and publications that use drake. However, I do know that @kendonB, @dapperjapper, and @AlexAxthelm are heavily using drake for their projects. I cannot share any of my own because they are all company confidential.

In the new "Similar work" section of the README, I now refer to real-world applications of Make and remake, and I argue that drake improves on both these tools for R users. I hope that helps. In the coming years, I will continue to search for publicly-available drake-powered projects, and I will keep a running list in the README.

My sense is that many packages aimed at workflow management or improving reproducibility are quite idiomatic. This makes it hard for the common-or-garden variety R-user to see how they fit into their own ways of working. If you can help to bridge that gap between your idioms and the average user, that would make this pkg much more useful and valuable to the community.

I completely agree! I have been struggling with communication this entire time. The first three sections of the current README and drake.Rmd vignette are new, and I think they are substantial improvements. I try to introduce drake using plain language, and I argue that it makes life easier.

I had no problem running these. However, I found the quickstart and examples difficult to relate to. For example, who writes their Rmd report in the console and passes it to an object as a character vector? That seems unnatural, to me at least, where the Rmd file is my main notebook and workbench. It would be easier to follow if the substance of the analysis was narrated in a little more detail. Perhaps a tiny actual research question with actual data would make this example more accessible? Perhaps also a comparison with a simple makefile to show makefile users (the main audience for this pkg) how to accomplish the same with drake (and why drake would be preferable). This would help a reader see how their existing workflow could be translated to the drake system. The drake system is a very comprehensive universe of functions, and new users will need a bit more guidance to see analogues between what drake does and what they're already using.

Absolutely! I was too entrenched in the details to realize this. I have added a new application.Rmd vignette for exactly this purpose, and I paired it with example code files that the user can generate with drake_example("application"). Here, I define a research question and use real data to address it. I also comment on how Make would be unwise for that particular use case.

Regarding the use of *.Rmd reports and knitr, please see the new knitr subsection of the README.

Could CONTRIBUTING.md be located at the top level of the repo for better visibility?

Done.

I see that you do not plan to submit to JOSS at the moment, so this is just an incidental comment: It is easy to imagine research applications for drake; it is a very solid contribution to an active area of workflow and provenance tracking tools. However, the research application would be more obvious if the readme referred to some actual real-world uses of the package. For example, a list of domain-specific research project repos where drake is used (e.g. by biologists, economists, whatever), or a list of publications reporting results that were generated or enabled using drake. Currently it looks like drake has great potential, but hasn't actually been used in any real-world applications. Perhaps it has, but it's not clear from the pkg docs. Examples of use would help potential users better understand how drake can help them.

I absolutely do plan to submit to JOSS in the future. Now is not the right time for me, however, and I am especially glad I received your feedback on the package itself first. And as I mentioned before, I am currently having trouble finding real-world applications of drake in the wild. I will continue searching, and I will gather and list them in the README when I see them.

Now I have a question: @maelle, may I return to this thread at a later date to fast-track a JOSS submission?

maelle commented 6 years ago

Thanks for your review, @benmarwick! Are you happy with the changes?

@wlandau-lilly, thanks for answering @benmarwick's review promptly. Three points from me:

maelle commented 6 years ago

Another argument in favour of pkgdown: you could create a grouping of functions as in this example which is what you have in your README now but without the documentation of each function accessible by one click.

I've also just noticed this phrasing "Most people think that means". Even if you have data underlying this, I think it looks a bit aggressive here, maybe replace it with "It does not only mean". :-)

wlandau-lilly commented 6 years ago

@maelle,

The JOSS submission would only need a paper.md and archived version of the repository. We do not need that before the end of onboarding, which will be at a later date. :wink:

Very much appreciated!

Why not create a website for the package using pkgdown? It'll make the vignettes easier to browse. See http://enpiar.com/2017/11/21/getting-down-with-pkgdown/

Good suggestion. Drake heavily relies on its vignettes, and pkgdown is a community standard for documentation. I expect to begin work on this soon.

Very naive question, does each command need to be something as basic as summary or could it be sourcing a larger script containing several regression calls?

I am glad you asked! Drake commands can be arbitrary R code, although I would avoid unquoted formulas because they may throw off the static code analysis that detects dependencies. This would not break make(), but it may create false-positive messages about missing imported objects or link spurious imported dependencies. So yes, you could have a large script containing several regression calls separated by ; or \n. This is yet another advantage over remake, which requires all commands to be single function calls with no nesting (except for I(), which declares string literals).

Inside drake, each command is wrapped in a protective function call in order to quarantine the side effects, so in general, only the return value of the code block should have an effect on the rest of make() (see wlandau-lilly/drake#39).
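
The two ideas above (static detection of dependencies, and evaluating a command so that only its return value escapes) can be sketched in base R. This is an illustration under stated assumptions, not drake's internals: `find_deps()` and `run_command()` are hypothetical helpers, the target names are borrowed from the basic example, and base R's all.vars() stands in for whatever analysis drake actually performs.

```r
# Hypothetical target names for illustration.
targets <- c("small", "large", "regression1_small")

# Crude dependency detection: symbols in the command that match known targets.
find_deps <- function(command, known = targets) {
  intersect(all.vars(parse(text = command)), known)
}
find_deps("reg1(small)")

# An unquoted formula contributes its variables as symbols too, which is
# one way a static analysis can pick up names that are not real dependencies.
all.vars(parse(text = "lm(y ~ x4, data = d)"))

# Evaluate a multi-statement command in a scratch environment so that
# intermediate variables do not leak; only the last value is returned.
run_command <- function(command, envir = globalenv()) {
  scratch <- new.env(parent = envir)
  value <- NULL
  for (expr in parse(text = command)) {
    value <- eval(expr, envir = scratch)
  }
  value
}

result <- run_command(
  "squares <- (1:10)^2; total <- sum(squares); total / length(squares)"
)
print(result)
```

After the last call, `squares` and `total` do not exist in the calling environment; only the final value of the command is kept.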

Large commands are not always good practice because they can make workflow plan data frames difficult to print properly. (Full disclosure: gather_plan() creates super long commands, so I am guilty.) I remember submitting a feature request to tibble to allow individual columns to be truncated, but I cannot seem to find the issue.

Every project needs a balance between having too many targets and assigning too much work to any individual target. The new application.Rmd vignette implicitly hints at a possible explosion in the number of targets for massive studies with crushing combinatorics. There is no one-size-fits-all solution.

I've also just noticed this phrasing "Most people think that means". Even if you have data underlying this, I think it looks a bit aggressive here, maybe replace it with "It does not only mean". :-)

You are right. In wlandau-lilly/drake@6bcfca7a12aafdcdb50cf5bb11904a8c3eaaac52, I just changed the sentence to "The R community likes to emphasize reproducibility, which one could interpret to mean..."

wlandau-lilly commented 6 years ago

FYI: the pkgdown site is now live. I love how it shows the vignettes!

maelle commented 6 years ago

Cool! I also like the grouping in the reference! Some suggestions from me as a naive user (I am being a bit of a reviewer here, but feedback is feedback 😉):

Now to help you present drake to naive users 😀 I think you should start with a "why use drake" section with content from the first 2 sections and less code, to convey the big message before the code (convince newcomers at a glance). I know what reproducibility is (hopefully 😀) but I could choose not to ever learn a new tool and have a makefile.R, which is a script with source calls to other scripts and knitting in the right order. I would rerun the entire thing if the data change. This is how I would present internal consistency in the beginning of the readme. You can write why it is worth taking the time to learn drake (because ultimately potential users would need to make that decision, and this while feeling too busy and/or not expert enough to learn a new tool): saving time in the future by not rerunning everything from scratch, being able to use high-performance computing (link to vignette), not too much learning time or frustration because of the great docs, and why drake vs. other Make tools (link to the related work section). Really, phrasing the readme in a short way with these arguments is IMO a good marketing strategy because more experienced users of Make-like tools can just scroll down to related work while you catch beginners' interest. I do not use any such tool yet and this is how I'd choose to stay on this website. User-friendliness/beginner-friendliness.

Hope this helps while waiting for the second review which might be postponed a bit. I think this is also consistent with what @benmarwick said.

Also ask the three current users you mentioned how they got introduced to the package but I imagine it was by discussing it with you since they are in the acknowledgements.

maelle commented 6 years ago

And thanks for your prompt answer to all feedback until now!

wlandau-lilly commented 6 years ago

Yes, the more feedback like this, the better! Super helpful!

could you make the order of groups, and of vignettes, from most important to least important/most complex (e.g. not starting with caching) instead of alphabetical?

Done.

could you make the readme more minimal? To be honest I find it overwhelming bc it is so dense, now with the site in place you can shorten it a bit. "Where to begin" and "handy functions" can disappear in favor of saying something like "drake has a documentation website. you can find a quick start in the quickstart vignette and more specific details about aspects such as parallel computing in the different articles listed" etc.

I removed the "Where to begin" and "handy functions" sections, and I explained the pkgdown site in the "Documentation" section.

I know the title of the readme is the origin of the name but it does not describe your package very well for newcomers. How about drake, a package to ensure reproducibility while saving you time?

Yes. I changed the title to "drake: stay reproducible and save time".

Now to help you present drake to naive users :grinning: I think you should start with a "why use drake" section with content from the first 2 sections and less code, to convey the big message before the code (convince newcomers at a glance). I know what reproducibility is (hopefully :grinning:) but I could choose not to ever learn a new tool and have a makefile.R, which is a script with source calls to other scripts and knitting in the right order. I would rerun the entire thing if the data change. This is how I would present internal consistency in the beginning of the readme. You can write why it is worth taking the time to learn drake (because ultimately potential users would need to make that decision, and this while feeling too busy and/or not expert enough to learn a new tool): saving time in the future by not rerunning everything from scratch, being able to use high-performance computing (link to vignette), not too much learning time or frustration because of the great docs, and why drake vs. other Make tools (link to the related work section). Really, phrasing the readme in a short way with these arguments is IMO a good marketing strategy because more experienced users of Make-like tools can just scroll down to related work while you catch beginners' interest. I do not use any such tool yet and this is how I'd choose to stay on this website. User-friendliness/beginner-friendliness.

Please see the top of the new README. I added a "why use drake" section and kept the subsequent three sections the same. The README has changed so fast over the past few months that I forgot I no longer had an abstract-like overview at the top.

Also ask the three current users you mentioned how they got introduced to the package but I imagine it was by discussing it with you since they are in the acknowledgements.

maelle commented 6 years ago

Thanks! I'll now be travelling until Wednesday so will have a look then. ☺

jules32 commented 6 years ago

Hi @wlandau-lilly et al,

Sorry for the delay here. I'm based in Santa Barbara, California, and with the huge wildfire that is ongoing, we've left town in the last few days.

I did have a look, and had a lot of the same thoughts as @benmarwick in trying to understand how the intended users of drake would know that it was the right tool for them. I know you are addressing some of Ben's comments before going on holiday; I'll have more comments for you in the New Year too.

Cheers, Julie

wlandau-lilly commented 6 years ago

Julie, I am sorry to hear that the fire came your way. I hope you, your family, and your friends are all safe and comfortable. Your feedback can wait as long as it needs to. Please be well.

jules32 commented 6 years ago

Thanks so much!

maelle commented 6 years ago

Thanks @jules32!

I assigned a third reviewer, @jeroen, to have a look at the implementation, not the interface. ☺

wlandau-lilly commented 6 years ago

Welcome, @jeroen. @maelle, this means we have a new timeline, right?

maelle commented 6 years ago

Yes! After discussing with @jeroen and @jules32, the new deadline is Jan the 4th, after your vacation. Sorry about the length of the process, but I think it will have been worth it with such a reviewers' dream team! 🦄🦄🦄

Will you soon have time for JOSS paper.md? See their instructions, it's really a short paper.

wlandau-lilly commented 6 years ago

That works for me. I will return refreshed and ready to respond.

I just read http://joss.theoj.org/about#author_guidelines, and it turns out that I completely misunderstood JOSS! I assumed that I would need to write a full-length journal article and that the expectations and process would be similar to JSS, etc.

I think a JOSS submission should be possible early next year, but it will take some time. Given all the great feedback I am about to receive through rOpenSci, I would rather wait. The paper.md will be quick to write, but my company requires a disclosure process for official academic journals, and there is not enough time left in the year to initiate a new disclosure. Also, each iteration of paper.md will need to be reviewed and approved all over again. But I can minimize the bureaucratic red tape if I am overprepared.

maelle commented 6 years ago

Ok great! I had a feeling you thought it was a long article. Have a great vacation!

wlandau-lilly commented 6 years ago

I have been thinking more about drake's accessibility to new users, especially @benmarwick's comment that it is difficult to relate to the vignettes. I did some expanding and refactoring, and as of now, two of the vignettes concentrate on more down-to-earth examples. Both run quickly to avoid bottlenecking the package quality checks, and the statistical methodology is elementary to keep things clear and simple. Each of these vignettes is paired with a set of example code files (available via drake_example("packages") and drake_example("gsp")).

  1. example-packages.Rmd is a tiny analysis of R package download trends using cranlogs. In this situation, some of the input data needs to be refreshed every day. The point is to show how drake brings the project up to date without restarting everything from scratch.
  2. example-gsp.Rmd focuses on a real econometrics dataset. The goal here is to show that drake easily scales up with the number of targets but GNU Make does not.

maelle commented 6 years ago

Cool -- I was also wondering (and have not checked myself, sorry) whether the R podcast episode about drake is listed in the documentation? It provides some useful context and history.

wlandau-lilly commented 6 years ago

Good question, @maelle. I have not mentioned it in the documentation, and I am still trying to decide whether I will.

wlandau-lilly commented 6 years ago

FYI: I just mentioned the podcast episode in the documentation section of the README (see the commit referenced above).

maelle commented 6 years ago

:wave: @jeroen and @jules32, friendly reminder that your review is due on Jan the 4th.😺

jeroen commented 6 years ago

😅

jeroen commented 6 years ago

I'm asking some help from @HenrikBengtsson to review the parallelism components.

jeroen commented 6 years ago

Sorry folks I'm not going to make the deadline. Can we push it back 1 month? 😯

jules32 commented 6 years ago

Hi All!

Sorry for the silence over break! I've just had the chance to review drake, but it is just a partial review at this point, partly because I'm approaching the 3 hour mark, and partly because I am getting an error installing the current version from GitHub (note: I know this is because you've been working on it in my silence! You are probably already fixing it, but the error is below).

My partial review is focused on the README and website, which have greatly improved following @benmarwick's comments and how @wlandau-lilly addressed them. I will put my suggestions in the following comment. I can plan to coordinate further edits on @jeroen's timeline if that's easiest.

In December (for the original deadline) I did begin to look through drake v4.4.1.9000, and installed it from GitHub with no problem.

However, tonight (January 3), I ran devtools::install_github("wlandau-lilly/drake", build = TRUE), which I believe installs 4.4.1.9002 (not yet a GitHub release). I got the following install error due to Ecdat:

...
* preparing ‘drake’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
Quitting from lines 17-25 (example-gsp.Rmd) 
Error: processing vignette 'example-gsp.Rmd' failed with diagnostics:
there is no package called 'Ecdat'
Execution halted
Installation failed: Command failed (1)

jules32 commented 6 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

JL: I am reviewing drake as an interface reviewer, and from a beginner-friendly angle. I played around with drake prior to @benmarwick's comments above, and have also followed the work that @wlandau-lilly has put towards addressing them.

This review is in progress, as I've just focused on the documentation for now.

Documentation

The package includes all the following forms of documentation:

# install drake from CRAN
install.packages("drake")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("wlandau-lilly/drake", build = TRUE)

JL: the following review is to be completed with drake v. 4.4.1.9002 or greater

For packages co-submitting to JOSS

JL: from the above thread it seems like this is in progress so I will wait to evaluate this

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

JL: the following review is to be completed with drake v. 4.4.1.9002 or greater

Final approval (post-review)

Estimated hours spent reviewing: 3.5


Review Comments

In December, I was running drake on my machine, and had no problems running the examples. But I had trouble seeing how to move past the examples to see how drake could be used from the ground up, and how *I* would use it. @wlandau-lilly has since made this a lot more clear, and I can see how this would be a good tool for more beginner-types to know about. These comments are kind of fine-tuning some of the work you've already done to help it resonate.

README suggestions

What gets done stays done

Having the Sisyphean loop example 1-4 is great: it's really helpful. It seems that with drake, this turns into something like:

  1. Launch the code
  2. Drake evaluates and rebuilds anything that has changed since the last run through

Is that true, and would that be worth itemizing like that in the README? And then here are some suggestions for commenting the example, which is a bit obvious but can be a bit easier to follow:

Example: `my_plan` lists 15 targets (analysis steps that have specific commands), and `drake` will evaluate them with its `make` function.

# Load drake's basic example and examine my_plan's analysis
library(drake) # install.packages("drake")
load_basic_example(verbose = FALSE)
head(my_plan)

##                    target                                      command
## 1             'report.md'             knit('report.Rmd', quiet = TRUE)
## 2                   small                                  simulate(5)
## 3                   large                                 simulate(50)
## 4       regression1_small                                  reg1(small)
## 5       regression1_large                                  reg1(large)
## 6       regression2_small                                  reg2(small)

# First round: drake builds all 15 targets.
make(my_plan) 

## target large
## target small
## target regression1_large
## target regression1_small
## target regression2_large
## target regression2_small
## target coef_regression1_large
## target coef_regression1_small
## target coef_regression2_large
## target coef_regression2_small
## target summ_regression1_large
## target summ_regression1_small
## target summ_regression2_large
## target summ_regression2_small
## target 'report.md'

# Then, you change the reg2 function; this will affect all regression2 targets.
reg2 <- function(d) {
  d$x4 <- d$x ^ 4
  lm(y ~ x4, data = d)
}

# Second round: drake only builds what was updated.
make(my_plan)

## target regression2_large
## target regression2_small
## target coef_regression2_large
## target coef_regression2_small
## target summ_regression2_large
## target summ_regression2_small
## target 'report.md'

# And if nothing was updated, drake doesn't try to rebuild.
make(my_plan)

## All targets are already up to date.

website: wlandau-lilly.github.io/drake

Mostly, these are small things that might be fixed with v. 4.4.1.9002 or greater.

  1. Something is wrong on the Reference page; instead of a short descriptor of the function, it repeats the function name after the word "Function". This was also the case in the R help for v. 4.4.1.9000.

  2. Get Started page:

    • I'd suggest starting this page with the "Where to Begin" part since the rest of it is on the homepage.
    • As you saw from my suggestion to the example code on the README, I really liked seeing the output from load_basic_example(verbose = FALSE); my_plan on this page. Seeing inside the my_plan variable is when I could really see myself using drake in my own workflow. It also lets us see reg2 before the example's second round.
  3. The packages example also really hit drake home for me. I like seeing drake_plan() being used to assign the targets and commands that we've seen in load_basic_example().

  4. Small thing, but when I look at the website on Chrome, in the tab it is labeled "Data Frames in R for Make — Drake". I know that that is the origin of the name, but I agree with @maelle's comment above that it's not intuitive (especially as a person who has too many tabs open all the time). If it is possible to have it say "drake" that would be awesome.
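To make the drake_plan() point in item 3 concrete, here is a minimal sketch in base R only. It assumes, based on the head(my_plan) output shown earlier, that a plan is an ordinary data frame with a target column and a command column. The variable name `plan` is made up for illustration, and nothing here calls drake itself; drake_plan() remains the intended way to build a plan.

```r
# Hand-built data frame with the same two columns that head(my_plan) prints.
# Target names and commands are copied from the example output above;
# simulate() and reg1() belong to the basic example and are not defined here.
plan <- data.frame(
  target  = c("small", "large", "regression1_small", "regression1_large"),
  command = c("simulate(5)", "simulate(50)", "reg1(small)", "reg1(large)"),
  stringsAsFactors = FALSE
)

# Each command is stored as text, so it can be inspected without being run.
print(plan)

# A command can mention another target by name. Parsing the command and
# listing its symbols shows which names regression1_small refers to.
symbols <- all.names(parse(text = plan$command[3])[[1]])
print(symbols)
```

Seeing the plan as plain rows of target and command is what made the workflow click for me: each row is one analysis step, and the names used inside a command hint at which earlier steps it relies on.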

maelle commented 6 years ago

Happy New Year everyone in this thread!

@jules32 Thanks a lot for your review!

I'm trying to find a technical reviewer who'd have time to review drake before Jeroen's available again, I'll update this thread once I know more. Sorry for the long process, @wlandau-lilly , and thanks again for your work on the package since the submission!

wlandau-lilly commented 6 years ago

Thank you, @jules32! I am eager to address your comments, and I expect to have a proper response within the next couple of days. For now, I will comment on the installation issue you mentioned, which I believe I addressed just now via https://github.com/wlandau-lilly/drake/commit/16a4ea5dcd8017dbe2c756e3e4d301565f3fd4ee. The Ecdat package is only required for a vignette, so I added it to the Suggests: field of the DESCRIPTION. Rather than move it to Imports: or Depends:, I simply removed build = TRUE from the call to install_github(). (In hindsight, build = TRUE seems excessive anyway.) I also added special instructions for building the vignettes.

maelle commented 6 years ago

@wlandau-lilly I've contacted several potential technical reviewers to see if they could review this package rapidly, without success, which is maybe not surprising at this time of the year after the holidays. A non-rapid review would take the usual 3 weeks, which is not much shorter than one month. I therefore propose we wait for @jeroen's review, with a new due date, 2018-02-04. I'm very sorry about that!

wlandau-lilly commented 6 years ago

@maelle I was hoping to complete this before rstudio::conf(2018), but I guess it can't be helped. And I realize that the winter holidays are not the right time to do work. Thank you for trying.

jeronjacob commented 6 years ago

I think you may have notified me accidentally. I am not involved in this project.

wlandau-lilly commented 6 years ago

Sorry, @jeronjacob. Feel free to unsubscribe from this thread.

wlandau-lilly commented 6 years ago

Response to @jules32's January 3 partial review

Your feedback and advice are extremely helpful, and your encouragement is gratifying. I want to reach as many new users as possible, so I care a lot about outside feedback on the documentation. Thank you for your efforts.

I agree with all your suggestions from January 3, and I believe I addressed them all in https://github.com/wlandau-lilly/drake/commit/16a4ea5dcd8017dbe2c756e3e4d301565f3fd4ee through https://github.com/wlandau-lilly/drake/commit/f76f6e2127ac8003a8c5417a667ae9b9141ae15a. Please let me know if you think I missed anything.

A big question I'm still left with after the README and the (super-helpful) website is which functions users will use as they get started, and over and over again. We have seen make and drake_config, but that's after a lot of other drake magic has gone on behind the scenes. Would it be possible/desirable to make a list (and link to the website's reference page)? I know you're trying to cut down the README so some of this could go on the website's Get Started page perhaps.

The Documentation section of the README now includes a list of the 10 most important functions, given roughly in the order I expect a user to call them, and it also refers to the reference section of the documentation website. The documentation website's main page has an identical section.

JL: I don't think it's necessary to include options to install different tags/releases from GitHub. I think if a user wants that, they'll know where to look.

Done. Like the rest of the changes to the README, this change is also reflected on the documentation website.

For packages co-submitting to JOSS

JL: from the above thread it seems like this is in progress so I will wait to evaluate this

I have just begun my company's scientific disclosure process to release my JOSS manuscript. It is essentially the "Why use drake?" section of the README with some added metadata and references.

Having the Sisyphean loop example 1-4 is great: it's really helpful. It seems that with drake, this turns into something like:

  1. Launch the code
  2. Drake evaluates and rebuilds anything that has changed since the last run through

Is that true, and would that be worth itemizing like that in the README?

Yes, absolutely. The continuity of the format highlights the contrast between approaches. I made the change.

And then here are some suggestions for commenting the example, which is a bit obvious but can be a bit easier to follow: ...

Narration is definitely helpful here. I have added very similar comments to the code.

Something is wrong on the Reference page; instead of a short descriptor of the function, it repeats the function name after the word "Function". This was also the case in the R help for v. 4.4.1.9000.

That was just my own laziness. I went back and changed all the titles to be informative. The reference page and help files are now improved.

Get Started page:

  • I'd suggest starting this page with the "Where to Begin" part since the rest of it is on the homepage.

Come to think of it, the "Where to begin" section is the only unique part of the Get Started page. I have now removed all the sections that were repeated on the main page and README. You can also see the changes in the underlying drake.Rmd vignette.

Small thing, but when I look at the website on Chrome, in the tab it is labeled "Data Frames in R for Make — Drake". I know that that is the origin of the name, but I agree with @maelle's comment above that it's not intuitive (especially as a person who has too many tabs open all the time). If it is possible to have it say "drake" that would be awesome.

Done. The title you saw was automatically generated by pkgdown. The change required some more post-processing, but I agree that it was necessary.

wlandau-lilly commented 6 years ago

FYI: I just added the JOSS submission docs in https://github.com/wlandau-lilly/drake/commit/1c9c67b6495330d5674315bc019e5473d0e7a4ab. You can compile the pdf with pandoc --bibliography paper.bib paper.md -o paper.pdf. Since a DOI for a GitHub repo needs a tag/release, I will wait to generate one for drake until all the reviewers here approve the rest of the package.

wlandau-lilly commented 6 years ago

Also, I would like the DOI to correspond to a version of drake whose logo and links refer to https://github.com/ropensci/drake (see also https://github.com/wlandau-lilly/drake/pull/176) and the CRAN release of version 5.0.0.