@jules32 recruited a technical reviewer! :tada: Thanks a ton! Welcome, @gothub, and thanks a lot! Is 2018/01/22 a good deadline for your review?
Here is our review template and our reviewing guide
Yes, thank you @gothub!
By the way: @benmarwick, do you think I sufficiently addressed your review? I made several improvements since my first response on December 9.
@maelle yes, I'll have my review in by 2018/01/22.
Fantastic, thanks for being so flexible!
@maelle here is my review:
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
- [p] A statement of need clearly stating problems the software is designed to solve and its target audience in README

  The purpose of the package is clearly stated. I am not a statistician but rather a software engineer who has worked with environmental scientists and computational ecologists, so my evaluation is based on that perspective. Therefore I may not be able to fully verify the claims made about the usefulness of this software for the target audience.
- [x] Installation instructions: for the development version of the package and any non-standard dependencies in README

  There were quite a few 'suggested' packages, but the software is well designed to use the standard R methods to detect and use optional packages if available.
- [x] Vignette(s) demonstrating major functionality that runs successfully locally

  The vignettes are clearly written and appear to describe the package functionality thoroughly. The 'drake' vignette provides an overview of the package, but I believe that the 'quickstart' vignette is also required reading in order to understand the scope of the package. The link in the 'drake' vignette to the 'quickstart' points to the documentation website, although the equivalent 'quickstart' vignette is included with the R package. I'm not familiar with the types of workflows that the package processes, where dataset replicates and analyses are automatically generated ('quickstart' vignette, section 4), so I'm not able to fully evaluate these vignettes. The 'basic example' (i.e. from 'load_basic_example') is used throughout the documentation, so the basic setup of a 'template' workflow is not described fully in the vignettes or R command documentation. There are also other examples available that are more detailed, and these can be listed with 'drake_examples()'. It would be very useful to provide a list of these examples in this vignette with a very brief description of each. I found these additional examples very helpful (more helpful than the basic example) as they used actual data with a processing goal explained.
- [x] Function Documentation: for all exported functions in R help

  The function documentation is complete and detailed with examples. Some of the examples are not run (by roxygen). Does devtools::check() evaluate these examples?
- [p] Examples for all exported functions in R Help that run successfully locally

  As mentioned, some of the examples are not run (i.e. '\dontrun'), and I have not searched for and tested all 'not run' examples. I ran checks manually for some of the functions, without errors: drake_config(), drake_plan(), make(), outdated().
- [x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports`, and `Maintainer` (which may be autogenerated via `Authors@R`).
For packages co-submitting to JOSS
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a `paper.md` matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 5
The documentation makes several statements that this package supports "reproducibility", and even mentions that the reproducibility supported is different than scientific replicability. The meaning of this term as it relates to the package functionality is not clear to me. In my experience at NCEAS this term generally means that source data, processing artifacts and documentation of a workflow are persisted and available so that they can be obtained by a researcher who could replicate any processing products on their own, without any guidance from the original researcher. Does 'drake' somehow assist in this type of reproducibility?
An alternative use of reproducibility might be for an individual researcher (or team) to always expect consistent results even when any of their source data or processing steps have been modified. Is this the claim that the documentation is making? If a processing workflow is configured in drake, will consistent results always be generated? If that is the case, then how is this more reliable than running all scripts manually, even though this may be inefficient? I'm not trying to play devil's advocate here, just trying to clarify and understand the statements made in the documentation. In working with environmental scientists and ecologists for several years, I have found that the definition of reproducibility is fairly specific, so elaborating on the use of this term for the package may bring more clarity.
@wlandau-lilly the example descriptions in drake.Rmd and README look great.
Thanks, Peter! I hope to have a proper response to your review soon.
I sincerely appreciate your timely and thorough review! I found your feedback extremely helpful, and I have responded to specific comments below.
There were quite a few 'suggested' packages
Yes, most of these optional packages support the vignettes, examples, custom caches, and advanced high-performance computing options.
The link in the 'drake' vignette to the 'quickstart' points to the documentation website, although the equivalent 'quickstart' vignette is included with the R package.
I have tweaked the beginning of the drake.Rmd vignette so it links to both the quickstart vignette source and the rendered online version.
The 'basic example' (i.e. from 'load_basic_example') is used throughout the documentation, so the basic setup of a 'template' workflow is not described fully in the vignettes or R command documentation.
I know it is buried very deep, but the example from `load_basic_example()` is actually paired with its own saved workflow. The code files are located here, referenced here, and generated by `drake_example("basic")`. The R script sets up the code and carries out a heavily-commented tutorial that walks through some of `drake`'s main features. I do agree that I should make all this more available, so I have added extra references to the vignettes and examples. Now, wherever `load_basic_example()` is called, there is a comment that says `# Get the code with drake_example("basic")`.
There are also other examples available that are more detailed, and these can be listed with 'drake_examples()'. It would be very useful to provide a list of these examples in this vignette with a very brief description of each. I found these additional examples very helpful (more helpful than the basic example) as they used actual data with a processing goal explained.
There is a listing of examples here, but again, it is buried and far from obvious. I have replicated this list in a new section of the drake.Rmd vignette, and the documentation section of the README now describes the beginner-oriented examples.
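For reference, discovering and fetching the examples is a two-step affair (a minimal sketch using the functions named above):

```r
library(drake)
drake_examples()        # list the names of all built-in examples
drake_example("basic")  # write the files of the basic example to a folder in the working directory
```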
The function documentation is complete and detailed with examples. Some of the examples are not run (by roxygen). Does devtools::check() evaluate these examples?
I chose to suppress many of the examples because some are time-consuming and others require the optional packages. And yes, `devtools::check()` runs the examples, and it skips the code enclosed in `\dontrun{}`. You can run all the code, even the code inside `\dontrun{}`, with the `run` argument of `devtools::run_examples()`. I routinely check `devtools::run_examples(run = TRUE)` and `devtools::run_examples(run = FALSE)`.
The parallel execution functionality appears to be powerful and is clearly explained, but I have not run tests to confirm processing time gains when using the parallel backends vs serial processing. If the parallel execution functionality is important to rOpenSci for inclusion of this package, then it may be useful to include a long running example for use as a benchmark with execution times shown for each backend and for serial processing. I'm not suggesting that this is necessary, but this could be suggested by the editor if deemed a requirement.
You bring up an important point, and I do plan to do serious performance studies in the future. @maelle, do you think we need a long-running example for rOpenSci?
The documentation makes several statements that this package supports "reproducibility", and even mentions that the reproducibility supported is different than scientific replicability. The meaning of this term as it relates to the package functionality is not clear to me. In my experience at NCEAS this term generally means that source data, processing artifacts and documentation of a workflow are persisted and available so that they can be obtained by a researcher who could replicate any processing products on their own, without any guidance from the original researcher. Does 'drake' somehow assist in this type of reproducibility?
I am so glad you brought this up. Yes, the data need to be available and the documentation is important. But without a tool like `drake`, we don't know if a project is actually reproducible until we try to recompute the results from the beginning. That usually takes a ton of time, effort, and digging. On the other hand, if `outdated()` says that all the targets are up to date, we get tangible evidence for free. We know someone else could reproduce the results from the inputs because `drake` says they are synced. We know that the code and data have not changed meaningfully since the last time the results were generated. This quick automated check is extremely valuable when the alternative is so painstaking and difficult.
Other reproducibility advantages:

- `vis_drake_graph()` provides big-picture guidance as to how the pieces of the analysis fit together.
- Targets stay in sync with the underlying code and data every time you run `make()`.
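A minimal sketch of that kind of check on the built-in basic example (function names from drake around version 5.0.0; exact signatures may differ in other releases):

```r
library(drake)
load_basic_example()             # Get the code with drake_example("basic")
config <- drake_config(my_plan)  # my_plan comes from load_basic_example()
outdated(config)                 # character(0) would mean everything is up to date
vis_drake_graph(config)          # interactive dependency graph of the workflow
```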
An alternative use of reproducibility might be for an individual researcher (or team) to always expect consistent results even when any of their source data or processing steps have been modified. Is this the claim that the documentation is making?
Not exactly. `Drake` is about detecting when the data/processing has been modified, when it has not, and what to do in each eventuality.
If a processing workflow is configured in drake, will consistent results always be generated? If that is the case, then how is this more reliable than running all scripts manually, even though this may be inefficient?
For large projects, running all the scripts manually is difficult enough for `drake` to make a huge practical difference. I strongly believe that checking internal consistency early and often is super important for reproducibility.

You brought up another kind of consistency that I find interesting. If we repeat the analysis from scratch a large number of times, will we always get the same final answer? For stochastic methods like MCMC, this is another important consideration for reproducibility. However, it is outside the scope of `drake`. `Drake` does make some effort to preserve random seeds (see wlandau-lilly/drake#56 and wlandau-lilly/drake#109), but that hardly scratches the surface of the issue.
My $0.02 comment on:
"... include a long running example for use as a benchmark with execution times shown for each backend and for serial processing ..."
It'll be impossible to find a one-size-fits-all use case that will be representative and that can be used to produce benchmark statistics that serve as guidance for a random user. Each user/project will have its own use case/pipeline to benchmark - I think that's what should be benchmarked. And after the first round of parallelization comes tweaking/optimization of how to best structure the parallelization of your analysis - and there are often multiple ways. Factors such as the amount of RAM, the size of the files processed, other tenants running on the same system at the same time, and so on, also come into play.

So, I would be careful making benchmark claims for large workflows, especially since such "results" tend to stick in the community as the ultimate facts/truth of a particular tool. It's easier to benchmark a very specific problem, but Drake is in the business of being a generic workflow manager with automatic parallelization as a feature.
Your comments are gratifying, Henrik. I still plan to study performance and optimization at some point, but for right now, you are making my job seem less daunting.
When it comes to parallel computing, I claim that `drake` processes and caches targets in parallel (in stages) and in the correct order. I also claim that the work deploys to the user's parallel backend of choice. With the function `in_progress()` (or equivalently, `progress()` or `vis_drake_graph()`) and your system's job monitoring software, I think this should all be straightforward to verify for one of the built-in examples if you think there is still a need. The basic example is probably easiest because of `load_basic_example()`. You may need to artificially slow down the workflow, maybe by appending `; Sys.sleep(60)` to all the commands or making the input datasets large enough.
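A rough sketch of that kind of spot check on the basic example (the `jobs` value below is just an illustration, and the argument names reflect drake around version 5.0.0):

```r
library(drake)
load_basic_example()   # Get the code with drake_example("basic")

# Deploy up to 4 targets at a time within each parallelizable stage.
make(my_plan, jobs = 4, parallelism = "mclapply")

# From another R session, while make() is still running:
in_progress()          # targets currently being built
progress()             # build status of every target so far
```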
`Drake`'s contribution is figuring out the parallel arrangement of targets automatically. For example, a machine learning expert need not worry that the neural nets depend on the training data. `Drake` knows that the data must be processed first, and it knows which neural nets can be computed at the same time. Beyond that, it borrows its power from other tools. The `storr`, `parallel`, `future`, `future.apply`, `future.batchtools`, and `batchtools` packages do the heavy lifting.
Late last year, @kendonB heavily used `drake` to deploy a massive project to his organization's SLURM cluster (via `make(..., parallelism = "future_lapply")`). He uncovered a huge batch of performance issues, and `drake` is a much better package because of his on-target feedback. I fully expect more bottlenecks to arise as more power users use the package in earnest.
@gothub, I realized that I did not adequately address your point about the lack of a template workflow. It is difficult to have a template when `drake` focuses on a project's R session rather than its file system. You need a workflow plan data frame with targets and commands, plus any supporting functions and files, but otherwise, there is a lot of flexibility. The built-in examples are probably the best reference.
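For a sense of scale, a minimal session-oriented project might look something like this sketch (hypothetical function and target names, not taken from the built-in examples):

```r
library(drake)

# Supporting functions, defined in the R session (or source()'d from a script).
munge     <- function(raw) raw[complete.cases(raw), ]
fit_model <- function(dat) lm(Sepal.Width ~ Petal.Width, data = dat)

# The workflow plan: a data frame of targets and commands.
my_plan <- drake_plan(
  dat   = munge(iris),
  model = fit_model(dat),
  coefs = coef(model)
)

make(my_plan)  # builds dat, then model, then coefs, in dependency order
```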
Also, I rephrased my justification of reproducibility. I struggle to explain it because it is a nontraditional interpretation of the idea, at least in the statistics and broader scientific communities.
Yes, the data need to be available and the documentation is important. But without a tool like drake, we don't know if a project is actually reproducible until we try to recompute the results from the beginning. That usually takes a ton of time, effort, and digging. On the other hand, if outdated() says that all the targets are up to date, we get tangible evidence for free. We know someone else could reproduce the results from the inputs because drake says they are synced. We know that the code and data have not changed meaningfully since the last time the results were generated. This quick automated check is extremely valuable when the alternative is so painstaking and difficult.
Thanks a lot @gothub!
Rich discussion here! @wlandau-lilly please make sure to insert all clarifications from this thread into the documentation 😉
When you're done, update this thread with a summary of answers to reviewers (a short one per review) in order to make it easier for them to sign off on your changes even if they have little time. ☺ Good job on using their comments until now!
@maelle, in the above two commits, I have inserted the clarifications about reproducibility and high-performance computing in the documentation. From re-reading the thread, I think these are the major clarifications. Are you thinking of any more?
Also, I realized that I totally missed the point of one of your earlier questions.
Very naive question, does each command need to be something as basic as summary or could it be sourcing a larger script containing several regression calls?
Commands can be arbitrary code chunks. For example:
```r
my_plan <- drake_plan(
  a = 1 + 1,
  b = {
    x <- pi + a
    y <- sqrt(x)
    rand <- rnorm(10, sd = y)
    mean(rand)
  },
  c = a - 5,
  d = c(b, c)
)
```
However, a command should NOT just `source('large_command.R')`. `Drake` is function-oriented, not file-oriented. I am beginning to see widespread confusion here (for example, wlandau-lilly/drake#193), and I am doing my best to fix it. I have a new best practices guide that spells out the pitfalls, and I plan to add more over the course of future development.
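To spell out the pitfall with a small sketch (hypothetical file, function, and object names; `my_data` is assumed to already exist in the session): instead of sourcing a monolithic script inside a command, wrap the work in a function and call that function as the command.

```r
# Discouraged: the command hides all of its real work inside one big script,
# so drake cannot break the analysis into separately tracked targets.
bad_plan <- drake_plan(results = source("analysis.R"))

# Preferred: put the work in a function so drake can track it and its inputs.
run_analysis <- function(dat) summary(lm(y ~ x, data = dat))
good_plan <- drake_plan(results = run_analysis(my_data))
```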
Good point about summarizing my responses to reviewers. This is a massive thread.
`Drake` is large and difficult to review, and I am grateful for all your hard work. `Drake` is a much better package because of your feedback.

- Improved the documentation of `drake`, particularly `load_basic_example()`, the quickstart vignette, and the drake.Rmd vignette. The documentation changed, but the code behind it needed only the tiniest modifications.
- Compared `drake` to similar work: GNU Make, remake, memoise, and knitr.
- Explained why `drake` surpasses these tools for its intended use case.
- Explained how users can adopt `drake` for their own projects.
- Explained why `drake` surpasses Make.
- Postponed a formal performance study of `drake`. I think it is still too early.
- Clarified `drake`'s high-performance computing claims here.
- Removed the recommendation to install `drake` with `build = TRUE` from `install_github()` so the "Suggests:" packages such as `Ecdat` are not required.
- Explained how `drake` conquers the Sisyphean loop of naive data science.
- Added the comment `# Get the code with drake_example("basic")` wherever `load_basic_example()` is called in the documentation. That way, users know how to obtain the underlying code.
- Listed the examples from `drake_examples()` here in the README, here in the drake.Rmd vignette, and here in the Get Started page.
- Clarified what `drake` is claiming about its parallel computing functionality: https://github.com/wlandau-lilly/drake/commit/020f4fb7e452765c9e138e07e5f64a28aa83711e. See here in the parallelism vignette and here in the cautionary vignette. Please also see this comment by @HenrikBengtsson.
- Clarified `drake`'s role when it comes to reproducibility.
- Explained how `drake` helps with reproducibility.
- With the `run` argument to `devtools::run_examples()`, you can override the `\dontrun{}` command and execute all the code in the examples. I routinely run `devtools::run_examples(run = TRUE)` and `devtools::run_examples(run = FALSE)`.

@jules32 @benmarwick @gothub are you happy with the changes made?
Thanks @wlandau-lilly for the summary, and great on having used all feedback! 😸
Yes, thanks, I'm happy. You've done lots of work to address the concerns in my review, that's excellent to see. The package is much more accessible now, and it's easier to see how to get started using it.
Thank you, @benmarwick! I am so glad you think the changes are making a difference.
By the way, I just updated the review summaries in light of wlandau-lilly/drake#195.
All of the points I mentioned have been fully addressed with clear and complete documentation, and I have no other issues, so thumbs up from me.
I have one remaining question that is not really an issue for the review, but I'm just curious about. If I were to develop a workflow with drake, what drake-specific artifacts would need to be preserved for a researcher to reproduce my analysis (other than the R scripts and data that may have existed before even starting to build the drake workflow)? I'm assuming that it's simply a CSV from the workflow itself, e.g. 'my_plan.csv' from the basic example. If more than that is required, then explaining that in the docs or having an export function would be a good idea.
Yes, I'm happy too. You've addressed the ideas I had and think that this is good to continue moving forward. Nice work!
Thank you so much, @gothub and @jules32! It is exciting to have confirmation from all three reviewers!
@gothub, I really like your question. I will try to break down the answer.
As you say, we need any code that writes functions, generates workflow plans, loads packages, and calls `make()`.

In the way I use `drake` (and the way I picture others using it), the workflow plan does not actually need to be saved as a csv file. Typically, it is super quick to programmatically generate in an R script before `make()` is called. I usually keep a `plan.R` file that uses `drake_plan()`, `plan_analyses()`, `plan_summaries()`, `expand_plan()`, and `evaluate_plan()` to generate the plan. The advantage here is that you can typically spin up thousands of targets with as few as 10-20 lines of human-readable code. (With `remake` or Make, you would have to write it all out by hand, which is often intractable.) Then, I `source("plan.R")` before calling `make()`.
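A rough illustration of that pattern (the `simulate()` and `my_model()` functions are hypothetical, and the wildcard conventions reflect drake around version 5.0.0):

```r
# plan.R: programmatically generate the workflow plan before calling make().
library(drake)

# One template command, expanded over a wildcard into several dataset targets.
datasets <- drake_plan(data = simulate(n = size__))
datasets <- evaluate_plan(datasets, wildcard = "size__", values = c(16, 64, 256))

# One analysis of each dataset, via the ..dataset.. wildcard of plan_analyses().
methods  <- drake_plan(fit = my_model(..dataset..))
analyses <- plan_analyses(methods, datasets = datasets)

my_plan <- rbind(datasets, analyses)  # a data frame of targets and commands
```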
Supporting input files and imported functions/objects need to be preserved too. Examples include optional `report.Rmd` files, other knitr reports, user-defined CSV data files, and functions created in the R scripts. It's always a good idea to check for missing input files and imported functions/objects. You could either use the `missed()` function, or you could use `vis_drake_graph()` and look for the purple nodes, which indicate missing objects or files.
Packages are trickier. Maintaining a reproducible package environment is beyond the scope of `drake`, and the documentation recommends packrat for this purpose (mentioned in the quickstart and caution vignettes). But if you do not use `packrat`, there is still hope. By default, each `make()` saves the `sessionInfo()` to the cache (use the `session_info` argument to deactivate). An independent researcher just needs to call `drake_session()` to learn about the required versions of R, packages, and other system requirements.
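A minimal sketch of that record-keeping, reusing `my_plan` from the basic example (argument and function names as described above; they may differ in other drake versions):

```r
make(my_plan)                        # stores sessionInfo() in the cache by default
make(my_plan, session_info = FALSE)  # skip storing the session information
drake_session()                      # recover the R and package versions of the last make()
```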
Including the `drake` cache (usually a hidden `.drake/` folder) lets independent researchers use `drake_session()` to learn about the package environment, and it allows them to call `make()` to check that everything is up to date (and thus re-creatable). This can be tricky: by design, the `drake` cache (which is just a `storr::storr_rds()`) divides its data over lots of tiny files, which is essential to avoid race conditions in parallel computing. However, this setup behaves poorly under version control, and it may even make it a bit annoying to use a shared file system like Dropbox or Google Drive.
For the version control bit, @noamross recently suggested that `drake` supply a flat text log file to give a list of targets and their fingerprints. That way, the git commit history of the log tracks changes to the targets over time. Right now, `drake` supports `make(..., cache_log_file = "my_log_file.txt")`, `make(..., cache_log_file = TRUE)`, and `drake_cache_log_file()`.
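A sketch of how that might fit into a version-controlled project (the log file name is just an example):

```r
library(drake)
load_basic_example()
make(my_plan, cache_log_file = "drake_cache.log")  # flat text log of targets and fingerprints
# Commit drake_cache.log alongside the code; its diffs show which targets changed.
```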
@benmarwick, @jules32, @gothub, and @maelle, would any of you like to be listed in the DESCRIPTION as reviewers (`person(..., role = "rev")`)?
Editors shouldn't be listed :-) but thanks for asking!
@wlandau-lilly Thanks for the explanation of `drake` persistence.
Yes, please include me as a listed reviewer of the package, and thanks!
Done, thanks.
By the way, an Appveyor build failed just now because `listenv` is in the middle of an update on CRAN.
```
There are binary versions available (and will be installed) but the
source versions are later:
        binary source
listenv  0.6.0  0.7.0
tibble   1.4.1  1.4.2
Error in download.file(url, destfile, method, mode = "wb", ...) :
  cannot open URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/listenv_0.6.0.zip'
```
In my experience, this kind of error just blows over if I wait a bit. Would that be okay? I ask because my sense is that `drake` is about to be approved, so the timing is not ideal.
No problem, I'll test it on Windows later this week when I do the last checks!
Approved! Thanks everyone for a productive and (I hope) enjoyable review process!
To-dos
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). Let me know if you are interested, and I'll ping the rOpenSci community manager.
I forgot to mention that R CMD check passed without any issue on my laptop on Windows. You might just need to justify having "rev" roles in cran-comments.md (but then it gets accepted, because that's a valid use of the role).
Wow, @maelle! Thank you so much! This has been an extremely rewarding and gratifying review process, and I am super grateful for your hard work!
One note before I begin the transition: now that drake is moving to rOpenSci, I want to take this opportunity to shift my open source work to my personal account, @wlandau. My colleagues and I agree this is for the best. Would you invite @wlandau to rOpenSci?
Done!
Wonderful! I transferred the repo, which was super exciting. One small thing: I do not have access to the "Settings" tab, and I cannot change the link to the documentation website at the top. Am I just over-eager, or is there another step?
I think I found the problem, and it is probably a common issue. @maelle, would you add me as a maintainer at https://github.com/orgs/ropensci/teams/drake/members?
Done! 😸
Fantastic! Now, all the links reference rOpenSci rather than @wlandau-lilly. Before I activate Zenodo and submit to JOSS, I will make my final preparations for the release of 5.0.0 (soon to go to CRAN).
I have prepared a narrative-form blog post, and I am excited to share it. Is there somewhere I can submit a PR?
Also, could I append an rOpenSci footer to the README?
Pinging @stefaniebutland
Also, before I create a release for Zenodo, I think I should confirm with @jules32 and @benmarwick about being listed as reviewers in the package itself.
Hi All,
This is exciting to see happen! I am fine not being listed as a reviewer; I hope that my suggestions were helpful but, as they were minimal, I don't think I need to be included in the DESCRIPTION.
But thank you for asking!
Thanks for letting me know, Julie. Your suggestions absolutely helped, and I am glad you were part of this. Please let me know if you change your mind.
@jules32 as an editor I found your review very good and will most certainly try and recruit you again!
Hello @wlandau-lilly. I'm rOpenSci's Community Manager.
I have prepared a narrative-form blog post, and I am excited to share it. Is there somewhere I can submit a PR?

We'd love to have a post about this. Please submit a PR according to: https://github.com/ropensci/roweb2#contributing-a-blog-post. I will review it, provide some feedback and a suggested date to publish it.
Thank you!
Thanks, @stefaniebutland! I just submitted ropensci/roweb2#140. Really looking forward to the next steps.
@maelle, I have added @benmarwick as a reviewer in the DESCRIPTION (permission granted via email), and I have completed the 5 to-dos from here. I also submitted `drake` 5.0.0 to CRAN, added an rOpenSci footer to the README, and volunteered as a reviewer. This is all super exciting!
Besides the separate processes for the blog entry and the JOSS paper, there are a couple minor loose ends that will probably resolve themselves.
[![Build status](https://ci.appveyor.com/api/projects/status/4ypc9xnmqt70j94e?svg=true)](https://ci.appveyor.com/project/ropensci/drake)
Thanks, @maelle. The AppVeyor builds appear to be working. I will check back on the badge in a few days.
Another minor thing: I am getting emails like this one on every commit:
```
Build ropensci/drake@810b97d5ba successful: https://ropensci.ocpu.io/drake/
NEW COMMITS: [810b97d5ba] Time: 2018-01-26 08:39:16 ...
```
Is there a way to attenuate them? Are the other watchers receiving these emails, or just the maintainers? I do not see an OpenCPU webhook for drake, so I am guessing it is at the org level.
Those emails are from the CI system; they are sent to the pusher only. You can ignore/filter them.
Good enough for me, thanks @jeroen.
Summary
The drake package is an R-focused pipeline toolkit. It reproducibly brings results up to date and automatically arranges computations into successive parallelizable stages. It has a Tidyverse-friendly front-end, powerful interactive visuals, and a vast arsenal of multicore and distributed computing backends.
URL: https://github.com/wlandau-lilly/drake
Fit: drake falls easily within reproducibility and high-performance computing.
Target audience: anyone who uses R for medium-to-long computations for which the results need to stay up to date with the dependencies.
Similar work
Remake
Drake overlaps with its direct predecessor, remake. In fact, drake owes its core ideas to remake and @richfitz, and explicit acknowledgements are in the documentation. However, drake surpasses remake in several important ways, including but not limited to the following.
Built-in examples are available through `drake::example_drake()`.

Factual's drake
Factual's drake is similar in concept, but the development effort is completely unrelated to the R package of the same name.
Other pipeline toolkits
There are many other successful pipeline toolkits, and the drake package distinguishes itself with its R-focused approach, Tidyverse-friendly interface, and parallel computing flexibility.
Requirements
Confirm each of the following by checking the box. This package:
Publication options
The package contains a `paper.md` matching JOSS's requirements with a high-level description in the package root or in `inst/`.

I plan to submit to JOSS in the future, but the manuscript is not currently ready.
Detail
Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings: