@jules32 recruited a technical reviewer! :tada: Thanks a ton! Welcome, @gothub, and thanks a lot! Is 2018/01/22 a good deadline for your review?
Here is our review template and our reviewing guide
Yes, thank you @gothub!
By the way: @benmarwick, do you think I sufficiently addressed your review? I made several improvements since my first response on December 9.
@maelle yes, I'll have my review in by 2018/01/22.
Fantastic, thanks for being so flexible!
@maelle here is my review:
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
- [p] A statement of need clearly stating problems the software is designed to solve and its target audience in README

  The purpose of the package is clearly stated. I am not a statistician but rather a software engineer who has worked with environmental scientists and computational ecologists, so my evaluation is based on that perspective. Therefore I may not be able to fully verify the claims made about the usefulness of this software for the target audience.
- [x] Installation instructions: for the development version of the package and any non-standard dependencies in README

  There were quite a few 'suggested' packages, but the software is well designed to use the standard R methods to detect and use optional packages if available.
- [x] Vignette(s) demonstrating major functionality that runs successfully locally

  The vignettes are clearly written and appear to describe the package functionality thoroughly. The 'drake' vignette provides an overview of the package, but I believe that the 'quickstart' vignette is also required reading in order to understand the scope of the package. The link in the 'drake' vignette to the 'quickstart' points to the documentation website, although the equivalent 'quickstart' vignette is included with the R package. I'm not familiar with the types of workflows that the package processes, where dataset replicates and analyses are automatically generated ('quickstart' vignette, section 4), so I'm not able to fully evaluate these vignettes. The 'basic example' (i.e. from 'load_basic_example') is used throughout the documentation, so the basic setup of a 'template' workflow is not described fully in the vignettes or R command documentation. There are also other examples available that are more detailed, and these can be listed with 'drake_examples()'. It would be very useful to provide a list of these examples in this vignette with a very brief description of each. I found these additional examples very helpful (more helpful than the basic example) as they used actual data with a processing goal explained.
- [x] Function Documentation: for all exported functions in R help

  The function documentation is complete and detailed with examples. Some of the examples are not run (by roxygen). Does devtools::check() evaluate these examples?
- [p] Examples for all exported functions in R Help that run successfully locally

  As mentioned, some of the examples are not run (i.e. '\dontrun'), and I have not searched for and tested all 'not run' examples. I ran checks manually for some of the functions, without errors: drake_config(), drake_plan(), make(), outdated().
- [x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports`, and `Maintainer` (which may be autogenerated via `Authors@R`).
For packages co-submitting to JOSS
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a `paper.md` matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 5
The documentation makes several statements that this package supports "reproducibility", and even mentions that the reproducibility supported is different than scientific replicability. The meaning of this term as it relates to the package functionality is not clear to me. In my experience at NCEAS this term generally means that source data, processing artifacts and documentation of a workflow are persisted and available so that they can be obtained by a researcher who could replicate any processing products on their own, without any guidance from the original researcher. Does 'drake' somehow assist in this type of reproducibility?
An alternative use of reproducibility might be for an individual researcher (or team) to always expect consistent results even when any of their source data or processing steps have been modified. Is this the claim that the documentation is making? If a processing workflow is configured in drake, will consistent results always be generated? If that is the case, then how is this more reliable than running all scripts manually, even though this may be inefficient? I'm not trying to play devil's advocate here, just trying to clarify and understand the statements made in the documentation. In working with environmental scientists and ecologists for several years, I have found that the definition of reproducibility is fairly specific, so elaborating on the use of this term for the package may bring more clarity.
@wlandau-lilly the example descriptions in drake.Rmd and README look great.
Thanks, Peter! I hope to have a proper response to your review soon.
I sincerely appreciate your timely and thorough review! I found your feedback extremely helpful, and I have responded to specific comments below.
There were quite a few 'suggested' packages
Yes, most of these optional packages support the vignettes, examples, custom caches, and advanced high-performance computing options.
The link in the 'drake' vignette to the 'quickstart' points to the documentation website, although the equivalent 'quickstart' vignette is included with the R package.
I have tweaked the beginning of the drake.Rmd vignette so it links to both the quickstart vignette source and the rendered online version.
The 'basic example' (i.e. from 'load_basic_example') is used throughout the documentation, so the basic setup of a 'template' workflow is not described fully in the vignettes or R command documentation.
I know it is buried very deep, but the example from `load_basic_example()` is actually paired with its own saved workflow. The code files are located here, referenced here, and generated by `drake_example("basic")`. The R script sets up the code and carries out a heavily-commented tutorial that walks through some of `drake`'s main features. I do agree that I should make all this more available, so I have added extra references to the vignettes and examples. Now, wherever `load_basic_example()` is called, there is a comment that says `# Get the code with drake_example("basic")`.
There are also other examples available that are more detailed, and these can be listed with 'drake_examples()'. It would be very useful to provide a list of these examples in this vignette with a very brief description of each. I found these additional examples very helpful (more helpful than the basic example) as they used actual data with a processing goal explained.
There is a listing of examples here, but again, it is buried and far from obvious. I have replicated this list in a new section of the drake.Rmd vignette, and the documentation section of the README now describes the beginner-oriented examples.
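For reference, discovering and fetching the examples is a two-step affair (a minimal sketch using the functions named above):

```r
library(drake)
drake_examples()        # list the names of all built-in examples
drake_example("basic")  # write the files of the basic example to a folder in the working directory
```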
The function documentation is complete and detailed with examples. Some of the examples are not run (by roxygen). Does devtools::check() evaluate these examples?
I chose to suppress many of the examples because some are time-consuming and others require the optional packages. And yes, `devtools::check()` runs the examples, and it skips the code enclosed in `\dontrun{}`. You can run all the code, even the code inside `\dontrun{}`, with the `run` argument of `devtools::run_examples()`. I routinely check `devtools::run_examples(run = TRUE)` and `devtools::run_examples(run = FALSE)`.
The parallel execution functionality appears to be powerful and is clearly explained, but I have not run tests to confirm processing time gains when using the parallel backends vs serial processing. If the parallel execution functionality is important to rOpenSci for inclusion of this package, then it may be useful to include a long running example for use as a benchmark with execution times shown for each backend and for serial processing. I'm not suggesting that this is necessary, but this could be suggested by the editor if deemed a requirement.
You bring up an important point, and I do plan to do serious performance studies in the future. @maelle, do you think we need a long-running example for rOpenSci?
The documentation makes several statements that this package supports "reproducibility", and even mentions that the reproducibility supported is different than scientific replicability. The meaning of this term as it relates to the package functionality is not clear to me. In my experience at NCEAS this term generally means that source data, processing artifacts and documentation of a workflow are persisted and available so that they can be obtained by a researcher who could replicate any processing products on their own, without any guidance from the original researcher. Does 'drake' somehow assist in this type of reproducibility?
I am so glad you brought this up. Yes, the data need to be available and the documentation is important. But without a tool like `drake`, we don't know if a project is actually reproducible until we try to recompute the results from the beginning. That usually takes a ton of time, effort, and digging. On the other hand, if `outdated()` says that all the targets are up to date, we get tangible evidence for free. We know someone else could reproduce the results from the inputs because `drake` says they are synced. We know that the code and data have not changed meaningfully since the last time the results were generated. This quick automated check is extremely valuable when the alternative is so painstaking and difficult.
Other reproducibility advantages:

- `vis_drake_graph()` provides big-picture guidance as to how the pieces of the analysis fit together.
- Targets stay in sync with the underlying code and data every time you run `make()`.
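A minimal sketch of that kind of check on the built-in basic example (function names from drake around version 5.0.0; exact signatures may differ in other releases):

```r
library(drake)
load_basic_example()             # Get the code with drake_example("basic")
config <- drake_config(my_plan)  # my_plan comes from load_basic_example()
outdated(config)                 # character(0) would mean everything is up to date
vis_drake_graph(config)          # interactive dependency graph of the workflow
```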
An alternative use of reproducibility might be for an individual researcher (or team) to always expect consistent results even when any of their source data or processing steps have been modified. Is this the claim that the documentation is making?
Not exactly. `Drake` is about detecting when the data/processing has been modified, when it has not, and what to do in each eventuality.
If a processing workflow is configured in drake, will consistent results always be generated? If that is the case, then how is this more reliable than running all scripts manually, even though this may be inefficient?
For large projects, running all the scripts manually is difficult enough for `drake` to make a huge practical difference. I strongly believe that checking internal consistency early and often is super important for reproducibility.

You brought up another kind of consistency that I find interesting. If we repeat the analysis from scratch a large number of times, will we always get the same final answer? For stochastic methods like MCMC, this is another important consideration for reproducibility. However, it is outside the scope of `drake`. `Drake` does make some effort to preserve random seeds (see wlandau-lilly/drake#56 and wlandau-lilly/drake#109), but that hardly scratches the surface of the issue.
My $0.02 comment on:
"... include a long running example for use as a benchmark with execution times shown for each backend and for serial processing ..."
It'll be impossible to find a one-size-fits-all use case that will be representative and that can be used to produce benchmark statistics that serve as guidance for a random user. Each user/project will have its own use case/pipeline to benchmark - I think that's what should be benchmarked. And after the first round of parallelization comes tweaking/optimization of how to best structure the parallelization of your analysis - and there are often multiple ways. Factors such as the amount of RAM, the size of the files processed, other tenants running on the same system at the same time, and so on, also come into play.

So, I would be careful making benchmark claims for large workflows, especially since such "results" tend to stick in the community as the ultimate facts/truth of a particular tool. It's easier to benchmark a very specific problem, but Drake is in the business of being a generic workflow manager with automatic parallelization as a feature.
Your comments are gratifying, Henrik. I still plan to study performance and optimization at some point, but for right now, you are making my job seem less daunting.
When it comes to parallel computing, I claim that `drake` processes and caches targets in parallel (in stages) and in the correct order. I also claim that the work deploys to the user's parallel backend of choice. With the function `in_progress()` (or equivalently, `progress()` or `vis_drake_graph()`) and your system's job monitoring software, I think this should all be straightforward to verify for one of the built-in examples if you think there is still a need. The basic example is probably easiest because of `load_basic_example()`. You may need to artificially slow down the workflow, maybe by appending `; Sys.sleep(60)` to all the commands or making the input datasets large enough.
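A rough sketch of that kind of spot check on the basic example (the `jobs` value below is just an illustration, and the argument names reflect drake around version 5.0.0):

```r
library(drake)
load_basic_example()   # Get the code with drake_example("basic")

# Deploy up to 4 targets at a time within each parallelizable stage.
make(my_plan, jobs = 4, parallelism = "mclapply")

# From another R session, while make() is still running:
in_progress()          # targets currently being built
progress()             # build status of every target so far
```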
`Drake`'s contribution is figuring out the parallel arrangement of targets automatically. For example, a machine learning expert need not worry that the neural nets depend on the training data. `Drake` knows that the data must be processed first, and it knows which neural nets can be computed at the same time. Beyond that, it borrows its power from other tools. The `storr`, `parallel`, `future`, `future.apply`, `future.batchtools`, and `batchtools` packages do the heavy lifting.
Late last year, @kendonB heavily used `drake` to deploy a massive project to his organization's SLURM cluster (via `make(..., parallelism = "future_lapply")`). He uncovered a huge batch of performance issues, and `drake` is a much better package because of his on-target feedback. I fully expect more bottlenecks to arise as more power users use the package in earnest.
@gothub, I realized that I did not adequately address your point about the lack of a template workflow. It is difficult to have a template when `drake` focuses on a project's R session rather than its file system. You need a workflow plan data frame with targets and commands, plus any supporting functions and files, but otherwise, there is a lot of flexibility. The built-in examples are probably the best reference.
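For a sense of scale, a minimal session-oriented project might look something like this sketch (hypothetical function and target names, not taken from the built-in examples):

```r
library(drake)

# Supporting functions, defined in the R session (or source()'d from a script).
munge     <- function(raw) raw[complete.cases(raw), ]
fit_model <- function(dat) lm(Sepal.Width ~ Petal.Width, data = dat)

# The workflow plan: a data frame of targets and commands.
my_plan <- drake_plan(
  dat   = munge(iris),
  model = fit_model(dat),
  coefs = coef(model)
)

make(my_plan)  # builds dat, then model, then coefs, in dependency order
```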
Also, I rephrased my justification of reproducibility. I struggle to explain it because it is a nontraditional interpretation of the idea, at least in the statistics and broader scientific communities.
Yes, the data need to be available and the documentation is important. But without a tool like drake, we don't know if a project is actually reproducible until we try to recompute the results from the beginning. That usually takes a ton of time, effort, and digging. On the other hand, if outdated() says that all the targets are up to date, we get tangible evidence for free. We know someone else could reproduce the results from the inputs because drake says they are synced. We know that the code and data have not changed meaningfully since the last time the results were generated. This quick automated check is extremely valuable when the alternative is so painstaking and difficult.
Thanks a lot @gothub!
Rich discussion here! @wlandau-lilly please make sure to insert all clarifications from this thread into the documentation 😉
When you're done, update this thread with a summary of answers to reviewers (a short one per review) in order to make it easier for them to sign off on your changes even if they have little time. ☺ Good job on using their comments until now!
@maelle, in the above two commits, I have inserted the clarifications about reproducibility and high-performance computing in the documentation. From re-reading the thread, I think these are the major clarifications. Are you thinking of any more?
Also, I realized that I totally missed the point of one of your earlier questions.
Very naive question, does each command need to be something as basic as summary or could it be sourcing a larger script containing several regression calls?
Commands can be arbitrary code chunks. For example:
```r
my_plan <- drake_plan(
  a = 1 + 1,
  b = {
    x <- pi + a
    y <- sqrt(x)
    rand <- rnorm(10, sd = y)
    mean(rand)
  },
  c = a - 5,
  d = c(b, c)
)
```
However, a command should NOT just `source('large_command.R')`. `Drake` is function-oriented, not file-oriented. I am beginning to see widespread confusion here (for example, wlandau-lilly/drake#193), and I am doing my best to fix it. I have a new best practices guide that spells out the pitfalls, and I plan to add more over the course of future development.
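To spell out the pitfall with a small sketch (hypothetical file, function, and object names; `my_data` is assumed to already exist in the session): instead of sourcing a monolithic script inside a command, wrap the work in a function and call that function as the command.

```r
# Discouraged: the command hides all of its real work inside one big script,
# so drake cannot break the analysis into separately tracked targets.
bad_plan <- drake_plan(results = source("analysis.R"))

# Preferred: put the work in a function so drake can track it and its inputs.
run_analysis <- function(dat) summary(lm(y ~ x, data = dat))
good_plan <- drake_plan(results = run_analysis(my_data))
```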
Good point about summarizing my responses to reviewers. This is a massive thread.
`Drake` is large and difficult to review, and I am grateful for all your hard work. `Drake` is a much better package because of your feedback.

- Improved the documentation of `drake`, particularly `load_basic_example()`, the quickstart vignette, and the drake.Rmd vignette. The documentation changed, but the code behind it needed only the tiniest modifications.
- Compared `drake` to similar work: GNU Make, remake, memoise, and knitr.
- Explained why `drake` surpasses these tools for its intended use case.
- Explained how users can adopt `drake` for their own projects.
- Explained why `drake` surpasses Make.
- Postponed a formal performance study of `drake`. I think it is still too early.
- Clarified `drake`'s high-performance computing claims here.
- Removed the recommendation to install `drake` with `build = TRUE` from `install_github()` so the "Suggests:" packages such as `Ecdat` are not required.
- Explained how `drake` conquers the Sisyphean loop of naive data science.
- Added the comment `# Get the code with drake_example("basic")` wherever `load_basic_example()` is called in the documentation. That way, users know how to obtain the underlying code.
- Listed the examples from `drake_examples()` here in the README, here in the drake.Rmd vignette, and here in the Get Started page.
- Clarified what `drake` is claiming about its parallel computing functionality: https://github.com/wlandau-lilly/drake/commit/020f4fb7e452765c9e138e07e5f64a28aa83711e. See here in the parallelism vignette and here in the cautionary vignette. Please also see this comment by @HenrikBengtsson.
- Clarified `drake`'s role when it comes to reproducibility.
- Explained how `drake` helps with reproducibility.
- With the `run` argument to `devtools::run_examples()`, you can override the `\dontrun{}` command and execute all the code in the examples. I routinely run `devtools::run_examples(run = TRUE)` and `devtools::run_examples(run = FALSE)`.

@jules32 @benmarwick @gothub are you happy with the changes made?
Thanks @wlandau-lilly for the summary, and great on having used all feedback! 😸
Yes, thanks, I'm happy. You've done lots of work to address the concerns in my review, that's excellent to see. The package is much more accessible now, and it's easier to see how to get started using it.
Thank you, @benmarwick! I am so glad you think the changes are making a difference.
By the way, I just updated the review summaries in light of wlandau-lilly/drake#195.
All of the points I mentioned have been fully addressed with clear and complete documentation, and I have no other issues, so thumbs up from me.
I have one remaining question that is not really an issue for the review, but I'm just curious about. If I were to develop a workflow with drake, what drake-specific artifacts would need to be preserved for a researcher to reproduce my analysis (other than the R scripts and data that may have existed before even starting to build the drake workflow)? I'm assuming that it's simply a CSV from the workflow itself, e.g. 'my_plan.csv' from the basic example. If more than that is required, then explaining that in the docs or having an export function would be a good idea.
Yes, I'm happy too. You've addressed the ideas I had and think that this is good to continue moving forward. Nice work!
Thank you so much, @gothub and @jules32! It is exciting to have confirmation from all three reviewers!
@gothub, I really like your question. I will try to break down the answer.
As you say, we need any code that writes functions, generates workflow plans, loads packages, and calls `make()`.

In the way I use `drake` (and the way I picture others using it), the workflow plan does not actually need to be saved as a csv file. Typically, it is super quick to programmatically generate in an R script before `make()` is called. I usually keep a `plan.R` file that uses `drake_plan()`, `plan_analyses()`, `plan_summaries()`, `expand_plan()`, and `evaluate_plan()` to generate the plan. The advantage here is that you can typically spin up thousands of targets with as few as 10-20 lines of human-readable code. (With `remake` or Make, you would have to write it all out by hand, which is often intractable.) Then, I `source("plan.R")` before calling `make()`.
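A rough illustration of that pattern (the `simulate()` and `my_model()` functions are hypothetical, and the wildcard conventions reflect drake around version 5.0.0):

```r
# plan.R: programmatically generate the workflow plan before calling make().
library(drake)

# One template command, expanded over a wildcard into several dataset targets.
datasets <- drake_plan(data = simulate(n = size__))
datasets <- evaluate_plan(datasets, wildcard = "size__", values = c(16, 64, 256))

# One analysis of each dataset, via the ..dataset.. wildcard of plan_analyses().
methods  <- drake_plan(fit = my_model(..dataset..))
analyses <- plan_analyses(methods, datasets = datasets)

my_plan <- rbind(datasets, analyses)  # a data frame of targets and commands
```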
Supporting input files and imported functions/objects need to be preserved too. Examples include optional `report.Rmd` files, other knitr reports, user-defined CSV data files, and functions created in the R scripts. It's always a good idea to check for missing input files and imported functions/objects. You could either use the `missed()` function, or you could use `vis_drake_graph()` and look for the purple nodes, which indicate missing objects or files.
Packages are trickier. Maintaining a reproducible package environment is beyond the scope of `drake`, and the documentation recommends packrat for this purpose (mentioned in the quickstart and caution vignettes). But if you do not use `packrat`, there is still hope. By default, each `make()` saves the `sessionInfo()` to the cache (use the `session_info` argument to deactivate). An independent researcher just needs to call `drake_session()` to learn about the required versions of R, packages, and other system requirements.
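A minimal sketch of that record-keeping, reusing `my_plan` from the basic example (argument and function names as described above; they may differ in other drake versions):

```r
make(my_plan)                        # stores sessionInfo() in the cache by default
make(my_plan, session_info = FALSE)  # skip storing the session information
drake_session()                      # recover the R and package versions of the last make()
```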
Including the `drake` cache (usually a hidden `.drake/` folder) lets independent researchers use `drake_session()` to learn about the package environment, and it allows them to call `make()` to check that everything is up to date (and thus re-creatable). This can be tricky: by design, the `drake` cache (which is just a `storr::storr_rds()`) divides its data over lots of tiny files, which is essential to avoid race conditions in parallel computing. However, this setup behaves poorly under version control, and it may even make it a bit annoying to use a shared file system like Dropbox or Google Drive.
For the version control bit, @noamross recently suggested that `drake` supply a flat text log file to give a list of targets and their fingerprints. That way, the git commit history of the log tracks changes to the targets over time. Right now, `drake` supports `make(..., cache_log_file = "my_log_file.txt")`, `make(..., cache_log_file = TRUE)`, and `drake_cache_log_file()`.
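A sketch of how that might fit into a version-controlled project (the log file name is just an example):

```r
library(drake)
load_basic_example()
make(my_plan, cache_log_file = "drake_cache.log")  # flat text log of targets and fingerprints
# Commit drake_cache.log alongside the code; its diffs show which targets changed.
```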
@benmarwick, @jules32, @gothub, and @maelle, would any of you like to be listed in the DESCRIPTION as reviewers (`person(..., role = "rev")`)?
Editors shouldn't be listed :-) but thanks for asking!
@wlandau-lilly Thanks for the explanation of `drake` persistence.
Yes, please include me as a listed reviewer of the package, and thanks!
Done, thanks.
By the way, an Appveyor build failed just now because `listenv` is in the middle of an update on CRAN.
```
There are binary versions available (and will be installed) but the
source versions are later:
        binary source
listenv  0.6.0  0.7.0
tibble   1.4.1  1.4.2
Error in download.file(url, destfile, method, mode = "wb", ...) :
  cannot open URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/listenv_0.6.0.zip'
```
In my experience, this kind of error just blows over if I wait a bit. Would that be okay? I ask because my sense is that `drake` is about to be approved, so the timing is not ideal.
No problem, I'll test it on Windows later this week when I do the last checks!
Approved! Thanks everyone for a productive and (I hope) enjoyable review process!
To-dos
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). Let me know if you are interested, and I'll ping the rOpenSci community manager.
I forgot to mention that R CMD check passed without any issue on my laptop on Windows. You might just need to justify having "rev" roles in cran-comments.md (but then it gets accepted, because that's a valid use of the role).
Wow, @maelle! Thank you so much! This has been an extremely rewarding and gratifying review process, and I am super grateful for your hard work!
One note before I begin the transition: now that drake is moving to rOpenSci, I want to take this opportunity to shift my open source work to my personal account, @wlandau. My colleagues and I agree this is for the best. Would you invite @wlandau to rOpenSci?
Done!
Wonderful! I transferred the repo, which was super exciting. One small thing: I do not have access to the "Settings" tab, and I cannot change the link to the documentation website at the top. Am I just over-eager, or is there another step?
I think I found the problem, and it is probably a common issue. @maelle, would you add me as a maintainer at https://github.com/orgs/ropensci/teams/drake/members?
Done! 😸
Fantastic! Now, all the links reference rOpenSci rather than @wlandau-lilly. Before I activate Zenodo and submit to JOSS, I will make my final preparations for the release of 5.0.0 (soon to go to CRAN).
I have prepared a narrative-form blog post, and I am excited to share it. Is there somewhere I can submit a PR?
Also, could I append an rOpenSci footer to the README?
Pinging @stefaniebutland
Also, before I create a release for Zenodo, I think I should confirm with @jules32 and @benmarwick about being listed as reviewers in the package itself.
Hi All,
This is exciting to see happen! I am fine not being listed as a reviewer; I hope that my suggestions were helpful but, as they were minimal, I don't think I need to be included in the DESCRIPTION.
But thank you for asking!
Thanks for letting me know, Julie. Your suggestions absolutely helped, and I am glad you were part of this. Please let me know if you change your mind.
@jules32 as an editor I found your review very good and will most certainly try and recruit you again!
Hello @wlandau-lilly. I'm rOpenSci's Community Manager.
I have prepared a narrative-form blog post, and I am excited to share it. Is there somewhere I can submit a PR?

We'd love to have a post about this. Please submit a PR according to: https://github.com/ropensci/roweb2#contributing-a-blog-post. I will review it, provide some feedback and a suggested date to publish it.
Thank you!
Thanks, @stefaniebutland! I just submitted ropensci/roweb2#140. Really looking forward to the next steps.
@maelle, I have added @benmarwick as a reviewer in the DESCRIPTION (permission granted via email), and I have completed the 5 to-dos from here. I also submitted `drake` 5.0.0 to CRAN, added an rOpenSci footer to the README, and volunteered as a reviewer. This is all super exciting!
Besides the separate processes for the blog entry and the JOSS paper, there are a couple minor loose ends that will probably resolve themselves.
[![Build status](https://ci.appveyor.com/api/projects/status/4ypc9xnmqt70j94e?svg=true)](https://ci.appveyor.com/project/ropensci/drake)
Thanks, @maelle. The AppVeyor builds appear to be working. I will check back on the badge in a few days.
Another minor thing: I am getting emails like this one on every commit:
```
Build ropensci/drake@810b97d5ba successful: https://ropensci.ocpu.io/drake/
NEW COMMITS: [810b97d5ba] Time: 2018-01-26 08:39:16 ...
```
Is there a way to attenuate them? Are the other watchers receiving these emails, or just the maintainers? I do not see an OpenCPU webhook for drake, so I am guessing it is at the org level.
Those emails are from the CI system; they are sent to the pusher only. You can ignore/filter them.
Good enough for me, thanks @jeroen.
Summary
The drake package is an R-focused pipeline toolkit. It reproducibly brings results up to date and automatically arranges computations into successive parallelizable stages. It has a Tidyverse-friendly front-end, powerful interactive visuals, and a vast arsenal of multicore and distributed computing backends.
URL: https://github.com/wlandau-lilly/drake
Fit: drake falls easily within reproducibility and high-performance computing.
Target audience: anyone who uses R for medium-to-long computations for which the results need to stay up to date with the dependencies.
Similar work
Remake
Drake overlaps with its direct predecessor, remake. In fact, drake owes its core ideas to remake and @richfitz, and explicit acknowledgements are in the documentation. However, drake surpasses remake in several important ways, including but not limited to the following.
Built-in examples are available through `drake::example_drake()`.

Factual's drake
Factual's drake is similar in concept, but the development effort is completely unrelated to the R package of the same name.
Other pipeline toolkits
There are many other successful pipeline toolkits, and the drake package distinguishes itself with its R-focused approach, Tidyverse-friendly interface, and parallel computing flexibility.
Requirements
Confirm each of the following by checking the box. This package:
Publication options
The package contains a `paper.md` matching JOSS's requirements with a high-level description in the package root or in `inst/`.

I plan to submit to JOSS in the future, but the manuscript is not currently ready.
Detail
Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings: