ldecicco-USGS opened this issue 6 years ago
I would be really stoked to have help on this! Remote/collaborative storage has been a major sticking point (see https://github.com/ropensci/drake/issues/198, https://github.com/richfitz/storr/issues/55, and especially https://github.com/richfitz/storr/issues/61). I looked at `scipiper`, and I think a package like that could work for `drake`. I also think we might consider options that do not require changing the workflow plan. Ideally, we should be able to collaborate on Google Drive without adding targets or changing their commands.
A bit of background: `drake` uses @richfitz's `storr` package for caching, usually in the local file system. By default, `make()` creates a `storr_rds()` cache in a hidden `.drake/` directory to store the targets. You can use other `storr` caches such as `storr_environment()` and `storr_dbi()`, but these are not thread safe (e.g. for `make(jobs = 8)`), and the `DBI` option requires a database connection that does not carry over to remote jobs (e.g. `make(parallelism = "future")` on HPC clusters). The guide to customized storage has more details.
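For instance, here is a minimal sketch of supplying a custom `storr` cache to `make()`; the plan is a placeholder and the shared-drive path is an assumption for illustration:

```r
library(drake)
library(storr)
plan <- drake_plan(data = mtcars, model = lm(mpg ~ wt, data = data))

# In-memory cache: fast, but not thread safe and gone when the R session ends.
make(plan, cache = storr_environment())

# Or an RDS cache at a custom location, e.g. a mounted shared drive.
make(plan, cache = storr_rds("/mnt/shared/project/.drake", mangle_key = TRUE))
```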
As I understand it, the current practice for sharing results is to upload everything and hope that the files in the `.drake/` cache do not get corrupted along the way. Services like Dropbox create disruptive backup files like the ones @kendonB mentioned here. For `drake`'s default cache (`storr_rds(mangle_key = TRUE)`), it should be straightforward to clear out those extra files (https://github.com/richfitz/storr/issues/55#issuecomment-374230112). I just have not gotten around to building this into `rescue_cache()`. @richfitz mentioned that `storr` might take care of some of the cleaning too (https://github.com/richfitz/storr/issues/55#issuecomment-367837646).
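Until then, a rough manual cleanup might look like this, assuming Dropbox's usual "conflicted copy" file naming (other services use different patterns):

```r
# Delete Dropbox conflict artifacts that crept into the .drake/ cache.
conflicts <- list.files(
  ".drake",
  pattern = "conflicted copy",
  recursive = TRUE,
  full.names = TRUE
)
unlink(conflicts)
```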
At this point, I should touch on some related ideas and features that might help.
The `make()` function has a `hook` argument, which you can use to wrap custom code around the code that actually processes each target:
```r
custom_hook <- function(code){
  force(code) # Build and store the target.
  sync_with_google_drive() # Hypothetical upload step you would write yourself.
}
make(your_plan, hook = custom_hook)
```
But I have not actually used this feature very much. To be honest, I designed it as a way to silence/redirect output, so we would need to do some internal refactoring on `drake` itself.
I think it would be fantastic if `storr` supported an RDS-like driver powered by `googledrive`:
```r
library(drake)
library(storr)
plan <- drake_plan(...)
# storr_googledrive() does not exist yet; this is the interface I am imagining.
cache <- storr_googledrive("https://drive.google.com/drive/my-drive/stuff/.drake")
make(plan, cache = cache)
```
But from https://github.com/richfitz/storr/issues/61, that may be asking too much.
As for communication, @noamross had the bright idea of writing a log file to fingerprint the cache. If you commit it to GitHub, the changelog will show the targets that changed on each commit.
```r
library(drake)
load_basic_example()
make(my_plan, cache_log_file = "log.txt", verbose = FALSE)
head(read.table("log.txt", header = TRUE))
#>               hash   type                   name
#> 1 de0922cd962af6e2 target coef_regression1_large
#> 2 331afc4b2b42b57f target coef_regression1_small
#> 3 def5102800992696 target coef_regression2_large
#> 4 2e52655d4d9ddb47 target coef_regression2_small
#> 5 478684feec29a859 import             data.frame
#> 6 e3bf796806094874 import                   knit
```
Update: if non-custom `drake` caches get corrupted when you upload them to Google Drive, Dropbox, or the like, you can try to fix the problem with `drake_gc()` (development `drake` only), as sketched below. Related: Brainstorm/create tools/create blog on the best way to incorporate caching with `drake`.
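A minimal sketch of that repair step, assuming the cache lives in the default location:

```r
library(drake)
rescue_cache() # try to repair a cache that throws storr-related errors
drake_gc()     # then garbage-collect stray files (development drake only)
```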
I would like to add another approach to this thread. By default, `drake` uses a `storr_rds()` cache (from `storr`) because it is fast and thread safe. What makes this format difficult for collaboration is the swarm of RDS files that need to be uploaded. But you can supply your own `storr_dbi()` cache to the `cache` argument of `make()`. It even works with `make(parallelism = "clustermq_staged")` and `make(parallelism = "future", caching = "master")`. The advantage here is that you only have a single database file to upload and sync.
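Here is a minimal sketch with RSQLite; the table names, file name, and plan are arbitrary placeholders:

```r
library(drake)
library(storr)
library(DBI)

# One SQLite file holds the entire cache.
con <- dbConnect(RSQLite::SQLite(), "drake-cache.sqlite")
cache <- storr_dbi(tbl_data = "data", tbl_keys = "keys", con = con)

plan <- drake_plan(data = mtcars, model = lm(mpg ~ wt, data = data))
make(plan, cache = cache)
```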
@ldecicco-USGS, FYI: I just rewrote the `drake` manual's chapter on caches. I think the new revision more clearly explains `drake`'s existing cache system. In addition, the section on database caches shows one way to work around some portability issues. With some minor parallel computing caveats, it is straightforward to use a single SQLite database file as a `drake` cache. See also https://github.com/wlandau/drake-examples/tree/master/dbi and `drake::drake_example("dbi")`.
Theoretically, it should also be possible to convert an existing `drake` cache to and from a single SQLite database file. For now, however, this process seems to invalidate targets: https://github.com/richfitz/storr/issues/93. cc @richfitz
This is great! Thanks for writing up such a nice explanation.
You are welcome! Please let me know if you think of more ways to ease collaboration on `drake` projects.
Data scientists are experts at mining large volumes of data to produce insights, predict outcomes, and/or create visuals quickly and methodically.
`drake` (https://github.com/ropensci/drake) has solved a lot of problems in the data science pipeline, but one thing we still struggle with is how to collaborate effectively on a large-scale project without each contributor needing to run the whole workflow, and without splitting the work into many disjointed smaller workflows. In some large-scale projects, that is just not feasible. It would be awesome if a wide community of R developers could come together and try to create a collaborative caching feature for `drake`.

My group set up a wrapper package for `remake` (`drake`'s predecessor) that pushes tiny indicator files up to GitHub. These indicator files let a user know that a target is complete and its data has been pushed up to some common caching location. The next user pulls the indicator file down from the upstream repository and then does not need to re-run a target that some other collaborator has already run; instead, they can pull the data down (if it is needed) rather than create it from the workflow. It got a bit awkward because we needed 2-3 `remake` targets to accomplish this, and that tripped up our "non-power-user" collaborators.

I'd propose that the first step be to develop a caching workflow for Google Drive (using the `googledrive` package); see the rough sketch at the end of this post. Once the process is fleshed out with Google Drive, it could be expanded more easily to other data storage options (AWS via the `aws.s3` package, for example).

My gut says this might need to be a wrapper or companion package for `drake` (to keep the package dependencies to a minimum), but I am not sure. @wlandau and other `drake` experts: I would looove to hear any feedback you have on this idea. If in fact this issue is not-an-issue (i.e. `drake` can already handle caching and I just missed it... totally possible...), then we could morph this issue into a group effort to create more content for a `drake` blogdown/bookdown book!
The wrapper package for `remake` is here: https://github.com/USGS-R/scipiper, and … is another `drake`-based project.
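To make the proposal concrete, here is a rough sketch of the indicator-file idea built on `googledrive`. `drive_upload()` and `drive_download()` are real `googledrive` functions, but the folder name, file layout, and helper functions are hypothetical placeholders, not a finished design:

```r
library(googledrive)

# After building a target, push its data to a shared Drive folder and write a
# tiny indicator file recording the upload. Collaborators commit the indicator
# file to git instead of the data itself.
share_target <- function(target_file, drive_folder = "drake-cache") {
  drive_upload(target_file, path = drive_folder)
  ind <- paste0(target_file, ".ind")
  writeLines(unname(tools::md5sum(target_file)), ind)
  ind
}

# If the indicator file is present but the data is not, pull the data down
# from Drive instead of rebuilding it.
fetch_target <- function(target_file, drive_folder = "drake-cache") {
  if (!file.exists(target_file)) {
    drive_download(file.path(drive_folder, target_file), path = target_file)
  }
  target_file
}
```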