ropensci / unconf18

http://unconf18.ropensci.org/

Caching for drake #30

Open ldecicco-USGS opened 6 years ago

ldecicco-USGS commented 6 years ago

Data scientists are experts at mining large volumes of data to produce insights, predict outcomes, and/or create visuals quickly and methodically. drake (https://github.com/ropensci/drake) has solved a lot of problems in the data science pipeline, but one thing we still struggle with is how to collaborate effectively on a large-scale project without each contributor needing to run the entire workflow, and without splitting the work into many disjointed smaller workflows. In some large-scale projects, that is just not feasible.

It would be awesome if a wide community of R developers could come together and try to create a way for drake to have a collaborative caching feature.

My group set up a wrapper package for remake (drake's predecessor) that pushes tiny indicator files up to GitHub. These indicator files tell collaborators that a target is complete and that its data has been pushed to a common caching location. The next user pulls the indicator file down from the upstream GitHub repo, and instead of re-running a target that another collaborator has already built, they pull the data down (if it's needed) rather than recreating it from the workflow. It got a bit awkward because we needed 2-3 remake targets to accomplish this, and that tripped up our "non-power-user" collaborators.

I'd propose that the first step be to develop a caching workflow for Google Drive (using the googledrive package). Once the process is fleshed out with Google Drive, it could be expanded more easily to other data storage options (for example, AWS via the aws.s3 package).
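To make that concrete, here is a rough sketch of the kind of push/pull step I have in mind, using googledrive's drive_upload() and drive_download(). The "drake-cache" Drive folder and the local file paths are made up for illustration.

library(googledrive)

# One collaborator pushes a finished target's data to a shared Drive folder.
drive_upload(
  media = "output/model_fit.rds",  # file produced by that collaborator's run (made-up path)
  path  = "drake-cache/",          # trailing slash: upload into this Drive folder
  name  = "model_fit.rds",
  overwrite = TRUE
)

# Another collaborator pulls the result down instead of rebuilding the target.
drive_download(
  file = "drake-cache/model_fit.rds",
  path = "output/model_fit.rds",
  overwrite = TRUE
)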

My gut says this might need to be a wrapper or companion package for drake (to keep its dependencies to a minimum), but I'm not sure. @wlandau and other drake experts: I would looove to hear any feedback you have on this idea. If in fact this issue is not an issue (i.e., drake can already handle caching and I just missed it... totally possible...), then we could morph this issue into a group that helps create more content for a drake blogdown/bookdown book!

The wrapper package for remake is here: https://github.com/USGS-R/scipiper

#12 is another drake-based project.

wlandau commented 6 years ago

I would be really stoked to have help on this! Remote/collaborative storage has been a major sticking point (see https://github.com/ropensci/drake/issues/198, https://github.com/richfitz/storr/issues/55, and especially https://github.com/richfitz/storr/issues/61). I looked at scipiper, and I think a package like that could work for drake. I also think we might consider options that do not require changing the workflow plan. Ideally, we should be able to collaborate on Google Drive without adding targets or changing their commands.

A bit of background: drake uses @richfitz's storr package for caching, usually in the local file system. By default, make() creates a storr_rds() cache in a hidden .drake/ directory to store the targets. You can use other storr caches such as storr_environment() and storr_dbi(), but these are not thread safe (e.g. for make(jobs = 8)), and the DBI option requires a database connection that does not carry over to remote jobs (e.g. make(parallelism = "future") on HPC clusters). The guide to customized storage has more details.
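For instance, here is a minimal sketch of pointing make() at an explicitly created storr_rds() cache instead of the hidden .drake/ default (the toy plan and the cache path are just for illustration):

library(drake)
library(storr)

plan <- drake_plan(
  raw_data = mtcars,
  model = lm(mpg ~ wt, data = raw_data)
)

# Same storage format as the default .drake/ cache, just at a path you choose.
cache <- storr_rds("my_cache", mangle_key = TRUE)
make(plan, cache = cache)

readd(model, cache = cache) # retrieve a target from that cache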

As I understand it, the current practice for sharing results is to upload everything and hope that the files in the .drake/ cache do not get corrupted along the way. Services like Dropbox create disruptive backup files like the ones @kendonB mentioned here. For drake's default cache (storr_rds(mangle_key = TRUE)), it should be straightforward to clear out those extra files (https://github.com/richfitz/storr/issues/55#issuecomment-374230112). I just have not gotten around to building this into rescue_cache(). @richfitz mentioned that storr might take care of some of the cleaning too (https://github.com/richfitz/storr/issues/55#issuecomment-367837646).

At this point, I think I should touch on some similar ideas / features that might help.

drake hooks

The make() function has a hook argument, and you can use it to wrap things around the code that actually processes the target.

custom_hook <- function(code){
  force(code) # Build and store the target.
  sync_with_google_drive() # Placeholder for a user-defined upload step.
}
make(your_plan, hook = custom_hook)

But I have not actually used this feature very much. To be honest, I designed it as a way to silence/redirect output, so we would need to do some internal refactoring on drake itself.

A googledrive driver for storr?

I think it would be fantastic if storr supported an RDS-like driver powered by googledrive.

library(drake)
library(storr)
plan <- drake_plan(...)
# storr_googledrive() does not exist yet; this is the imagined interface.
cache <- storr_googledrive("https://drive.google.com/drive/my-drive/stuff/.drake")
make(plan, cache = cache)

But from https://github.com/richfitz/storr/issues/61, that may be asking too much.

Target logs: fingerprinting your pipeline

As for communication, @noamross had the bright idea of writing a log file to fingerprint the cache. If you commit it to GitHub, the changelog will show the targets that changed on each commit.

library(drake)
load_basic_example()
make(my_plan, cache_log_file = "log.txt", verbose = FALSE)
head(read.table("log.txt", header = TRUE))
#>               hash   type                   name
#> 1 de0922cd962af6e2 target coef_regression1_large
#> 2 331afc4b2b42b57f target coef_regression1_small
#> 3 def5102800992696 target coef_regression2_large
#> 4 2e52655d4d9ddb47 target coef_regression2_small
#> 5 478684feec29a859 import             data.frame
#> 6 e3bf796806094874 import                   knit
wlandau commented 6 years ago

Update: if non-custom drake caches get corrupted when you upload them to Google Drive, Dropbox, or the like, you can try to fix the problem with drake_gc() (development drake only).
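Assuming the project uses the default .drake/ cache in the working directory, the call is simply:

library(drake)
drake_gc() # garbage-collect stray files in the default .drake/ cache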

ldecicco-USGS commented 6 years ago

Summary: brainstorm, create tools, and/or write a blog post on the best way to incorporate caching with drake.

wlandau commented 6 years ago

I would like to add another approach to this thread. By default, drake uses a storr_rds() cache (from storr) because it is fast and thread-safe. What makes this format difficult for collaboration is the swarm of RDS files that need to be uploaded. But you can supply your own storr_dbi() cache to the cache argument of make(). It even works with make(parallelism = "clustermq_staged") and make(parallelism = "future", caching = "master"). The advantage here is you only have a single database file to upload and sync.
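Here is a minimal sketch of that setup, assuming RSQLite as the backing database (the plan is a throwaway example):

library(drake)
library(storr)
library(DBI)
library(RSQLite)

plan <- drake_plan(result = sqrt(64)) # throwaway example plan

# One SQLite file holds the whole cache, so collaborators sync a single file.
con <- dbConnect(SQLite(), "drake-cache.sqlite")
cache <- storr_dbi(tbl_data = "data", tbl_keys = "keys", con = con)

make(plan, cache = cache)
readd(result, cache = cache)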

wlandau commented 5 years ago

@ldecicco-USGS, FYI: I just rewrote the drake manual's chapter on caches. I think the new revision more clearly explains drake's existing cache system.

In addition, the section on database caches shows one way to work around some portability issues. With some minor parallel computing caveats, it is straightforward to use a single SQLite database file as a drake cache. See also https://github.com/wlandau/drake-examples/tree/master/dbi and drake::drake_example("dbi").

Theoretically, it should also be possible to convert an existing drake cache to and from a single SQLite database file. For now, however, this process seems to invalidate targets: https://github.com/richfitz/storr/issues/93. cc @richfitz
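If someone wants to experiment, a rough sketch of the conversion might look like the following. It assumes the project uses the default .drake/ cache and that storr's list_namespaces() and export() methods behave as documented; as noted above, targets may still come out invalidated (richfitz/storr#93).

library(storr)
library(DBI)
library(RSQLite)

# Open the existing drake cache and a new single-file SQLite cache.
rds_cache <- storr_rds(".drake")
con <- dbConnect(SQLite(), "drake-cache.sqlite")
db_cache <- storr_dbi(tbl_data = "data", tbl_keys = "keys", con = con)

# Copy every key in every namespace from the RDS cache to the SQLite cache.
for (ns in rds_cache$list_namespaces()) {
  keys <- rds_cache$list(namespace = ns)
  rds_cache$export(db_cache, names = keys, namespace = ns)
}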

ldecicco-USGS commented 5 years ago

This is great! Thanks for writing up such a nice explanation.

wlandau commented 5 years ago

You are welcome. Please let me know if you think of more ways to ease collaboration on drake projects.