richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

Best practices for collaborative work on a single RDS storr? #92

Open wlandau opened 6 years ago

wlandau commented 6 years ago

@ldecicco-usgs has raised this issue as it applies to drake. It can be challenging to commit/upload all the tiny files of an RDS storr to GitHub/Dropbox/Google Drive. I think a vignette might help if there are counterintuitive workarounds and/or good existing best practices. Related: #1, #4, #16, #37, #77.

richfitz commented 5 years ago

There's import/export support between stores; that idea could be extended to archive to zip. I thought I'd done that already, but it seems not.

Can you sketch out what you need? I don't currently have this use case. Once I can see the requirements, I can look at what it would take to support it.
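For readers unfamiliar with the import/export support mentioned above, here is a minimal sketch of copying every object from one RDS storr into another. The temporary paths and the keys `a` and `b` are made up for illustration, and this only shows the existing store-to-store copy, not the zip archiving idea, which does not exist yet.

```r
library(storr)

# Source store: an RDS storr in a temporary directory (illustrative path).
src <- storr_rds(tempfile("storr_src_"))
src$set("a", 1:10)
src$set("b", "hello")

# Destination store: a second, initially empty RDS storr.
dest <- storr_rds(tempfile("storr_dest_"))

# With the default list = NULL, export() copies all keys
# in the default namespace from src into dest.
src$export(dest)

copied_keys <- sort(dest$list())
copied_a <- dest$get("a")
```

The same copy can be driven from the receiving side with `dest$import(src)`.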

wlandau commented 5 years ago

I am not sure I can sketch out any one specific best solution right now. I am just looking for guidance on the most efficient ways to transport RDS stores so I can provide better recommendations on how multiple users can collaborate on drake projects. I have some ideas, but not all of them work, and I do not think they are exhaustive.

Even outside storr's API, it is already straightforward to zip up and transport stores. drake targets even remain up to date that way.

library(drake)
load_mtcars_example() # Load packages and functions (& write report.Rmd)
make(my_plan, verbose = FALSE) # Run the project
zip(zipfile = "cache.zip", files = ".drake", flags = "-qr9X") # Zip up the cache.
wd <- getwd() # Directory where the project was built.
dir <- tempfile() # Go to a new directory.
dir.create(dir)
setwd(dir)
load_mtcars_example() # Load packages and functions (& write report.Rmd)
unzip(file.path(wd, "cache.zip")) # Unpack the cache.
tmp <- file.copy(from = file.path(wd, "report.md"), to = ".") # Get the compiled report too.
make(my_plan) # Everything is up to date in the new location.
#> All targets are already up to date.

Created on 2018-12-13 by the reprex package (v0.2.1)

However, this process duplicates information, so it scales poorly to projects with large datasets. That is why I think https://github.com/richfitz/storr/issues/93 could go a long way towards a good recommendation.

For projects with the potential for collaboration, I already recommend starting out with DBI format, but this does not help existing drake workflows.
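As a rough illustration of the DBI recommendation, the sketch below keeps a storr in a single SQLite file, which is easier to share than a directory of small RDS files. It assumes the DBI and RSQLite packages are installed. The file path, the table names `storr_data` and `storr_keys`, and the key `x` are arbitrary choices for this example.

```r
library(storr)

# A single SQLite file holds the whole store (illustrative temporary path).
db_file <- tempfile(fileext = ".sqlite")
con <- DBI::dbConnect(RSQLite::SQLite(), db_file)

# Table names are arbitrary; they only need to match when reopening the store.
st <- storr_dbi(tbl_data = "storr_data", tbl_keys = "storr_keys", con = con)
st$set("x", mtcars)
DBI::dbDisconnect(con)

# Later, or on another machine after copying db_file, reopen the same store.
con2 <- DBI::dbConnect(RSQLite::SQLite(), db_file)
st2 <- storr_dbi(tbl_data = "storr_data", tbl_keys = "storr_keys", con = con2)
restored <- st2$get("x")
DBI::dbDisconnect(con2)
```

A store opened this way can then be handed to drake through the `cache` argument of `make()`.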

With storrs (ref: https://github.com/richfitz/storr/issues/61) collaboration might not require any copying at all, but this would almost certainly slow down the execution of drake::make().

I also wonder if containerization can help somehow. It's hard to beat reproducibility and portability in the feature set covered by Docker and Singularity.

r2evans commented 4 years ago

@wlandau, I believe this is the updated link for your last comment about DBI: https://books.ropensci.org/drake/storage.html#interfaces-to-the-cache