ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Drake, collaborative projects and heavy MCMC #1346

Closed. DRJP closed this issue 3 years ago.

DRJP commented 3 years ago

Hi

I am collaborating on a project with a colleague via drake and Git. On my side of the project I must run some very heavy MCMCs (written with the R package nimble, which compiles R-like code to C++ for speed) that run for many days. I am currently not running these inside drake, because I do not see how to avoid my colleague having to rerun all my MCMCs. The situation seems paradoxical, and we cannot figure out a reasonable workflow.

Do you have any recommendations for how a team can work with drake in situations involving lengthy calculations / simulations?

Cheers, David

wlandau commented 3 years ago

For multi-contributor workflows, there is unfortunately not a good way to do this in drake unless you both use the same file system on the same physical machine. drake caches are too heavy to be portable, and there is nothing I can do about it without breaking the whole package.

That is one of the primary reasons why I created targets, the long-term successor to drake. In targets, the data store is much lighter, more portable, more resistant to accumulating garbage, and more resilient when files are corrupted. All of this makes it easier to ship the `_targets/` data folder to GitHub (for small projects) or OSF/OneDrive/Box/Dropbox/Google Drive (for large projects). Even better, you can store everything in one or more AWS S3 buckets. Details on cloud storage are at https://wlandau.github.io/targets-manual/cloud.html. None of this will ever be possible in drake due to permanent design limitations.

So for your case, I definitely recommend:

  1. Use targets.
  2. Ship data to AWS S3: https://wlandau.github.io/targets-manual/cloud.html.
  3. Continue to use Git/GitHub to share code.
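The steps above might look something like the following `_targets.R`. This is a minimal sketch, not code from this thread: the bucket name `my-team-bucket` and the helper functions `prepare_data()` and `run_nimble_mcmc()` are placeholders you would define yourself, and the exact AWS storage syntax has changed across targets versions, so check the linked manual for the form matching your installation.

```r
# _targets.R -- hypothetical sketch of a shared pipeline with S3 storage.
library(targets)

# Store target data in an S3 bucket so collaborators who pull the code via
# Git can skip re-running heavy targets that are already up to date.
# NOTE: bucket name is a placeholder; the AWS interface may differ in newer
# versions of targets (see the cloud storage chapter of the manual).
tar_option_set(
  format = "aws_qs",                        # serialize targets to S3 via qs
  resources = list(bucket = "my-team-bucket")
)

list(
  # Placeholder data-preparation step.
  tar_target(model_data, prepare_data()),
  # The days-long nimble MCMC runs once; results land in S3, so a
  # collaborator's tar_make() sees it as up to date and does not rerun it.
  tar_target(mcmc_samples, run_nimble_mcmc(model_data))
)
```

With this setup, both collaborators run `tar_make()` locally; whoever runs the expensive target first uploads its result to the bucket, and everyone else simply downloads it.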

Related: I have an example targets workflow that validates a small Bayesian model.

That particular example does not use AWS S3, but it would be straightforward to add.