ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Functions to import targets from other caches #1339

Closed: ghost closed this issue 3 years ago

ghost commented 3 years ago

Prework

This is a follow-up to issue 1100, "Using targets imported from another cache."

Proposal

I've created two functions, trigger_from and target_from, which simplify the user input process.

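library(drake)
library(magrittr)

# Build and evaluate a command that reads `target_name` from the cache at `cache_path`.
target_from <- function(cache_path, target_name) {
  bquote(ignore(.(drake_cache(path = cache_path))$get(.(target_name)))) %>% eval
}

# Build a trigger that fires when the hash of `target_name` in that cache changes.
trigger_from <- function(cache_path, target_name) {
  bquote(ignore(trigger(change = .(drake_cache(path = cache_path))$get_hash(.(target_name))))) %>% eval
}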

In my work, we split the analysis across multiple drake plans/projects to keep each one small and readable. The final step in our project imports data from those caches and combines it for the final analyses. trigger_from and target_from also work with transform.

Example:

library(drake)
library(data.table)

cache1 <- new_cache("../cache1")
cache2 <- new_cache("../cache2")
cache3 <- new_cache("cache3")

plan1 <- drake_plan(
  x = letters[1:5],
  y = "!"
)
make(plan1, cache = cache1)

plan2 <- drake_plan(
  z = "hi"
)
make(plan2, cache = cache2)

ids <- data.table(
  cache_paths = c("../cache1", "../cache2"),
  target_names = c("x", "z"),
  names = c("x_cache1", "z_cache2")
)

plan3 <- drake_plan(
  imports = target(
    command = target_from(cache_path, target_name),
    trigger = trigger_from(cache_path, target_name),
    transform = map(
      cache_path = !!ids$cache_paths,
      target_name = !!ids$target_names,
      .names = !!ids$names
    )
  )
)
make(plan3, cache = cache3)

Some extra details on the functions: bquote() quotes an R expression while evaluating only the parts wrapped in .(). For example, .(target_name) is replaced with the user-supplied target_name. ignore() tells drake not to look for dependencies inside the wrapped code, because the targets come from external drake plans. eval() evaluates the expression that bquote() returns; try running the bquote() call without eval() to see the intermediate expression.
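To make the bquote()/eval() step concrete, here is a minimal base R sketch. It does not use drake at all: paste() stands in for the cache lookup, and the value of target_name is a made-up example.

```r
# Hypothetical value standing in for a user-supplied target name.
target_name <- "x"

# bquote() quotes the whole expression but evaluates anything wrapped in .().
expr <- bquote(paste("target:", .(target_name)))

# The result is an unevaluated call with the value already substituted in.
print(expr)

# eval() then runs that call.
result <- eval(expr)
print(result)
```

Printing expr before calling eval() shows the substituted call, which is the same inspection suggested above for the bquote() part of target_from and trigger_from.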

wlandau commented 3 years ago

Thanks for these ideas. Several users have asked how to use multiple caches like this.

I would actually prefer to treat this as an optional programming technique rather than a built-in feature. The multiple-cache approach has significant problems, and there are reasonable workarounds. (In fact, the targets package actively resists multiple caches.) Problems:

  1. It requires the user to have a solid understanding of how caches work, and even then it is easy to make a mistake: for example, accidentally selecting the wrong cache or no cache at all.
  2. The process of shuffling targets to different caches can be computationally inefficient.
  3. Extra flexibility leads to more disorganization. Most users are not software engineers, so they do not have much experience managing technical debt. In other words, it is too easy for a new user to create a large unmaintainable mess by accident. For the sake of childproofing, I would prefer to discourage this pattern in most situations.
  4. There are alternative workarounds that add basically the same degrees of flexibility and do not have problems (1)-(3).

Workarounds:

  1. make() different plans with the same cache.
  2. Do all the heavy lifting in completely isolated projects with their own plans, caches, and file systems. Then, tie everything together with a top-level R Markdown report that just readd()s from those caches. (Literally, the only R code in the chunks should be readd() statements, with the possible exception of library(drake).) In this setup, everything is a target until the very last step, and the need to shuffle around to different caches is kept to a minimum.

I have yet to find an example project that cannot be expressed in terms of (1) or (2) above.
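A rough sketch of what the two workarounds could look like, reusing the small plans from the example earlier in this thread. The cache locations (temporary directories) and the variable names are assumptions for illustration only; a real project would use its own paths.

```r
library(drake)

plan_a <- drake_plan(
  x = letters[1:5],
  y = "!"
)
plan_b <- drake_plan(
  z = "hi"
)

# Workaround 1: make() different plans with the same cache.
# `shared` is a hypothetical cache in a temporary directory.
shared <- new_cache(tempfile("shared_cache"))
make(plan_a, cache = shared)
make(plan_b, cache = shared)
z_shared <- readd(z, cache = shared)

# Workaround 2: isolated projects, each with its own cache.
# Here both are built in one script only to keep the sketch self-contained.
cache_a <- new_cache(tempfile("project_a"))
cache_b <- new_cache(tempfile("project_b"))
make(plan_a, cache = cache_a)
make(plan_b, cache = cache_b)

# The top-level report would contain only readd() calls like these.
x_value <- readd(x, cache = cache_a)
z_value <- readd(z, cache = cache_b)
```

In the second setup nothing is copied between caches: the report reads each finished target directly from the project that built it.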

ghost commented 3 years ago

Thanks. I agree it is definitely inefficient to duplicate the target storage among several caches. My team works in a secure environment that doesn't have access to Git, so we're trying to be as safe and robust as possible.

I'll give workaround 1 a go. I haven't tried multiple plans within the same cache. I assume we have to be very careful with target naming? Two different plans within the same cache should not share the same target name?

wlandau commented 3 years ago

> I assume we have to be very careful with target naming? Two different plans within the same cache should not share the same target name?

Yes, that's right. Otherwise, in the worst-case scenario, you will have a self-invalidating workflow. Naming is hard with any kind of programming, and most attempts at a solution are outside the scope of drake.
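One way to catch a clash before calling make() is to compare the target columns of the plans. This is only a sketch of a manual check, not a drake feature, and the plans and target names below are hypothetical. If two plans that share a cache both define clean_data with different commands, each make() would see a changed command and rebuild the target, which is the self-invalidating situation described above.

```r
library(drake)

# Two hypothetical plans that accidentally reuse the name `clean_data`.
plan_a <- drake_plan(
  clean_data = "from plan A",
  summary_a = nchar(clean_data)
)
plan_b <- drake_plan(
  clean_data = "from plan B",
  summary_b = toupper(clean_data)
)

# A drake plan is a data frame with a `target` column,
# so shared names are just the intersection of the two columns.
clashes <- intersect(plan_a$target, plan_b$target)

if (length(clashes) > 0) {
  message("Target names used by both plans: ", paste(clashes, collapse = ", "))
}
```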