ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Dynamic branching #685

Closed wlandau closed 4 years ago

wlandau commented 5 years ago

We want to declare targets and modify the dependency graph while make() is running. Sometimes, we do not know what the targets should be until we see the values of previous targets. The following plan sketches the idea.

library(dplyr)
library(drake)
drake_plan(
  summaries = mtcars %>%
    group_by(cyl) %>%
    summarize(mean_mpg = mean(mpg)),
  individual_summary = target(
    filter(summaries, cyl == cyl_value),
    transform = cross(cyl_value = summaries$cyl)
  )
)

Issues:

  1. How will outdated() work now? Do we have to read the targets back into memory to check if the downstream stuff is up to date?
  2. This is the biggest implementation challenge drake has faced. Hopefully the work will migrate to the workers package.

wlandau commented 5 years ago

Registering dynamic sub-targets requires us to modify config objects, specifically the layout, graph, and priority queue. Because of the way the internals are currently structured, it would be best to modify these objects by reference. We already do this with the priority queue, and it is straightforward enough to use an environment instead of a list for the layout. But we may have to wrap the graph in an environment of its own. Added some action items.
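
To illustrate why an environment is easier to update than a list, here is a minimal base-R sketch. It is not drake's internal code: `register_in_list()`, `register_in_env()`, and the layout objects are hypothetical names invented for this example. A list is copied when a function modifies it, so the caller never sees the change unless the list is returned and reassigned. An environment is modified in place.

```r
# Hypothetical helpers for illustration only, not drake internals.
register_in_list <- function(layout, name) {
  layout[[name]] <- list(dynamic = TRUE)  # changes a local copy only
  invisible(NULL)
}

register_in_env <- function(layout, name) {
  assign(name, list(dynamic = TRUE), envir = layout)  # changes the caller's object
  invisible(NULL)
}

layout_list <- list()
layout_env <- new.env(parent = emptyenv())

register_in_list(layout_list, "y_0b3474bd")
register_in_env(layout_env, "y_0b3474bd")

length(layout_list)  # 0: the registration was lost
exists("y_0b3474bd", envir = layout_env, inherits = FALSE)  # TRUE
```

The same reasoning is why the graph might need a wrapper environment: any object with copy-on-modify semantics has to be returned and reassigned by every function that touches it.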

wlandau commented 4 years ago

Unfortunately, dynamic branching is currently slower than static branching when it comes to actually building targets.

library(drake)

plan_dynamic <- drake_plan(
  x = seq_len(1e4),
  y = target(x, dynamic = map(x))
)

plan_static <- drake_plan(
  z = target(w, transform = map(w = !!seq_len(1e4)))
)

cache_dynamic <- storr::storr_rds(tempfile())
cache_static <- storr::storr_rds(tempfile())

system.time(
  config_dynamic <- drake_config(
    plan_dynamic,
    cache = cache_dynamic,
    verbose = 0L
  )
)
#>    user  system elapsed 
#>   0.026   0.003   0.030

system.time(
  config_static <- drake_config(
    plan_static,
    cache = cache_static,
    verbose = 0L
  )
)
#>    user  system elapsed 
#>   1.904   0.004   1.910

system.time(
  suppressWarnings( # different issue
    make(config = config_dynamic)
  )
)
#>    user  system elapsed 
#>  78.014   3.630  81.767

system.time(
  suppressWarnings(
    make(config = config_static)
  )
)
#>    user  system elapsed 
#>  32.712   3.195  36.049

Created on 2019-11-02 by the reprex package (v0.3.0)

wlandau commented 4 years ago

The good news is that make() is much faster to initialize. Because we have smaller plans, drake_config() runs super quickly. And for subsequent make()s, it is faster to check if everything is up to date.

library(drake)
library(profile)
library(jointprof)

plan_dynamic <- drake_plan(
  x = seq_len(1e4),
  y = target(x, dynamic = map(x))
)

plan_static <- drake_plan(
  z = target(w, transform = map(w = !!seq_len(1e4)))
)

cache_dynamic <- storr::storr_rds(tempfile())
cache_static <- storr::storr_rds(tempfile())

system.time(
  config_dynamic <- drake_config(
    plan_dynamic,
    cache = cache_dynamic,
    verbose = 0L
  )
)
#>    user  system elapsed 
#>   0.027   0.003   0.032

system.time(
  config_static <- drake_config(
    plan_static,
    cache = cache_static,
    verbose = 0L
  )
)
#>    user  system elapsed 
#>   3.525   0.004   3.530

Rprof(filename = "dynamic.rprof")
suppressWarnings(
  system.time(make(config = config_dynamic), gcFirst = FALSE)
)
#>    user  system elapsed 
#>  99.096   3.656 102.928
Rprof(NULL)
data <- read_rprof("dynamic.rprof")
write_pprof(data, "dynamic.pprof")

Rprof(filename = "static.rprof")
suppressWarnings(
  system.time(make(config = config_static), gcFirst = FALSE)
)
#>    user  system elapsed 
#>  52.112   3.708  55.916
Rprof(NULL)
data <- read_rprof("static.rprof")
write_pprof(data, "static.pprof")

suppressWarnings(
  system.time(make(config = config_dynamic), gcFirst = FALSE)
)
#>    user  system elapsed 
#>   3.239   0.164   3.418

suppressWarnings(
  system.time(make(config = config_static), gcFirst = FALSE)
)
#>    user  system elapsed 
#>  13.847   0.472  14.347

file.copy("dynamic.pprof", "~/Downloads")
#> [1] TRUE
file.copy("static.pprof", "~/Downloads")
#> [1] TRUE

Created on 2019-11-02 by the reprex package (v0.3.0)

wlandau commented 4 years ago

I used the pprof files written at the end of the previous comment to generate the flame graphs below. The one on the left is from static branching, and the one on the right is from dynamic branching.

[Image: Screenshot_20191102_194727 (flame graphs: static branching on the left, dynamic branching on the right)]

It looks like the main hangup is loading sub-target dependencies and registering sub-targets. Not too surprising. Speeding this up is going to be another slow-going long-term project. If you have more examples that demonstrate slowness, please post them. It took a long time to get static branching as fast as it is now, and I expect the same for dynamic branching.

wlandau commented 4 years ago

Corrections to https://github.com/ropensci/drake/issues/685#issuecomment-541460119

The implementation in #1042 is different from https://github.com/ropensci/drake/issues/685#issuecomment-541460119. In particular, it departs from the flowchart in https://user-images.githubusercontent.com/1580860/66722470-27ede180-eddc-11e9-97ea-930c5a93d287.png.

Procedure for sub-targets

The procedure for sub-targets is actually simpler than I had originally planned.

  1. Check the static triggers of the dynamic target.
  2. If any static trigger fires, build all the sub-targets.
  3. If the static triggers do not fire, check all the sub-targets individually. It is not enough to check the dynamic dependencies as a whole because some of the sub-targets could have been deleted since the last make().
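
The three steps above can be sketched in base R as follows. This is a simplified illustration, not the code from #1042: `subtargets_to_build()` is a hypothetical function, the cache is modeled as a plain environment keyed by sub-target name, and the outcome of the static triggers is passed in as a logical value.

```r
# Hypothetical sketch; none of these names are drake functions.
# cache: an environment whose keys are the names of sub-targets already built.
subtargets_to_build <- function(subtarget_names, static_trigger_fired, cache) {
  if (static_trigger_fired) {
    return(subtarget_names)  # steps 1 and 2: rebuild every sub-target
  }
  # Step 3: check each sub-target on its own, since any could have been deleted.
  absent <- !vapply(
    subtarget_names,
    exists,
    logical(1),
    envir = cache,
    inherits = FALSE,
    USE.NAMES = FALSE
  )
  subtarget_names[absent]
}

cache <- new.env(parent = emptyenv())
for (name in c("y_0b3474bd", "y_b2a5c9b8")) assign(name, TRUE, envir = cache)
names_y <- c("y_0b3474bd", "y_b2a5c9b8", "y_71f311ad")

subtargets_to_build(names_y, static_trigger_fired = FALSE, cache)  # "y_71f311ad"
subtargets_to_build(names_y, static_trigger_fired = TRUE, cache)   # all three names
```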

Procedure for dynamic targets as a whole

Each dynamic target has its own value alongside the values of the sub-targets. We recompute this value if

  1. Any sub-target changed, or
  2. Any dynamic dependency changed as a whole.

Why (2)? Because in some situations, we already have all the sub-targets, but we use fewer of them.

library(drake)
plan <- drake_plan(
  x = seq_len(3),
  y = target(x, dynamic = map(x))
)
make(plan)
#> target x
#> subtarget y_0b3474bd
#> subtarget y_b2a5c9b8
#> subtarget y_71f311ad

# readd() and loadd() understand dynamic targets.
readd(y)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2
#> 
#> [[3]]
#> [1] 3

# But a dynamic target is really just a vector of hashes.
cache <- drake_cache()
cache$get("y")
#> [1] "3908fe5069df3c28" "16b3cb68bd4872ed" "1a3b3c0d06147d80"
#> attr(,"class")
#> [1] "drake_dynamic"

# What if we shorten y?
plan <- drake_plan(
  x = seq_len(2),
  y = target(x, dynamic = map(x))
)

# y needs to change, but we leave the sub-targets alone.
make(plan)
#> target x

# readd() and loadd() understand dynamic targets.
readd(y)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2

# But a dynamic target is really just a vector of hashes.
cache$get("y")
#> [1] "3908fe5069df3c28" "16b3cb68bd4872ed"
#> attr(,"class")
#> [1] "drake_dynamic"

Created on 2019-11-02 by the reprex package (v0.3.0)

Why the cryptic sub-target names?

The sub-target names are ugly (e.g. y_71f311ad) but incredibly useful.

  1. The suffixes of sub-targets are hashes of dynamic sub-dependencies. In other words, the act of computing the name is the same as checking if it is already up to date! All we need to do is check that the name exists in the cache! (After static triggers, of course.)
  2. The prefixes of static DSL get long and cumbersome too easily. A hash solves this problem because it has a fixed length by design, and it remains valid for all kinds of dynamic dependencies.
  3. A natural alternative is to index the sub-targets numerically, e.g. y_1, y_2, etc. (In fact, that is what I originally proposed in https://github.com/ropensci/drake/issues/685#issuecomment-541460119.) But if we did that, we would invalidate y_2 every time we insert an element in the middle of x. With hashes, we do not have this problem: the sub-targets of y can be in any order and still remain valid.

library(drake)
plan <- drake_plan(
  x = c("a", "b"),
  y = target(x, dynamic = map(x))
)

make(plan)
#> In drake, consider r_make() instead of make(). r_make() runs make() in a fresh R session for enhanced robustness and reproducibility.
#> target x
#> subtarget y_89ca58a1
#> subtarget y_38e75e51

plan <- drake_plan(
  x = c("a", "inserted_element", "b"),
  y = target(x, dynamic = map(x))
)

# Only one sub-target needs to build.
make(plan)
#> target x
#> subtarget y_06d53fef

# Permute x.
plan <- drake_plan(
  x = c("inserted_element", "b", "a"),
  y = target(x, dynamic = map(x))
)

# All sub-targets are still up to date!
make(plan)
#> target x

Created on 2019-11-02 by the reprex package (v0.3.0)
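
The naming idea can be reproduced outside drake with a small base-R sketch. Everything here is an assumption made for illustration: `toy_hash()` is a crude stand-in for a real hash function (drake's actual hashing is different), and `subtarget_names()`, `build_missing()`, and the environment used as a cache are hypothetical. The point is only that a name derived from the value doubles as the up-to-date check.

```r
# Toy stand-in for a real hash function; drake's actual hashing differs.
toy_hash <- function(value) {
  h <- 0
  for (byte in as.integer(serialize(value, NULL))) {
    h <- (h * 31 + byte) %% 2147483647
  }
  sprintf("%08x", as.integer(h))
}

# Name each sub-target after the hash of the element it depends on.
subtarget_names <- function(target, values) {
  paste0(target, "_", vapply(values, toy_hash, character(1), USE.NAMES = FALSE))
}

# "Build" every name that is not in the cache yet and report what was built.
build_missing <- function(names, cache) {
  absent <- setdiff(names, ls(cache))
  for (name in absent) assign(name, TRUE, envir = cache)
  absent
}

cache <- new.env(parent = emptyenv())
built_first <- build_missing(subtarget_names("y", c("a", "b")), cache)
built_insert <- build_missing(
  subtarget_names("y", c("a", "inserted_element", "b")),
  cache
)
built_permute <- build_missing(
  subtarget_names("y", c("inserted_element", "b", "a")),
  cache
)
lengths(list(built_first, built_insert, built_permute))
```

As in the reprex above, the first call builds two sub-targets, the insertion builds one, and the permutation builds none. With numeric suffixes, the insertion would have shifted every later index and forced rebuilds.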

wlandau commented 4 years ago

Implemented in #1042.

wlandau commented 4 years ago

Also noteworthy: mapping over rows: https://github.com/ropensci/drake/pull/1042#issuecomment-549096614

wlandau commented 4 years ago

New chapter in the manual: https://ropenscilabs.github.io/drake-manual/dynamic.html

wlandau commented 4 years ago

One source of overhead I overlooked: computing the hashes of sub-values that go into the names of sub-targets. Unavoidable, but not terrible.

wlandau commented 4 years ago

Dynamic parent targets are already vectors of hashes, so we can avoid this overhead if the dynamic dependency is itself dynamic: 5a07f675b1d0b648d6d61b6fa4cba2465c7bc941. Otherwise, we need to compute the hashes of all the sub-values.
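
A rough sketch of that shortcut, under stated assumptions: `sub_value_hashes()` and `counting_hash()` are hypothetical and do not reproduce the linked commit. The only detail taken from this thread is that a dynamic target is stored as a character vector of hashes with class "drake_dynamic".

```r
# Hypothetical sketch; hash_fun stands in for whatever hash function is in use.
sub_value_hashes <- function(dependency, hash_fun) {
  if (inherits(dependency, "drake_dynamic")) {
    return(as.character(unclass(dependency)))  # already a vector of hashes
  }
  vapply(dependency, hash_fun, character(1), USE.NAMES = FALSE)
}

# Placeholder hash that counts how often it is called.
calls <- 0L
counting_hash <- function(value) {
  calls <<- calls + 1L
  paste0("hash-of-", value)  # not a real hash
}

parent <- structure(
  c("3908fe5069df3c28", "16b3cb68bd4872ed"),
  class = "drake_dynamic"
)

sub_value_hashes(parent, counting_hash)
calls  # 0: the stored hashes were reused
sub_value_hashes(c("a", "b"), counting_hash)
calls  # 2: one hash per sub-value
```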

wlandau commented 4 years ago

Update: dynamic branching just got a huge speed boost in #1089 thanks to help from @billdenney and @eddelbuettel. With improvements both in development drake and development digest, dynamic branching is now about 33% faster than static branching overall. Benchmarking workflow: https://github.com/wlandau/drake-examples/blob/master/overhead/dynamic.R vs https://github.com/wlandau/drake-examples/blob/master/overhead/static.R.