ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0
1.34k stars 129 forks source link

A new internal fast-lookup data structure for target information #440

Closed wlandau closed 5 years ago

wlandau commented 6 years ago

Lessons from #435 and #283:

Explanation

I think these and similar problems come from the limitations of drake's data structures. The plan is great for users, but lookup is slow, and it contains only indirect information about each target's dependencies. The graph has complete dependency information, but it does not have other metadata (igraph attributes are a pain), and traversing the whole graph for dependencies is slow.

Solution

I want to put all target-level metadata in a unified fast-lookup data structure. This new config$meta will replace individual outputs of drake_meta() flying around, and it will decrease reliance on the plan and graph. For fast lookup, I want to base this data structure on an environment. For convenience, I want it to have a storr-like API. Best of both worlds: a storr_environment() cache. (Thank you @richfitz for making it available!)

There is a lot of refactoring to do, but I think this will make drake a whole lot faster and give it a cleaner code base.

wlandau commented 6 years ago

config$meta storr namespaces at the level of individual data objects:

Additional namespaces for data objects that are also import/target build steps:

All the namespaces that do not exist (yet) shall explicitly be set to NULL in order to avoid lookup errors.

wlandau commented 6 years ago

We should also maintain overall all_targets and all_imports lists in config.

kendonB commented 6 years ago

Forgive me if this is a dumb question, but what's the barrier to just using the plan data.frame for everything and just using R's standard [ lookups or dplyr::filter? These aren't nearly as fast as a hash table but they don't require building the table, which you would have to do for every make, I presume. I couldn't seem to find any benchmarks comparing storr_environment to something simple like filter but maybe something is out there.

ref for benchmarking [, filter and the data.table equivalents: https://appsilondatascience.com/blog/rstats/2017/03/02/r-fast-lookup.html

wlandau commented 6 years ago

There is no barrier, technically speaking. I am glad you checked with https://github.com/richfitz/storr/issues/81. From these results, plain environments seem to win out in terms of lookup speed, and I just assumed storr_environment()s would offer similar speed and more convenience.

From https://github.com/richfitz/storr/issues/81#issuecomment-401631474, I now think the right approach is to have config$meta be a plain new.env() with all the individual drake_meta() lists (augmented with dependency information). Lookup speed, code style, and my ability to solve #435 and #283 will all improve.

wlandau commented 6 years ago

Edit: existing drake_meta() info needs to be computed at the last minute, while dependency information can be resolved all at once up front. I'm thinking this lookup structure can just focus on being a more efficient version of config$graph, possibly even replacing config$graph at some point.

wlandau commented 6 years ago

This data structure is really a protocol for building the targets. Whereas the drake_plan() tries to be friendly to the user, the protocol tries to be friendly to the internals.

wlandau commented 6 years ago

To do: put an R6 wrapper around config$protocol with a bunch of safe accessor methods (would also help enforce inherit = FALSE in get()).

kendonB commented 6 years ago

Is it worth benchmarking get wrapped up in the R6 piece before going that way?

wlandau commented 6 years ago

Should be easy enough to check.

wlandau commented 6 years ago

I am going to put this issue on hold because

  1. 435 was solved a different way. With igraph_options(return.vs.es = FALSE), performance issues with graph lookups may not be such a big deal.

  2. In my work on #457, I realized that #440 may complicate the internals more than it is worth. It is extra work and clutter to maintain a new data structure, and it adds confusion because config$protocol conceptually overlaps with the roles of config$plan, config$graph, and drake_meta().
  3. 369 is all that could stand to gain from #440 right now, and I want to solve it in a way that cleans up the code instead of complicating it.

wlandau commented 6 years ago

The earlier approach was messy because I clung to the igraph. The way I rely on igraph's API seems to put strain on the internals. This new data structure, an environment of code_dependencies() lists, should really replace config$graph entirely. Execution order can be resolved with the priority queue rather than topological sorting.

I do have a graph_overhaul branch, but totally gutting the internals is a massive undertaking. Given the reasonable demands of my day job, I am uncertain about my ability to follow through.

wlandau commented 6 years ago

Then again, this issue is so huge in scope that more dust needs to settle before I am likely to attempt it seriously.

wlandau commented 6 years ago

Also, we really do need to think about situations where a graph is really handy: for example, prune_envir() with the "lookahead" strategy and an efficient outdated() look all the way downstream on the graph. Can the lookup object and priority queue really do the same thing?

wlandau commented 6 years ago

From #483, the complete dependency information of targets is neatly arranged in igraph attributes. With those attributes organized, I think we can revisit a lookup structure to cut back on the multitude of linear-time lookups in drake. Steps I will be taking to approach the problem:

wlandau commented 6 years ago

I do not intend this new lookup data structure to replace the graph, at least at first, but it will be a useful supplement for speeding things up.

wlandau commented 6 years ago

I plan to encapsulate the lookup operation internally.

get_lookup <- function (target, field, config){
  info <- get(
    x = target,
    envir = config$lookup,
    inherit = FALSE
  )
  get(
    x = field,
    envir = info,
    inherit = FALSE
  )
}

Here, info is just a pointer (an environment). The two-part dereferencing may create a lot of cache misses (L1-L3 caches) but that's almost certainly not the bottleneck here.

Also, the lookup should be a read-only data structure. possible names:

Leaning towards specification at this point, with a function get_spec() and no accompanying set_spec().

wlandau commented 6 years ago

Before diving headlong into this refactoring, I want to do more in-depth profiling. The magrittr pipe tends to complicate profvis output, so a first step is to completely clean out %>%.

wlandau commented 6 years ago

Thanks to #571, profvis now delivers useful results. #572 accounts for half the target-level overhead we see in drake_example("overhead"). Given the amount of work and care it would take to solve #440, I am postponing it again until it becomes an actual bottleneck.

wlandau commented 6 years ago

See #575.

wlandau commented 6 years ago

This feature may not reduce overhead, but it should simplify the architecture.

wlandau commented 6 years ago

Just started the work here.

wlandau commented 5 years ago

New favorite term for this structure: "layout". I also thought about "topology", but it seems more redundant with "graph" and implies stronger-than-necessary mathematical claims.

wlandau commented 5 years ago

Fixed via #576.

wlandau commented 4 years ago

On reflection, the term "layout" is not quite right. Let's go with "workflow specification", or "spec" for short. The spec is drake's interpretation of the plan. The drake plan is user-friendly, and all the dependency relationships are implicit. The spec is machine-friendly, and all the dependency relationships are explicit. drake_config() takes the plan and makes the spec by doing static code analysis on the commands and functions (i.e. analyze_code()).

The name change is implemented in 322785ead30be24da59ed8f19fbebb26c43c3803.