A new internal fast-lookup data structure for target information

wlandau commented 6 years ago

Lessons from #435 and #283:

drake spends a lot of time repeatedly looking up target-level information it should already know.
The dependency network is inflexible.

Explanation

I think these and similar problems come from the limitations of drake's data structures. The plan is great for users, but lookup is slow, and it contains only indirect information about each target's dependencies. The graph has complete dependency information, but it does not have other metadata (igraph attributes are a pain), and traversing the whole graph for dependencies is slow.

Solution

I want to put all target-level metadata in a unified fast-lookup data structure. This new config$meta will replace individual outputs of drake_meta() flying around, and it will decrease reliance on the plan and graph. For fast lookup, I want to base this data structure on an environment. For convenience, I want it to have a storr-like API. Best of both worlds: a storr_environment() cache. (Thank you @richfitz for making it available!)

There is a lot of refactoring to do, but I think this will make drake a whole lot faster and give it a cleaner code base.

wlandau commented 6 years ago

config$meta storr namespaces at the level of individual data objects:

name: character string, names of the data objects and/or build steps
is_build_step: logical, whether the name is also the name of an import/target build step
imported: logical, imported from the user's workspace/environment or generated with a drake command?
file: whether the item is a file or simply a generic R object.
missing: whether the item is present in config$envir
hash: the object's fingerprint.
mtime: file modification time (NULL for non-files)

Additional namespaces for data objects that are also import/target build steps:

command: command from the drake_plan()
seed: RNG seed
trigger: the trigger from the drake_plan() (or the default trigger)
object_in: names of non-file dependencies
file_out: names of any file outputs (potentially more than one)
file_in: names of any non-knitr input files
knitr_in: names of any knitr input files
dependency_hash: The hash used to trip the depends trigger.
file_hash: The hash used to trip the file trigger.

All the namespaces that do not exist (yet) shall explicitly be set to NULL in order to avoid lookup errors.

wlandau commented 6 years ago

We should also maintain overall all_targets and all_imports lists in config.

kendonB commented 6 years ago

Forgive me if this is a dumb question, but what's the barrier to just using the plan data.frame for everything and just using R's standard [ lookups or dplyr::filter? These aren't nearly as fast as a hash table but they don't require building the table, which you would have to do for every make, I presume. I couldn't seem to find any benchmarks comparing storr_environment to something simple like filter but maybe something is out there.

ref for benchmarking [, filter and the data.table equivalents: https://appsilondatascience.com/blog/rstats/2017/03/02/r-fast-lookup.html

wlandau commented 6 years ago

There is no barrier, technically speaking. I am glad you checked with https://github.com/richfitz/storr/issues/81. From these results, plain environments seem to win out in terms of lookup speed, and I just assumed storr_environment()s would offer similar speed and more convenience.

From https://github.com/richfitz/storr/issues/81#issuecomment-401631474, I now think the right approach is to have config$meta be a plain new.env() with all the individual drake_meta() lists (augmented with dependency information). Lookup speed, code style, and my ability to solve #435 and #283 will all improve.

wlandau commented 6 years ago

Edit: existing drake_meta() info needs to be computed at the last minute, while dependency information can be resolved all at once up front. I'm thinking this lookup structure can just focus on being a more efficient version of config$graph, possibly even replacing config$graph at some point.

wlandau commented 6 years ago

This data structure is really a protocol for building the targets. Whereas the drake_plan() tries to be friendly to the user, the protocol tries to be friendly to the internals.

wlandau commented 6 years ago

To do: put an R6 wrapper around config$protocol with a bunch of safe accessor methods (would also help enforce inherit = FALSE in get()).

kendonB commented 6 years ago

Is it worth benchmarking get wrapped up in the R6 piece before going that way?

wlandau commented 6 years ago

Should be easy enough to check.

wlandau commented 6 years ago

I am going to put this issue on hold because

435 was solved a different way. With igraph_options(return.vs.es = FALSE), performance issues with graph lookups may not be such a big deal.
In my work on #457, I realized that #440 may complicate the internals more than it is worth. It is extra work and clutter to maintain a new data structure, and it adds confusion because config$protocol conceptually overlaps with the roles of config$plan, config$graph, and drake_meta().
369 is all that could stand to gain from #440 right now, and I want to solve it in a way that cleans up the code instead of complicating it.

wlandau commented 6 years ago

The earlier approach was messy because I clung to the igraph. The way I rely on igraph's API seems to put strain on the internals. This new data structure, an environment of code_dependencies() lists, should really replace config$graph entirely. Execution order can be resolved with the priority queue rather than topological sorting.

I do have a graph_overhaul branch, but totally gutting the internals is a massive undertaking. Given the reasonable demands of my day job, I am uncertain about my ability to follow through.

wlandau commented 6 years ago

Then again, this issue is so huge in scope that more dust needs to settle before I am likely to attempt it seriously.

wlandau commented 6 years ago

Also, we really do need to think about situations where a graph is really handy: for example, prune_envir() with the "lookahead" strategy and an efficient outdated() look all the way downstream on the graph. Can the lookup object and priority queue really do the same thing?

wlandau commented 6 years ago

From #483, the complete dependency information of targets is neatly arranged in igraph attributes. With those attributes organized, I think we can revisit a lookup structure to cut back on the multitude of linear-time lookups in drake. Steps I will be taking to approach the problem:

[x] Deprecate build_drake_graph() and replace it with drake_config(...)$graph.
[ ] Divide the old build_drake_graph() internals into create_drake_lookup() and create_drake_graph(), the latter of which will create the original graph (attributes and all) from the lookup structure.
[ ] In addition to the original igraph attributes, give the new lookup structure the target-specific elements of the plan. No more linear-time subsetting, e.g. plan$command[plan$target == "my_target]. We will be using constant-time operations like get(x = "my_target", envir = config$lookup, inherit = FALSE)$command from now on.
[ ] Once those lookups are taken care of, take the extra attributes out of the original igraph.

wlandau commented 6 years ago

I do not intend this new lookup data structure to replace the graph, at least at first, but it will be a useful supplement for speeding things up.

wlandau commented 6 years ago

I plan to encapsulate the lookup operation internally.

get_lookup <- function (target, field, config){
  info <- get(
    x = target,
    envir = config$lookup,
    inherit = FALSE
  )
  get(
    x = field,
    envir = info,
    inherit = FALSE
  )
}

Here, info is just a pointer (an environment). The two-part dereferencing may create a lot of cache misses (L1-L3 caches) but that's almost certainly not the bottleneck here.

Also, the lookup should be a read-only data structure. possible names:

lookup
specification
register
registry
roster

Leaning towards specification at this point, with a function get_spec() and no accompanying set_spec().

wlandau commented 6 years ago

Before diving headlong into this refactoring, I want to do more in-depth profiling. The magrittr pipe tends to complicate profvis output, so a first step is to completely clean out %>%.

wlandau commented 6 years ago

Thanks to #571, profvis now delivers useful results. #572 accounts for half the target-level overhead we see in drake_example("overhead"). Given the amount of work and care it would take to solve #440, I am postponing it again until it becomes an actual bottleneck.

wlandau commented 6 years ago

See #575.

wlandau commented 6 years ago

This feature may not reduce overhead, but it should simplify the architecture.

wlandau commented 6 years ago

Just started the work here.

wlandau commented 5 years ago

New favorite term for this structure: "layout". I also thought about "topology", but it seems more redundant with "graph" and implies stronger-than-necessary mathematical claims.

wlandau commented 5 years ago

Fixed via #576.

wlandau commented 4 years ago

On reflection, the term "layout" is not quite right. Let's go with "workflow specification", or "spec" for short. The spec is drake's interpretation of the plan. The drake plan is user-friendly, and all the dependency relationships are implicit. The spec is machine-friendly, and all the dependency relationships are explicit. drake_config() takes the plan and makes the spec by doing static code analysis on the commands and functions (i.e. analyze_code()).

The name change is implemented in 322785ead30be24da59ed8f19fbebb26c43c3803.

ropensci / drake