Closed wlandau closed 5 years ago
`config$meta` should be a `storr` with namespaces at the level of individual data objects:

- `name`: character string, names of the data objects and/or build steps.
- `is_build_step`: logical, whether the name is also the name of an import/target build step.
- `imported`: logical, was the item imported from the user's workspace/environment or generated with a `drake` command?
- `file`: whether the item is a file or simply a generic R object.
- `missing`: whether the item is present in `config$envir`.
- `hash`: the object's fingerprint.
- `mtime`: file modification time (`NULL` for non-files).

Additional namespaces for data objects that are also import/target build steps:

- `command`: command from the `drake_plan()`.
- `seed`: RNG seed.
- `trigger`: the trigger from the `drake_plan()` (or the default trigger).
- `object_in`: names of non-file dependencies.
- `file_out`: names of any file outputs (potentially more than one).
- `file_in`: names of any non-`knitr` input files.
- `knitr_in`: names of any `knitr` input files.
- `dependency_hash`: the hash used to trip the `depends` trigger.
- `file_hash`: the hash used to trip the `file` trigger.

All the namespaces that do not exist (yet) shall explicitly be set to `NULL` in order to avoid lookup errors.

We should also maintain overall `all_targets` and `all_imports` lists in `config`.
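As a rough sketch of how the namespace layout above could look with a `storr_environment()` cache (the target name and values here are made up for illustration):

```r
library(storr)

meta <- storr_environment()

# One namespace per field from the list above, keyed by target name:
meta$set("mtcars_summary", TRUE, namespace = "is_build_step")
meta$set("mtcars_summary", FALSE, namespace = "imported")
meta$set("mtcars_summary", "summarize(mtcars)", namespace = "command")

# Constant-time lookup by target and field:
meta$get("mtcars_summary", namespace = "command")
#> [1] "summarize(mtcars)"
```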
Forgive me if this is a dumb question, but what's the barrier to just using the plan `data.frame` for everything, with R's standard `[` lookups or `dplyr::filter()`? These aren't nearly as fast as a hash table, but they don't require building the table, which you would have to do for every `make()`, I presume. I couldn't seem to find any benchmarks comparing `storr_environment()` to something simple like `filter()`, but maybe something is out there.

Reference for benchmarking `[`, `filter()`, and the `data.table` equivalents: https://appsilondatascience.com/blog/rstats/2017/03/02/r-fast-lookup.html
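For what it's worth, a quick microbenchmark along these lines is easy to sketch (`microbenchmark` is just one choice of timing tool; the sizes and names are arbitrary):

```r
library(microbenchmark)

n <- 10000
plan <- data.frame(
  target = paste0("t", seq_len(n)),
  command = paste0("f(", seq_len(n), ")"),
  stringsAsFactors = FALSE
)

# Build the hash table (an environment) once up front.
lookup <- new.env(parent = emptyenv())
for (i in seq_len(n)) {
  assign(plan$target[i], plan$command[i], envir = lookup)
}

# Linear-time data frame subsetting vs. constant-time environment lookup:
microbenchmark(
  data_frame  = plan$command[plan$target == "t5000"],
  environment = get("t5000", envir = lookup, inherits = FALSE)
)
```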
There is no barrier, technically speaking. I am glad you checked with https://github.com/richfitz/storr/issues/81. From these results, plain environments seem to win out in terms of lookup speed, and I just assumed `storr_environment()` caches would offer similar speed and more convenience.
From https://github.com/richfitz/storr/issues/81#issuecomment-401631474, I now think the right approach is to have `config$meta` be a plain `new.env()` with all the individual `drake_meta()` lists (augmented with dependency information). Lookup speed, code style, and my ability to solve #435 and #283 will all improve.

Edit: existing `drake_meta()` info needs to be computed at the last minute, while dependency information can be resolved all at once up front. I'm thinking this lookup structure can just focus on being a more efficient version of `config$graph`, possibly even replacing `config$graph` at some point.
This data structure is really a protocol for building the targets. Whereas `drake_plan()` tries to be friendly to the user, the protocol tries to be friendly to the internals.
To do: put an `R6` wrapper around `config$protocol` with a bunch of safe accessor methods (which would also help enforce `inherits = FALSE` in `get()`).
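A hypothetical shape for that wrapper (class and method names here are illustrative, not drake's actual API):

```r
library(R6)

# Read-oriented wrapper around an environment of per-target lists.
Protocol <- R6Class(
  "Protocol",
  public = list(
    initialize = function(env) {
      private$env <- env
    },
    get_field = function(target, field) {
      # inherits = FALSE keeps lookups from leaking into parent scopes.
      get(target, envir = private$env, inherits = FALSE)[[field]]
    }
  ),
  private = list(env = NULL)
)
```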
Is it worth benchmarking `get()` wrapped up in the R6 piece before going that way? Should be easy enough to check.
I am going to put this issue on hold because:

- With `igraph_options(return.vs.es = FALSE)`, performance issues with graph lookups may not be such a big deal.
- `config$protocol` conceptually overlaps with the roles of `config$plan`, `config$graph`, and `drake_meta()`.
- The earlier approach was messy because I clung to the `igraph`.

The way I rely on `igraph`'s API seems to put strain on the internals. This new data structure, an environment of `code_dependencies()` lists, should really replace `config$graph` entirely. Execution order can be resolved with the priority queue rather than topological sorting.
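To illustrate the scheduling idea with a toy example (target names and dependencies are invented): each target tracks its number of unmet dependencies, and anything that reaches zero is ready to build, with no topological sort required.

```r
# Number of unmet dependencies per target (the priority-queue keys).
ndeps <- c(data = 0, model = 1, plot = 1, report = 2)
deps_of <- list(model = "data", plot = "data", report = c("model", "plot"))

built <- character(0)
while (length(ndeps) > 0) {
  # Everything with zero unmet dependencies is ready.
  ready <- names(ndeps)[ndeps == 0]
  built <- c(built, ready)
  ndeps <- ndeps[!(names(ndeps) %in% ready)]
  # Recount unmet dependencies for the remaining targets.
  for (t in names(ndeps)) {
    ndeps[t] <- sum(!(deps_of[[t]] %in% built))
  }
}
built
#> [1] "data"   "model"  "plot"   "report"
```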
I do have a graph_overhaul
branch, but totally gutting the internals is a massive undertaking. Given the reasonable demands of my day job, I am uncertain about my ability to follow through.
Then again, this issue is so huge in scope that more dust needs to settle before I am likely to attempt it seriously.
Also, we really do need to think about situations where a graph is really handy: for example, `prune_envir()` with the `"lookahead"` strategy and an efficient `outdated()` both look all the way downstream on the graph. Can the lookup object and priority queue really do the same thing?
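One graph-free possibility, sketched with an invented reverse-dependency list: a downstream query can be answered by walking plain named lists, though whether this stays efficient at scale is exactly the open question here.

```r
# Who depends on whom, stored in the reverse direction.
downstream_of <- list(
  data = c("model", "plot"),
  model = "report",
  plot = "report"
)

# Breadth-first walk collecting everything downstream of a target.
all_downstream <- function(target) {
  out <- character(0)
  frontier <- target
  while (length(frontier) > 0) {
    nxt <- unique(unname(unlist(downstream_of[frontier])))
    nxt <- setdiff(nxt, out)
    out <- c(out, nxt)
    frontier <- nxt
  }
  out
}

all_downstream("data")  # "model" and "plot", then "report"
```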
From #483, the complete dependency information of targets is neatly arranged in `igraph` attributes. With those attributes organized, I think we can revisit a lookup structure to cut back on the multitude of linear-time lookups in `drake`. Steps I will be taking to approach the problem:

- Deprecate `build_drake_graph()` and replace it with `drake_config(...)$graph`.
- Split the `build_drake_graph()` internals into `create_drake_lookup()` and `create_drake_graph()`, the latter of which will create the original graph (attributes and all) from the lookup structure.
- In addition to the `igraph` attributes, give the new lookup structure the target-specific elements of the plan. No more linear-time subsetting, e.g. `plan$command[plan$target == "my_target"]`. We will be using constant-time operations like `get(x = "my_target", envir = config$lookup, inherits = FALSE)$command` from now on.
- Keep the `igraph`.

I do not intend this new lookup data structure to replace the graph, at least at first, but it will be a useful supplement for speeding things up.
I plan to encapsulate the lookup operation internally.
```r
get_lookup <- function(target, field, config) {
  # Two-level dereference: first fetch the target's own environment...
  info <- get(
    x = target,
    envir = config$lookup,
    inherits = FALSE
  )
  # ...then the requested field within it.
  get(
    x = field,
    envir = info,
    inherits = FALSE
  )
}
```
Here, `info` is just a pointer (an environment). The two-part dereferencing may create a lot of cache misses (L1-L3 caches), but that's almost certainly not the bottleneck here.
Also, the lookup should be a read-only data structure. Of the possible names, I am leaning towards `specification` at this point, with a function `get_spec()` and no accompanying `set_spec()`.
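A minimal sketch of how a read-only `get_spec()` could work, assuming the spec is a locked environment (`lockEnvironment()` is base R; the target and field names are made up):

```r
# Build the spec, then freeze it so later writes error out.
spec <- new.env(parent = emptyenv())
spec$my_target <- list(command = "f(x)", seed = 1L)
lockEnvironment(spec, bindings = TRUE)

# Read access only; there is deliberately no set_spec().
get_spec <- function(target, config) {
  get(target, envir = config$spec, inherits = FALSE)
}

config <- list(spec = spec)
get_spec("my_target", config)$command
#> [1] "f(x)"
```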
Before diving headlong into this refactoring, I want to do more in-depth profiling. The `magrittr` pipe tends to complicate `profvis` output, so a first step is to completely clean out `%>%`.
Thanks to #571, `profvis` now delivers useful results. #572 accounts for half the target-level overhead we see in `drake_example("overhead")`. Given the amount of work and care it would take to solve #440, I am postponing it again until it becomes an actual bottleneck.
See #575.
This feature may not reduce overhead, but it should simplify the architecture.
New favorite term for this structure: "layout". I also thought about "topology", but it seems more redundant with "graph" and implies stronger-than-necessary mathematical claims.
Fixed via #576.
On reflection, the term "layout" is not quite right. Let's go with "workflow specification", or "spec" for short. The spec is `drake`'s interpretation of the plan. The `drake` plan is user-friendly, and all the dependency relationships are implicit. The spec is machine-friendly, and all the dependency relationships are explicit. `drake_config()` takes the plan and makes the spec by doing static code analysis on the commands and functions (i.e. `analyze_code()`).
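A toy illustration of that kind of static code analysis using base R (drake's real `analyze_code()` is far more thorough):

```r
# An unevaluated command, as it might appear in a plan.
command <- quote(summarize(clean(raw_data)))

# Symbols used as values (candidate target/import dependencies):
all.vars(command)
#> [1] "raw_data"

# Including the functions called:
all.names(command)
#> [1] "summarize" "clean"     "raw_data"
```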
The name change is implemented in 322785ead30be24da59ed8f19fbebb26c43c3803.
Lessons from #435 and #283: `drake` spends a lot of time repeatedly looking up target-level information it should already know.

**Explanation**

I think these and similar problems come from the limitations of `drake`'s data structures. The `plan` is great for users, but lookup is slow, and it contains only indirect information about each target's dependencies. The `graph` has complete dependency information, but it does not have other metadata (`igraph` attributes are a pain), and traversing the whole graph for dependencies is slow.

**Solution**

I want to put all target-level metadata in a unified fast-lookup data structure. This new `config$meta` will replace the individual outputs of `drake_meta()` flying around, and it will decrease reliance on the `plan` and `graph`. For fast lookup, I want to base this data structure on an environment. For convenience, I want it to have a `storr`-like API. Best of both worlds: a `storr_environment()` cache. (Thank you @richfitz for making it available!)

There is a lot of refactoring to do, but I think this will make `drake` a whole lot faster and give it a cleaner code base.