@AlexAxthelm: Just to clarify: we use these verbs only to describe the workflow; the actual execution plan will be a full expansion. But if we know that, e.g., all analyses have been built, we don't need to look at the full expansion at all. We still need to figure out how to handle all this internally.
You may not have a clear answer for this, but as a design question: would each element (row) inside one of these tibble targets show up separately in outdated() or the graph? Or would all of analyses need to be rebuilt if a single one fell out of date? Not trying to be difficult, just having trouble imagining how this will play out in execution.
These questions are great!
We have an internal graph, where each row in analyses corresponds to an internal target with its own up-to-date status. We never materialize the full internal graph, but to build analyses we of course look at all internal dependencies. We don't need to update the rows in analyses that are up to date, and we don't materialize the entire analyses data frame either unless the user asks for it. For visualization, we can add a "partially outdated" state if only some rows in analyses are stale.
That sounds awesome. I'm fully in support of this over #77 then. A feature I would find useful as a user would be an accessible estimate of how outdated a target is, e.g. 7 of 9 elements need to be rebuilt. Other than that, I think all my concerns are met.
I am having trouble understanding this discussion, possibly because I do not use dplyr very much. Previously, I thought we were using the DSL to expand out a whole workflow plan data frame and then feed it to make() as usual. How long exactly will the expansion be delayed?
That's a point I haven't stressed enough. We use the DSL to avoid plan expansion for as long as we possibly can.
That would certainly gain us efficiency, and it would solve the catch-22 from #77 (group_by() without having the data yet). It's also a hard and delicate problem because we will need to make changes to our internal igraph object (config$graph) while make() is happening. All the current parallel backends require everything to be planned in advance, so I think we will need #227 first. Maybe we could work on the DSL interface and #227 at the same time. Then, when #227 is solved, we could remove most of the other backends and delay expansion for as long as possible.
Thankfully, none of this should affect the cache, which is the most sensitive component of all when it comes to backward compatibility.
@krlmlr, to confirm: the drake_plan() sketch you've outlined above, will it yield a plan that looks like what we currently see when we call load_basic_example(), or will it give something more along the lines of what we see in the current version of drake? I'm worried that there is some weirdness going on with mutate(result = list(reg(dataset))), which is intuitive, but I don't know how to look for reg aside from searching for column names in dependencies, and that assumes everything is a tibble. I guess I'm back to being a bit unclear on whether the new plans are aimed at creating targets that are tibbles, or creating plans that are tibbles?
This seems like there is potential for a lot of good power here, but I'm concerned about imposing a dplyr point-of-view on an otherwise agnostic tool.
For the sake of slow deprecation, I think we will get the chance to see how well the two interfaces coexist. Ideally, I would like drake to still be able to easily handle a data frame of text target names and text commands. I believe this format is very approachable for people new to R, and parsing is usually easy.
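For reference, a minimal plan in that classic format is nothing but two text columns; simulate() and reg1() here stand in for user-defined functions:
plan <- data.frame(
  target = c("data", "model"),
  command = c("simulate(48)", "reg1(data)"),
  stringsAsFactors = FALSE
)
# make(plan) requires nothing more than these two character columns.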
There is nothing incorrect about wildcard templating, but I think we could eventually offload it to the wildcard package and extend wildcard to handle tibbles of language objects. See #240 and wlandau/wildcard#7.
I'd say everything's a tibble, but I'm not sure about details here.
I need to take a closer look at wildcard.
There's not actually that much to look at; it's a simple idea I took out of remakeGenerator.
I would be in favor of turning ordinary workflow data frames into tibbles as early as possible. It really is about time.
Speaking of tibbles, there was a discussion somewhere about fixed column width printing, but I can't seem to find it. Workflow plan commands can get long.
Character columns take as much space as they can get in a tibble, but embedded newlines are not a problem. Does that answer your question?
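To see the width behavior concretely, here is a small self-contained illustration (the long command string is made up):
library(tibble)
plan <- tibble(
  target = "analysis",
  command = "summary(lm(y ~ x1 + x2 + x3 + x4 + x5, data = dataset))"
)
plan # the command column is truncated with an ellipsis to fit the console width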
I think I finally figured out how to express my concerns about all targets being tibbles: What happens when I want to build something which is normally a tibble?
drake_plan(
  foo = mtcars %>% as_tibble() %>% filter(am == 1)
)
will this try to bring all of the parallelism to bear on each row of the tibble? Is there a way to separate a tibble which is "just a tibble" from one that is "a drake_plan™ tibble"? What happens if I want to use NSE to build my plan as a tibble outside of drake_plan()?
We might need to invent verbs like drake_mutate() and drake_rowwise() etc. to avoid this confusion. Or we enter "drake" mode with another verb.
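A rough sketch of how that might read; drake_mutate() is purely hypothetical, named after the suggestion above:
library(drake)
drake_plan(
  foo = mtcars %>% as_tibble() %>% filter(am == 1) # an ordinary tibble target
) %>%
  drake_mutate(result = list(reg(foo))) # hypothetical plan-level verb, not a real drake function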
I am all for this! Dynamic branching ("delayed expansion"?) is very exciting and I recently ran into a problem with my data work that required it.
I think an entirely different set of functions should be used, so that it's clear what is a tibble operation within a target and what is an operation involving many drake targets. I suppose these functions would only be valid inside a target definition, like n() or everything().
plan_drake(
small = simulate(48),
large = simulate(64),
analyses = expand_target(
reg(dataset),
# crossing() is implicit
reg = list(reg1, reg2),
dataset = list(small, large)
),
summaries = expand_target(
# I wasn't sure why you had fun(dataset, result) here
# Not yet sure how to include that in this proposal
fun(result),
fun = list(coefficients, residuals),
result = analyses
),
# gather_targets evaluates to a tibble
# with a column for every expansion term
# used previously I guess? & value column
winners = gather_targets(summaries) %>%
group_by(dataset, fun) %>%
summarize(winner = min(value))
)
Expanding on this idea of special functions within target definitions, I've been imagining a feature where you could specify file targets & file dependencies from target defs instead of messing around with the single-quotes vs double-quotes thing. Or even triggers. EDIT: krlmlr had the same idea in #232 oops
plan_drake(
imported_data = read_csv(target_filedep("data.csv")),
report.md = target_file(knit(x)),
always_build = target_trigger("always", fun(x))
)
Need to keep in mind multi-file targets (#283) or directory targets (#12), and their R equivalents (named lists, recursive lists).
Let me jump into this discussion. Even though I use R regularly, I am still (heavily?) confused by the whole tidyverse stuff and the verbs you are discussing (maybe because I don't use the tidyverse daily?).
I personally think that the logic of the drake plan, and the verbs being discussed here, should be easily understandable even by non-tidyverse users. The tidyverse verbs seem to be great, but each time I have to use any of its functions, I have to search for examples to understand them.
In a nutshell: I don't think fluency in tidyverse logic should be a prerequisite for using the verbs discussed here.
I would opt for trying to stick to basic R logic as far as possible when building / using these verbs initially. This would open these features to many more users, and one could still add the tidyverse logic afterwards, probably as an additional package. One could even define an API for adding user-defined verbs, with different loadable packages containing these definitions; that would be, thinking about it, the most flexible approach to solving this problem.
From @rkrug via #298: we should think about a verb to chain targets together sequentially.
Yes, the ideas come from the tidyverse, but no prior tidyverse knowledge will be required. As I understand it, the verbs for creating plans will stand on their own, and documentation in drake will spell out what they mean.
OK - that would be very useful.
Hi all --
I've been brainstorming an alternative DSL recently and I have a first draft here: https://github.com/dapperjapper/drake/wiki/DSL-Proposal
It takes from tidyr's nest() and unnest() concepts, and the prototype interpreter I created actually uses unnest(). Basically, dynamic branching / delayed expansion is made intuitive by turning all expansion vectors into targets themselves. These targets can be "iterated" over by unnesting them with expand_targets(). Here is the simplest version of this pattern:
plan_df(
iterator = c(1, 4, 5)
) %>%
expand_targets(iterator) %>%
add_to_plan(
squared = iterator*iterator
)
Here there are three different versions of the target squared, one per value of iterator. If multiplication were extremely computationally expensive, each squared could be computed as a separate job via any drake parallelism.
Gathering targets is equivalent to a nest() operation (called collapse_targets() here):
plan_df(
iterator = c(1, 4, 5)
) %>%
expand_targets(iterator) %>%
add_to_plan(
squared = iterator*iterator
) %>%
collapse_targets(iterator) %>%
add_to_plan(
sum_squares = sum(unlist(squared))
)
I believe this to be a little bit easier to read and reason about compared to @krlmlr's (very inspiring) meta-make vision. Implementation will be a challenge... but do people think this is easier to understand?
Thanks, @dapperjapper! Sorry for my slow response. I have been waiting for a large chunk of quality time to seriously think about this. I will do my best to post feedback this week.
I looked at your proposal and tried out the code. Although I still have not wrapped my head around the details, I feel like I understand the concept a bit better. It seems like we group targets into lists, define dependency relationships among the groupings, and then expand things out at the last minute. The working code really helped. I like the analogy with nest() and unnest().
I find the data structures in this proposal somewhat unfamiliar in this use case. What is your current thinking on those wide-form tibbles?
> plan
# A tibble: 1 x 7
small large dataset reg summary_fun analysis summaries
<list> <list> <list> <list> <list> <list> <list>
1 <language> <language> <language> <language> <language> <language> <language>
It seems like the intent is to generate rows for the expansion.
> collect_plan(expanded_plan)
# A tibble: 8 x 7
small large analysis summaries dataset reg summary_fun
* <list> <list> <list> <list> <list> <list> <list>
1 <language> <language> <language> <language> <symbol> <symbol> <symbol>
2 <language> <language> <language> <language> <symbol> <symbol> <symbol>
3 <language> <language> <language> <language> <symbol> <symbol> <symbol>
4 <language> <language> <language> <language> <symbol> <symbol> <symbol>
5 <language> <language> <language> <language> <symbol> <symbol> <symbol>
6 <language> <language> <language> <language> <symbol> <symbol> <symbol>
7 <language> <language> <language> <language> <symbol> <symbol> <symbol>
8 <language> <language> <language> <language> <symbol> <symbol> <symbol>
How complicated can expansions be under this framework? If each expansion needs a new dimension, would deeper nestings require 3 or more dimensions? Personally, I find the long-form convention of a target column and a command column easy and tidy to work with. Do you think we can still use that convention? Maybe also by taking advantage of nest() and unnest() themselves somehow?
Thank you for reading! I agree that the long-form convention of drake plans is easier to work with.
The example code uses the wide format just because I thought it was easier to think about for now. In the wide-form, the x-axis is target names while the y-axis is permutations of targets.
I should have included a screenshot of View(expanded_plan) because printing language objects is never pretty for some reason:
Each row represents a separate little "universe" -- a permutation -- where targets (named by the column name) have meaning defined by the value in that same row.
The final implementation can store this information differently, but I thought this was easy to visualize as a parallel to unnest(). (Visualizing collapse_targets() this way has proved more difficult.)
I would also argue that the storage structure is an independent decision from the structure the user sees. I envision us using S3 methods to override print() and facilitate easier understanding of the various dimensions involved in their plan, without forcing the user to bear witness to a complex internal structure.
Additionally, dynamic branching / delayed expansion means that the complete plan won't always be viewable pre-making. If summary_fun = complex_operation_that_returns_list(), the expanded version will look like complex_operation_that_returns_list()[[1]] instead of coefficients. But we don't know the length of the list, so we can't even create a complete plan tibble...
I'm unsure what you mean by "If each expansion needs a new dimension, would deeper nestings require 3 or more dimensions?". In the image above, the plan is unnested on three "dimensions" -- three targets that can vary: dataset, reg, and summary_fun. Multiple unnested dimensions just increase the number of rows, while constant targets like small are just repeated. So the "dimensions" of the tibble (a two-dimensional object) don't directly correspond to the idea of "dimensions" in the target space.
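The row-multiplication behavior mirrors plain tidyr. A minimal illustration outside of drake, assuming only tibble and tidyr:
library(tibble)
library(tidyr)
plan <- tibble(
  dataset = list(c("small", "large")),
  reg     = list(c("reg1", "reg2"))
)
plan %>% unnest(cols = dataset) %>% unnest(cols = reg)
# 4 rows: one per dataset/reg combination; a third varying target
# would multiply the row count again rather than add a dimension.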
Forgive me for being a little loose with vocabulary -- I don't have formal computer science training, and I'm sure there is some established conceptual framework for this problem that I'm not using.
Update: I started a GitHub project for this.
I have not worked on the DSL at all this past year, but I do care a lot about it.
From my end, the size and complexity of drake's code base have been a major roadblock. Downsizing is one of the primary development goals (with its own GitHub project). Because of #648 and issues like it, I think we are already well past apogee.
In the initial stages of the DSL, because of the sheer scope of @krlmlr's idea, I would prefer to treat the API as totally separate from how targets are declared and built. It is relatively easy to play around with how drake_plan() parses user input, but delayed specification presents a very different and very formidable set of challenges on the implementation side.
As drake goes through its downsizing phase, maybe we could work on the API piece in an external package first. To start off, we could make it use the proposed dplyr syntax to create the fully-expanded plans that drake currently requires. Once it matures enough, we could either migrate it into core drake or leave it as its own third-party API extension.
@krlmlr, @AlexAxthelm, @dapperjapper, and @rkrug:
For https://github.com/ropensci/drake/issues/233#issue-294175780, I think we can use something that
- handles groupings like datasets, analyses, and summary_funs,
- stays inside the drake_plan() function itself, and
- uses syntax drake already understands.
library(drake)
drake_plan(
small = target(simulate(48), data = small),
large = target(simulate(64), data = large),
reg = target(
reg_fun(data),
do = crossing,
by = list(reg_fun, data),
reg_fun = list(reg1, reg2)
),
summary = target(
sum_fun(data, reg),
do = crossing,
by = list(sum_fun, reg),
sum_fun = list(coefficients, residuals)
),
winners = target(
min(summary),
do = summarize,
by = list(data, sum_fun)
)
)
#> # A tibble: 5 x 7
#> target command data do by reg_fun sum_fun
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 small simulate(48) small <NA> <NA> <NA> <NA>
#> 2 large simulate(64) large <NA> <NA> <NA> <NA>
#> 3 reg reg_fun(dat… <NA> crossi… list(reg_… list(reg1… <NA>
#> 4 summary sum_fun(dat… <NA> crossi… list(sum_… <NA> list(coefficien…
#> 5 winners min(summary) <NA> summar… list(data… <NA> <NA>
Created on 2019-01-13 by the reprex package (v0.2.1)
As long as I am thinking out loud:
- drake_plan(dsl = TRUE) could do the expansion up front, and the result should be compatible with the current make().
- make(dsl = TRUE) could activate lazy expansion / dynamic branching.
...though dsl might not be a good choice for an argument name.
A big thanks to @krlmlr for the target() function, which opened up more doors than I realized at the time.
I will keep iterating on this. We might find a sweet spot with the right combination of language and custom columns. An improvement:
library(drake)
drake_plan(
small = simulate(48),
large = simulate(64),
reg = target(
reg_fun(data),
transform = cross(reg_fun = c(reg1, reg2), data = c(small, large))
),
summary = target(
sum_fun(data, reg),
transform = cross(sum_fun = c(coefficients, residuals), reg)
),
winners = target(
min(summary),
transform = summarize(data, sum_fun)
)
)
#> # A tibble: 5 x 3
#> target command transform
#> <chr> <chr> <chr>
#> 1 small simulate(48) <NA>
#> 2 large simulate(64) <NA>
#> 3 reg reg_fun(data) cross(reg_fun = c(reg1, reg2), data = c(small,…
#> 4 summary sum_fun(data, re… cross(sum_fun = c(coefficients, residuals), re…
#> 5 winners min(summary) summarize(data, sum_fun)
Created on 2019-01-14 by the reprex package (v0.2.1)
See #674 for an experimental API inspired by the proposed DSL. The implementation is lightweight, and because it relies on a custom "transform" column in the plan, it does not interfere with any other functionality (internals or API).
New capability: define custom groupings with a group field in target():
library(drake)
plan <- drake_plan(
small = simulate(48),
large = simulate(64),
reg1 = target(
reg_fun(data),
transform = cross(data = c(small, large)),
group = reg
),
reg2 = target(
reg_fun(data),
transform = cross(data = c(small, large)),
group = reg
),
winners = target(
min(reg),
transform = summarize(data),
a = 1
)
)
plan
#> # A tibble: 8 x 3
#> target command a
#> <chr> <chr> <dbl>
#> 1 small simulate(48) NA
#> 2 large simulate(64) NA
#> 3 reg1_small reg_fun(small) NA
#> 4 reg1_large reg_fun(large) NA
#> 5 reg2_large reg1_large_fun(large) NA
#> 6 reg2_small reg1_small_fun(small) NA
#> 7 winners_large min(reg1_large = reg1_large, reg2_large = reg2_large) 1
#> 8 winners_small min(reg1_small = reg1_small, reg2_small = reg2_small) 1
drake_plan_source(plan)
#> drake_plan(
#> small = simulate(48),
#> large = simulate(64),
#> reg1_small = reg_fun(small),
#> reg1_large = reg_fun(large),
#> reg2_large = reg1_large_fun(large),
#> reg2_small = reg1_small_fun(small),
#> winners_large = target(
#> command = min(reg1_large = reg1_large, reg2_large = reg2_large),
#> a = 1
#> ),
#> winners_small = target(
#> command = min(reg1_small = reg1_small, reg2_small = reg2_small),
#> a = 1
#> )
#> )
config <- drake_config(plan)
vis_drake_graph(config)
Created on 2019-01-16 by the reprex package (v0.2.1)
@wlandau What version is this work targeted for?
The very next release: 7.0.0
My two cents to this (FYI): So do you for now have transform and grouping? Is there likely other functionality in the future? I am not sure if it does not overload the target argument. Also, I am not sure how these are verbs and domain specific (just mentioning this because that's how I understood the initial idea). Just from looking at the syntax, the idea of separating the plan creation with a wild card and the "folding it up" as it was before was simpler to digest for me.
My two cents to this (FYI): So do you for now have transform and grouping?
Yes. It is experimental (as the documentation now indicates) but behavior seems correct so far.
Is there likely other functionality in the future?
Dynamic branching is high on the list for long-term. But for this API specifically, hopefully we will not need more features. I would prefer to keep it simple, and it already seems to cover the vast majority of the use cases for the map/reduce functions and wildcards. But I could be convinced otherwise.
I am not sure if it does not overload the target argument.
The target() function was designed to allow any custom column to be added to the plan. Behind the scenes, transformation and grouping add and manipulate custom columns, so I believe this falls within scope.
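For instance, cpu and note below are arbitrary names, not special drake arguments:
library(drake)
plan <- drake_plan(
  x = target(simulate(48), cpu = 4, note = "anything goes")
)
# cpu and note become custom columns in the plan,
# alongside the standard target and command columns.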
Also, I am not sure how these are verbs and domain specific (just mentioning this because that's how I understood the initial idea).
Yes, I agree. That does not bother me so much. At the interface level specifically, this still solves the same problem as the DSL.
Just from looking at the syntax, the idea of separating the plan creation with a wild card and the "folding it up" as it was before was simpler to digest for me.
I plan to keep the wildcard functions around for a long time.
My personal experience with this new interface is actually more positive. It takes effort and bookkeeping to wrangle all those subplans and wildcards. I find it much easier to use transformations and grouping in a single call to drake_plan(), and I strongly suspect that it will be a breath of fresh air to users.
Thanks @wlandau for the detailed answer, that makes sense.
You are welcome.
After talking with @krlmlr in person yesterday at RStudio conf, I have decided to think of this approach as the proper DSL. We can open a different issue for dynamic branching.
Major changes needed to consider this issue solved:
- Parse the transform and command objects properly, and the return values should be detailed lists of the information content.
- Use substitute() instead of text replacement for editing commands:
command <- quote(reg_fun(x, "y", x, k))
eval(call("substitute", command, list(x = "str", k = quote(sym))))
#> reg_fun("str", "y", "str", sym)
Created on 2019-01-18 by the reprex package (v0.2.1)
Also, we should add a map() transformation because of https://github.com/ropensci/drake/issues/235#issue-294321106.
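A sketch of what a map() transformation could look like, mirroring the cross() syntax above:
library(drake)
drake_plan(
  data = target(simulate(n), transform = map(n = c(48, 64)))
)
# expands to data_48 = simulate(48) and data_64 = simulate(64),
# one target per element of n instead of all crossed combinations.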
By the way, tidy evaluation works in the DSL. You can generate super large plans this way. A taste:
sms <- rlang::syms(letters)
drake::drake_plan(x = target(f(char), transform = map(char = !!sms)))
#> # A tibble: 26 x 2
#> target command
#> <chr> <chr>
#> 1 x_a f(a)
#> 2 x_b f(b)
#> 3 x_c f(c)
#> 4 x_d f(d)
#> 5 x_e f(e)
#> 6 x_f f(f)
#> 7 x_g f(g)
#> 8 x_h f(h)
#> 9 x_i f(i)
#> 10 x_j f(j)
#> # … with 16 more rows
Created on 2019-01-20 by the reprex package (v0.2.1)
Cool. Should functions like map() be prefixed with drake_ in order to avoid namespace conflicts with purrr?
Fortunately, we avoid namespace conflicts entirely because map() is in 'transform', not the command. The DSL code is analyzed statically and not executed in the usual sense.
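To illustrate the separation, here is a sketch where a and b stand in for list-valued targets defined elsewhere:
library(drake)
drake_plan(
  squares = target(
    purrr::map(xs, function(x) x^2), # runs when the target builds
    transform = map(xs = c(a, b))    # parsed statically by drake_plan()
  )
)
# yields squares_a and squares_b; the two map()s never collide.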
Ok, I did not know that. I think the average user also may not know it. If it is not an exported function, I guess there is also no exported documentation, i.e. ?map won't be helpful? I think it's similar to vars() in dplyr: you also use it exclusively within a dplyr call. However, vars() is exported and documented.
vars() is a new one for me. Looks like it sets a good precedent.
In the specific case of drake, I am resistant to exporting map(), cross(), and reduce(). These symbols are not actually defined in the code base, and drake's API is already enormous.
Hopefully we can make the documentation friendly and thorough. I just pushed an update to the ?drake_plan() help file and added a new section in the manual.
Proposed verbs: tibble(), crossing(), mutate(), group_by(), summarize(), replacing plan_analyses() and plan_summaries().
Sketch for "basic" example
We might want to use our own verbs; this is the current tidyverse nomenclature to communicate semantics.