ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

DSL based on dplyr-like verbs? #233

Closed krlmlr closed 5 years ago

krlmlr commented 6 years ago

Sketch for "basic" example

We might want to use our own verbs; the current tidyverse nomenclature is used here only to communicate semantics.

drake_plan(
  small = simulate(48),
  large = simulate(64),
  datasets = tibble(dataset = list(small, large)),
  regressions = tibble(reg = list(reg1, reg2)),
  analyses = crossing(datasets, regressions) %>%
    rowwise() %>% 
    mutate(result = list(reg(dataset))),
  summary_funs = tibble(fun = list(coefficients, residuals)),
  summaries = crossing(analyses, summary_funs) %>%
    rowwise() %>% 
    mutate(summary = fun(dataset, result)),
  winners = summaries %>%
    group_by(dataset, fun) %>%
    summarize(winner = min(summary))
)
krlmlr commented 6 years ago

@AlexAxthelm: Just to clarify: We use these verbs only to describe the workflow, the actual execution plan will be a full expansion. But if we know that e.g. all analyses have been built, we don't need to look at the full expansion at all. We still need to figure out how to handle all this internally.
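
To make "full expansion" concrete, here is a minimal base-R sketch of what the analyses step could expand to. The helper expand_crossing() and the target naming scheme are hypothetical, for illustration only, and not part of drake.

```r
# Hypothetical illustration: expand a crossing of regressions and datasets
# into a flat target/command table, one row per combination.
expand_crossing <- function(funs, datasets) {
  grid <- expand.grid(fun = funs, dataset = datasets, stringsAsFactors = FALSE)
  data.frame(
    target = paste(grid$fun, grid$dataset, sep = "_"),
    command = sprintf("%s(%s)", grid$fun, grid$dataset),
    stringsAsFactors = FALSE
  )
}

analyses <- expand_crossing(c("reg1", "reg2"), c("small", "large"))
print(analyses)
```

The DSL would describe this table with a single line, while the expanded form has one row per regression/dataset pair.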

AlexAxthelm commented 6 years ago

You may not have a clear answer for this, but as a design question, would each element (row) inside one of these tibble targets show up separately in outdated or the graph? Or would all of analyses need to be rebuilt if a single one became outdated? Not trying to be difficult, just having trouble imagining how this will play out in execution.

krlmlr commented 6 years ago

These questions are great!

We have an internal graph in which each row in analyses corresponds to an internal target with its own up-to-date status. We never materialize the full internal graph, but to build analyses we of course look at all internal dependencies. We don't need to update those rows in analyses that are up to date, and we don't materialize the entire analyses data frame either unless the user asks for it. For visualization, we can add a "partially outdated" state if only some rows in analyses are stale.

AlexAxthelm commented 6 years ago

That sounds awesome. I'm fully in support of this over #77 then. A feature that I would find useful as a user would be an accessible estimate of how outdated a target is, e.g., 7 of 9 elements need to be rebuilt. Other than that, I think all my concerns are met.
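
A rough sketch of how the per-row status and the "n of m outdated" estimate could be derived, in base R. The internal table and the function summarize_status() are assumptions made up for this illustration; nothing here reflects drake's actual internals.

```r
# Hypothetical bookkeeping: one row per internal target, grouped by the
# user-facing target it belongs to.
internal <- data.frame(
  parent = c(rep("analyses", 4), rep("summaries", 3)),
  outdated = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

summarize_status <- function(internal) {
  n_total <- tapply(internal$outdated, internal$parent, length)
  n_outdated <- tapply(internal$outdated, internal$parent, sum)
  # Three states: nothing stale, everything stale, or something in between.
  status <- ifelse(
    n_outdated == 0, "up to date",
    ifelse(n_outdated == n_total, "outdated", "partially outdated")
  )
  data.frame(
    target = names(n_total),
    n_outdated = as.integer(n_outdated),
    n_total = as.integer(n_total),
    status = as.vector(status),
    stringsAsFactors = FALSE
  )
}

status <- summarize_status(internal)
print(status)
```

Here analyses would be reported as partially outdated (2 of 4), and only those two rows would need rebuilding.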

wlandau commented 6 years ago

I am having trouble understanding this discussion, possibly because I do not use dplyr very much. Previously, I thought we were using the DSL to expand out a whole workflow plan data frame and then feed it to make() as usual. How long exactly will the expansion be delayed?

krlmlr commented 6 years ago

That's a point I haven't stressed enough. We use the DSL to avoid plan expansion for as long as we possibly can.

wlandau commented 6 years ago

That would certainly gain us efficiency, and it would solve the catch-22 from #77 (group_by() without having the data yet). It's also a hard and delicate problem because we will need to make changes to our internal igraph object (config$graph) while make() is happening. All the current parallel backends require everything to be planned in advance, so I think we will need #227 first. Maybe we could work on the DSL interface and #227 at the same time. Then when #227 is solved, we could remove most of the other backends and delay expansion for as long as possible.

wlandau commented 6 years ago

Thankfully, none of this should affect the cache, which is the most sensitive component of all when it comes to back compatibility.

AlexAxthelm commented 6 years ago

@krlmlr, to confirm: the drake_plan sketch that you've outlined above, will it yield a plan that looks like what we currently see when we call load_basic_example(), or will it give something more along the lines of what we see if we call that now (current version of drake)? I'm worried that there is some weirdness going on with the mutate(result = list(reg(dataset))) which is intuitive, but I don't know how to look for reg aside from searching for column names in dependencies, but then that is assuming that everything is a tibble? I guess I'm back to being a bit unclear on whether the new plans are aimed at creating targets that are tibbles, or creating plans that are tibbles.

This seems like there is potential for a lot of good power here, but I'm concerned about imposing a dplyr point-of-view on an otherwise agnostic tool.

wlandau commented 6 years ago

For the sake of slow deprecation, I think we will get the chance to see how well the two interfaces coexist. Ideally, I would like drake to still be able to easily handle a data frame of text target names and text commands. I believe this format is very reachable for people new to R, and parsing is usually easy.

There is nothing incorrect about wildcard templating, but I think we could eventually offload it to the wildcard package and extend wildcard to handle tibbles of language objects. See #240 and wlandau/wildcard#7.

krlmlr commented 6 years ago

I'd say everything's a tibble, but I'm not sure about details here.

I need to take a closer look at wildcard.

wlandau commented 6 years ago

There's not actually that much to look at; it's a simple idea I took out of remakeGenerator.

wlandau commented 6 years ago

I would be in favor of turning ordinary workflow data frames into tibbles as early on as possible. It really is about time.

Speaking of tibbles, there was a discussion somewhere about fixed column width printing, but I can't seem to find it. Workflow plan commands can get long.

krlmlr commented 6 years ago

Character columns take as much space as they can get in tibble, but embedded newlines are not a problem. Does that answer your question?

AlexAxthelm commented 6 years ago

I think I finally figured out how to express my concerns about all targets being tibbles: What happens when I want to build something which is normally a tibble?

drake_plan(
  foo = mtcars %>% as_tibble() %>% filter(am == 1)
)

Will this try to bring all of the parallelism to bear on each row of the tibble? Is there a way to separate a tibble which is "just a tibble" from one that is "a drake_plan™ tibble"? What happens if I want to use NSE to build my plan as a tibble outside of drake_plan?

krlmlr commented 6 years ago

We might need to invent verbs like drake_mutate() and drake_rowwise() etc. to avoid this confusion. Or we enter "drake" mode with another verb.

violetcereza commented 6 years ago

I am all for this! Dynamic branching ("delayed expansion"?) is very exciting and I recently ran into a problem with my data work that required it.

I think an entirely different set of functions should be used, so that it's clear what is a tibble operation within a target, and which is an operation involving many drake targets. I suppose these functions would only be valid inside a target definition, like n() or everything().


plan_drake(
  small = simulate(48),
  large = simulate(64),

  analyses = expand_target(
    reg(dataset),
    # crossing() is implicit
    reg = list(reg1, reg2),
    dataset = list(small, large)
  ),
  summaries = expand_target(
    # I wasn't sure why you had fun(dataset, result) here
    # Not yet sure how to include that in this proposal
    fun(result),
    fun = list(coefficients, residuals),
    result = analyses
  ),
  # gather_targets evaluates to a tibble
  # with a column for every expansion term
  # used previously I guess? & value column
  winners = gather_targets(summaries) %>%
    group_by(dataset, fun) %>%
    summarize(winner = min(value))
)

Expanding on this idea of special functions within target definitions, I've been imagining a feature where you could specify file targets & file dependencies from target defs instead of messing around with the single-quotes vs double-quotes thing. Or even triggers. EDIT: krlmlr had the same idea in #232 oops

plan_drake(
  imported_data = read_csv(target_filedep("data.csv")),
  report.md = target_file(knit(x)),
  always_build = target_trigger("always", fun(x))
)
krlmlr commented 6 years ago

Need to keep in mind multi-file targets (#283) or directory targets (#12), and their R equivalents (named lists, recursive lists).

rkrug commented 6 years ago

Let me jump into this discussion. Even though I use R regularly, I am still (heavily?) confused by the whole tidyverse stuff and the verbs you are discussing (maybe because I don't use the tidyverse daily?).

I personally think that the logic of the drake plan, and the verbs being discussed here, should be easily understandable by even non-tidyverse users. The tidyverse verbs seem to be great, but each time I have to use any of its functions, I have to search for examples to understand them.

In a nutshell: I don't think a fluency in the tidyverse logic should be a prerequisite to use these verbs discussed here.

I would opt to stick to basic R logic as far as possible when initially building and using these verbs. This would open these features to many more users, and one could still add the tidyverse logic afterwards, probably as an additional package. One could even define an API for adding user-defined verbs, with different loadable packages containing these definitions. Thinking about it, that would be the most flexible approach to solving this problem.

wlandau commented 6 years ago

From @rkrug via #298: we should think about a verb to chain targets together sequentially.

wlandau commented 6 years ago

Yes, the ideas come from the tidyverse, but no prior tidyverse knowledge will be required. As I understand it, the verbs for creating plans will stand on their own, and documentation in drake will spell out what they mean.

rkrug commented 6 years ago

OK - that would be very useful.

violetcereza commented 6 years ago

Hi all --

I've been brainstorming an alternative DSL recently and I have a first draft here: https://github.com/dapperjapper/drake/wiki/DSL-Proposal

It takes from tidyr's nest() and unnest() concepts, and the prototype interpreter I created actually uses unnest().

Basically, dynamic branching / delayed expansion is made intuitive by turning all expansion vectors into targets themselves. These targets can be "iterated" over by unnesting them with expand_targets(). Here is the most simple version of this pattern:

plan_df(
  iterator = c(1, 4, 5)
) %>%
  expand_targets(iterator) %>%
  add_to_plan(
    squared = iterator*iterator
  )

This yields three different versions of the target squared, one for each value of iterator. If multiplication were extremely computationally expensive, each squared could be computed as a separate job via any drake parallelism.

Gathering targets is equivalent to a nest() operation (called collapse_targets here):

plan_df(
  iterator = c(1, 4, 5)
) %>%
  expand_targets(iterator) %>%
  add_to_plan(
    squared = iterator*iterator
  ) %>%
  collapse_targets(iterator) %>%
  add_to_plan(
    sum_squares = sum(unlist(squared))
  )

I believe this to be a little bit easier to read and reason about compared to @krlmlr's (very inspiring) meta-make vision. Implementation will be a challenge... but do ppl think this is easier to understand?
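
For readers less familiar with nest()/unnest(), here is a plain-R sketch of the values the expand/collapse plan above describes. It uses only base R and is an illustration of the intended semantics as I read them, not the prototype interpreter from the proposal.

```r
iterator <- c(1, 4, 5)

# "expand_targets(iterator)": one independent sub-target per element.
# Each element of this list could be built as a separate job.
squared <- lapply(iterator, function(i) i * i)

# "collapse_targets(iterator)": gather the sub-targets back into one list,
# so a downstream target can consume all of them at once.
sum_squares <- sum(unlist(squared))
print(sum_squares)
```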

wlandau commented 6 years ago

Thanks, @dapperjapper! Sorry for my slow response. I have been waiting for a large chunk of quality time to seriously think about this. I will do my best to post feedback this week.

wlandau commented 6 years ago

I looked at your proposal and tried out the code. Although I still have not wrapped my head around the details, I feel like I understand the concept a bit better. It seems like we group targets into lists, define dependency relationships among the groupings, and then expand things out at the last minute. The working code really helped. I like the analogy with nest() and unnest().

I find the data structures in this proposal somewhat unfamiliar in this use case. What is your current thinking on those wide-form tibbles?

> plan
# A tibble: 1 x 7
  small      large      dataset    reg        summary_fun analysis   summaries 
  <list>     <list>     <list>     <list>     <list>      <list>     <list>    
1 <language> <language> <language> <language> <language>  <language> <language>

It seems like the intent is to generate rows for the expansion.

> collect_plan(expanded_plan)
# A tibble: 8 x 7
  small      large      analysis   summaries  dataset  reg      summary_fun
* <list>     <list>     <list>     <list>     <list>   <list>   <list>     
1 <language> <language> <language> <language> <symbol> <symbol> <symbol>   
2 <language> <language> <language> <language> <symbol> <symbol> <symbol>   
3 <language> <language> <language> <language> <symbol> <symbol> <symbol>   
4 <language> <language> <language> <language> <symbol> <symbol> <symbol>   
5 <language> <language> <language> <language> <symbol> <symbol> <symbol>   
6 <language> <language> <language> <language> <symbol> <symbol> <symbol>   
7 <language> <language> <language> <language> <symbol> <symbol> <symbol>   
8 <language> <language> <language> <language> <symbol> <symbol> <symbol>  

How complicated can expansions be under this framework? If each expansion needs a new dimension, would deeper nestings require 3 or more dimensions? Personally, I find the long-form convention of a target column and a command column easy and tidy to work with. Do you think we can still use that convention? Maybe also taking advantage of nest() and unnest() themselves somehow?

violetcereza commented 6 years ago

Thank you for reading! I agree that the long-form convention of drake plans is easier to work with.

The example code uses the wide format just because I thought it was easier to think about for now. In the wide-form, the x-axis is target names while the y-axis is permutations of targets.

I should have included a screenshot of View(expanded_plan) because printing language objects is never pretty for some reason:

[Image: screenshot of View(expanded_plan)]

Each row represents a separate little "universe" -- a permutation -- where targets (named by the column name) have meaning defined by the value in that same row.

The final implementation can store this information differently, but I thought this was easy to visualize as a parallel to unnest(). (Visualizing collapse_targets() this way has proved more difficult.)

I would also argue that storage structure is an independent decision to make from the structure that the user sees. I envision us using S3 methods to override print() and facilitate easier understanding of the various dimensions involved in their plan, without forcing the user to bear witness to a complex internal structure.

Additionally, dynamic branching / delayed expansion means that the complete plan won't always be viewable pre-making. If summary_fun = complex_operation_that_returns_list(), the expanded version will look like complex_operation_that_returns_list()[[1]] instead of coefficients. But we don't know the length of the list, so we can't even create a complete plan tibble...
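
A small base-R sketch of that situation. The body of complex_operation_that_returns_list() is a made-up stand-in; the point is only that the branch commands can be written out after the step has run, not before.

```r
# Hypothetical stand-in for an expensive step whose output length
# is not known until it runs.
complex_operation_that_returns_list <- function() {
  list(coefficients, residuals)
}

# Before make(): only a placeholder command can be written down.
placeholder <- quote(complex_operation_that_returns_list()[[i]])

# After the step has run: the branches can finally be enumerated.
result <- complex_operation_that_returns_list()
branches <- lapply(seq_along(result), function(i) {
  eval(call("substitute", placeholder, list(i = as.numeric(i))))
})
commands <- vapply(branches, deparse, character(1))
print(commands)
```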

I'm unsure what you mean by "If each expansion needs a new dimension, would deeper nestings require 3 or more dimensions?". In the image above, the plan is unnested on three "dimensions" -- three targets that can vary: dataset, reg, and summary_fun. Multiple unnested dimensions just increase the number of rows, while constant targets like small are just repeated. So the "dimensions" of the tibble (a two dimensional object) don't directly correspond to the idea of "dimensions" in the target-space.
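
As a quick illustration of rows versus "dimensions", here is a base-R sketch using expand.grid(). The column contents are simplified to strings and only a subset of the plan's columns is shown, so treat it as an approximation of the wide-form tibble rather than the prototype's actual output.

```r
# Three varying targets, two values each: 2 * 2 * 2 = 8 permutations.
permutations <- expand.grid(
  dataset = c("small", "large"),
  reg = c("reg1", "reg2"),
  summary_fun = c("coefficients", "residuals"),
  stringsAsFactors = FALSE
)

# Constant targets are simply repeated in every row.
permutations$small <- "simulate(48)"
permutations$large <- "simulate(64)"

print(dim(permutations))
```

Adding a fourth varying target would multiply the number of rows again, but the table itself stays two-dimensional.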

Forgive me for being a little loose with vocabulary -- I don't have a formal compsci training and I'm sure there is some established conceptual framework for this problem that I'm not using.

wlandau commented 5 years ago

Update: I started a GitHub project for this.

I have not worked on the DSL at all this past year, but I do care a lot about it.

From my end, the size and complexity of drake's code base have been major roadblocks. Downsizing is one of the primary development goals (with its own GitHub project). Because of #648 and issues like it, I think we are already well past apogee.

In the initial stages of the DSL, because of the sheer scope of @krlmlr's idea, I would prefer to treat the API as totally separate from how targets are declared and built. It is relatively easy to play around with how drake_plan() parses user input, but delayed specification presents a very different and very formidable set of challenges on the implementation side.

As drake goes through its downsizing phase, maybe we could work on the API piece in an external package first. To start off, we could make it use the proposed dplyr syntax to create the fully-expanded plans that drake currently requires. Once it matures enough, we could either migrate it into core drake or leave it as its own third-party API extension.

wlandau commented 5 years ago

@krlmlr, @AlexAxthelm, @dapperjapper, and @rkrug:

For https://github.com/ropensci/drake/issues/233#issue-294175780, I think we can use something that

  1. Avoids creating extra rows for datasets, analyses, and summary_funs.
  2. Does not require any new special API functions.
  3. Requires minimal to no change to the drake_plan() function itself.
  4. Captures all the information we need.
  5. drake already understands.
library(drake)
drake_plan(
  small = target(simulate(48), data = small),
  large = target(simulate(64), data = large),
  reg = target(
    reg_fun(data),
    do = crossing,
    by = list(reg_fun, data),
    reg_fun = list(reg1, reg2)
  ),
  summary = target(
    sum_fun(data, reg),
    do = crossing,
    by = list(sum_fun, reg),
    sum_fun = list(coefficients, residuals)
  ),
  winners = target(
    min(summary),
    do = summarize,
    by = list(data, sum_fun)
  )
)
#> # A tibble: 5 x 7
#>   target  command      data  do      by         reg_fun    sum_fun         
#>   <chr>   <chr>        <chr> <chr>   <chr>      <chr>      <chr>           
#> 1 small   simulate(48) small <NA>    <NA>       <NA>       <NA>            
#> 2 large   simulate(64) large <NA>    <NA>       <NA>       <NA>            
#> 3 reg     reg_fun(dat… <NA>  crossi… list(reg_… list(reg1… <NA>            
#> 4 summary sum_fun(dat… <NA>  crossi… list(sum_… <NA>       list(coefficien…
#> 5 winners min(summary) <NA>  summar… list(data… <NA>       <NA>

Created on 2019-01-13 by the reprex package (v0.2.1)

As long as I am thinking out loud:

...though dsl might not be a good choice for an argument name.

wlandau commented 5 years ago

A big thanks to @krlmlr for the target() function, which opened up more doors than I realized at the time.

wlandau commented 5 years ago

I will keep iterating on this. We might find a sweet spot with the right combination of language and custom columns. An improvement:

library(drake)
drake_plan(
  small = simulate(48),
  large = simulate(64),
  reg = target(
    reg_fun(data),
    transform = cross(reg_fun = c(reg1, reg2), data = c(small, large))
  ),
  summary = target(
    sum_fun(data, reg),
    transform = cross(sum_fun = c(coefficients, residuals), reg)
  ),
  winners = target(
    min(summary),
    transform = summarize(data, sum_fun)
  )
)
#> # A tibble: 5 x 3
#>   target  command           transform                                      
#>   <chr>   <chr>             <chr>                                          
#> 1 small   simulate(48)      <NA>                                           
#> 2 large   simulate(64)      <NA>                                           
#> 3 reg     reg_fun(data)     cross(reg_fun = c(reg1, reg2), data = c(small,…
#> 4 summary sum_fun(data, re… cross(sum_fun = c(coefficients, residuals), re…
#> 5 winners min(summary)      summarize(data, sum_fun)

Created on 2019-01-14 by the reprex package (v0.2.1)

wlandau commented 5 years ago

See #674 for an experimental API inspired by the proposed DSL. The implementation is lightweight, and because it relies on a custom "transform" column in the plan, it does not interfere with any other functionality (internals or API).

wlandau commented 5 years ago

New capability: define custom groupings with a group field in target():

library(drake)
plan <- drake_plan(
  small = simulate(48),
  large = simulate(64),
  reg1 = target(
    reg_fun(data),
    transform = cross(data = c(small, large)),
    group = reg
  ),
  reg2 = target(
    reg_fun(data),
    transform = cross(data = c(small, large)),
    group = reg
  ),
  winners = target(
    min(reg),
    transform = summarize(data),
    a = 1
  )
)

plan
#> # A tibble: 8 x 3
#>   target        command                                                   a
#>   <chr>         <chr>                                                 <dbl>
#> 1 small         simulate(48)                                             NA
#> 2 large         simulate(64)                                             NA
#> 3 reg1_small    reg_fun(small)                                           NA
#> 4 reg1_large    reg_fun(large)                                           NA
#> 5 reg2_large    reg1_large_fun(large)                                    NA
#> 6 reg2_small    reg1_small_fun(small)                                    NA
#> 7 winners_large min(reg1_large = reg1_large, reg2_large = reg2_large)     1
#> 8 winners_small min(reg1_small = reg1_small, reg2_small = reg2_small)     1

drake_plan_source(plan)
#> drake_plan(
#>   small = simulate(48),
#>   large = simulate(64),
#>   reg1_small = reg_fun(small),
#>   reg1_large = reg_fun(large),
#>   reg2_large = reg1_large_fun(large),
#>   reg2_small = reg1_small_fun(small),
#>   winners_large = target(
#>     command = min(reg1_large = reg1_large, reg2_large = reg2_large),
#>     a = 1
#>   ),
#>   winners_small = target(
#>     command = min(reg1_small = reg1_small, reg2_small = reg2_small),
#>     a = 1
#>   )
#> )

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-01-16 by the reprex package (v0.2.1)

bpbond commented 5 years ago

@wlandau What version is this work targeted for?

wlandau commented 5 years ago

The very next release: 7.0.0

lorenzwalthert commented 5 years ago

My two cents to this (FYI): So do you for now have transform and grouping? Is there likely other functionality in the future? I am not sure if it does not overload the target argument. Also, I am not sure how these are verbs and domain specific (just mentioning this because that's how I understood the initial idea). Just from looking at the syntax, the idea of separating the plan creation with a wild card and the "folding it up" as it was before was simpler to digest for me.

wlandau commented 5 years ago

> My two cents to this (FYI): So do you for now have transform and grouping?

Yes. It is experimental (as the documentation now indicates) but behavior seems correct so far.

> Is there likely other functionality in the future?

Dynamic branching is high on the list for long-term. But for this API specifically, hopefully we will not need more features. I would prefer to keep it simple, and it already seems to cover the vast majority of the use cases for the map/reduce functions and wildcards. But I could be convinced otherwise.

> I am not sure if it does not overload the target argument.

The target() function was designed to allow any custom column to be added to the plan. Behind the scenes, transformation and grouping add and manipulate custom columns, so I believe this falls within scope.

> Also, I am not sure how these are verbs and domain specific (just mentioning this because that's how I understood the initial idea).

Yes, I agree. That does not bother me so much. At the interface level specifically, this still solves the same problem as the DSL.

> Just from looking at the syntax, the idea of separating the plan creation with a wild card and the "folding it up" as it was before was simpler to digest for me.

I plan to keep the wildcard functions around for a long time.

My personal experience with this new interface is actually more positive. It takes effort and bookkeeping to wrangle all those subplans and wildcards. I find it much easier to use transformations and grouping in a single call to drake_plan(), and I strongly suspect that it will be a breath of fresh air to users.

lorenzwalthert commented 5 years ago

Thanks @wlandau for the detailed answer, that makes sense.

wlandau commented 5 years ago

You are welcome.

After talking with @krlmlr in person yesterday at RStudio conf, I have decided to think of this approach as the proper DSL. We can open a different issue for dynamic branching.

Major changes needed to consider this issue solved:

  1. I need to read Hadley's chapter on DSLs and refactor the internals of managing and parsing inputs. For example, there should be S3 methods to parse the transform and command objects, and the return values should be detailed lists of the information content.
  2. Use substitute() instead of text replacement for editing commands.
command <- quote(reg_fun(x, "y", x, k))
eval(call("substitute", command, list(x = "str", k = quote(sym))))
#> reg_fun("str", "y", "str", sym)

Created on 2019-01-18 by the reprex package (v0.2.1)
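
To illustrate why item 2 matters, here is a minimal base-R sketch contrasting plain text replacement with symbol substitution. The command reg_fun(reg) and the replacement name reg1_large are chosen for illustration only; this is not drake's internal code.

```r
# Text replacement: every occurrence of the substring "reg" is rewritten,
# including the one inside the unrelated function name reg_fun.
command_text <- "reg_fun(reg)"
by_text <- gsub("reg", "reg1_large", command_text, fixed = TRUE)
print(by_text)

# Symbol substitution: only the symbol `reg` is replaced; `reg_fun` is a
# different symbol and is left alone.
command_expr <- quote(reg_fun(reg))
by_symbol <- eval(
  call("substitute", command_expr, list(reg = quote(reg1_large)))
)
by_symbol_text <- deparse(by_symbol)
print(by_symbol_text)
```

With text replacement the function name itself is mangled into reg1_large_fun, whereas substitute() operates on the parsed expression and touches only whole symbols.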

wlandau commented 5 years ago

Also, we should add a map() transformation because of https://github.com/ropensci/drake/issues/235#issue-294321106.

wlandau commented 5 years ago

By the way, tidy evaluation works in the DSL. You can generate super large plans this way. A taste:

sms <- rlang::syms(letters)
drake::drake_plan(x = target(f(char), transform = map(char = !!sms)))
#> # A tibble: 26 x 2
#>    target command
#>    <chr>  <chr>  
#>  1 x_a    f(a)   
#>  2 x_b    f(b)   
#>  3 x_c    f(c)   
#>  4 x_d    f(d)   
#>  5 x_e    f(e)   
#>  6 x_f    f(f)   
#>  7 x_g    f(g)   
#>  8 x_h    f(h)   
#>  9 x_i    f(i)   
#> 10 x_j    f(j)   
#> # … with 16 more rows

Created on 2019-01-20 by the reprex package (v0.2.1)

lorenzwalthert commented 5 years ago

Cool. Should functions like map() be prefixed with drake_ in order to avoid namespace conflicts with purrr?

wlandau commented 5 years ago

Fortunately, we avoid namespace conflicts entirely because map() is in 'transform', not the command. The DSL code is analyzed statically and not executed in the usual sense.
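
A minimal sketch of what "analyzed statically" can mean in base R: the transform is captured as an unevaluated call and taken apart, so no function named map() ever has to exist or run. The parsing steps and the x_ naming prefix below are assumptions for illustration and do not reproduce drake's actual parser.

```r
# Capture the transform and the command without evaluating them.
transform <- quote(map(char = c(a, b, c)))
command <- quote(f(char))

# The "verb" is just a symbol at the head of the call; it is never called.
verb <- as.character(transform[[1]])

# The grouping values are the unevaluated symbols a, b, c.
values <- as.list(as.list(transform)$char)[-1]

# Build one command per value by substituting the symbol `char`.
expanded <- lapply(values, function(v) {
  eval(call("substitute", command, list(char = v)))
})

plan <- data.frame(
  target = paste("x", vapply(values, as.character, character(1)), sep = "_"),
  command = vapply(expanded, deparse, character(1)),
  stringsAsFactors = FALSE
)
print(plan)
```

Because map here is only ever inspected as a symbol, it cannot clash with purrr::map() or any other function of the same name.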

lorenzwalthert commented 5 years ago

Ok, I did not know that. I think the average user also may not know it. If it is not an exported function, I guess there is also no exported documentation, i.e., ?map won't be helpful? I think it's similar to vars() in dplyr, which is also used exclusively within a dplyr call. However, vars() is exported and documented.

wlandau commented 5 years ago

vars() is a new one for me. Looks like it sets a good precedent.

In the specific case of drake, I am resistant to exporting map(), cross(), and reduce(). These symbols are not actually defined in the code base, and drake's API is already enormous.

Hopefully we can make the documentation friendly and thorough. I just pushed an update to the ?drake_plan() help file and added a new section in the manual.