ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Expand functionality for formatting target names using `.id` (in static branching) #1220

Closed: ha0ye closed this issue 4 years ago

ha0ye commented 4 years ago

Proposal

In my use case, I am constructing large plans using cross() to get all {analysis} x {data} combinations. Because some dataset and analysis names already contain underscores, the default target names are difficult to parse (e.g., with a regex) to recover the underlying {analysis} or {data} name.
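To illustrate the ambiguity, here is a minimal base-R sketch. The target name and its components (`fit_model`, `county_data`) are made-up examples, not names from an actual plan. Once the prefix is removed, every underscore is a possible boundary between the {analysis} and {data} parts:

```r
# Hypothetical target name of the form <prefix>_<analysis>_<data>,
# where both the analysis and the dataset names contain underscores.
target <- "analysis_fit_model_county_data"

# Drop the "analysis" prefix and split the rest on underscores.
parts <- strsplit(target, "_", fixed = TRUE)[[1]][-1]

# Every underscore is a possible boundary between the two names.
candidate_splits <- lapply(seq_len(length(parts) - 1), function(i) {
  c(
    fun = paste(parts[seq_len(i)], collapse = "_"),
    data = paste(parts[(i + 1):length(parts)], collapse = "_")
  )
})

length(candidate_splits)
```

The name alone admits several equally plausible readings, so a regex cannot pick the right one without also knowing the set of valid {analysis} or {data} names.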

Possible Solution 1

Expand the syntax of the .id argument so that it accepts a template string, for example in the style of the glue package:

p <- drake_plan(
    analysis = target(
        fun(data),
        transform = cross(fun = c(str, names),
                          data = c(mtcars, iris), 
                          .id = "{fun} %on% {data}")
    )
)

setequal(p$target, c("str %on% mtcars", 
                      "names %on% mtcars", 
                      "str %on% iris", 
                      "names %on% iris"))
# TRUE (the desired result under the proposed syntax)
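
For reference, the names requested above can be produced outside of drake with plain base R. This sketch only shows the intended naming scheme; it says nothing about how drake would implement such a template, and `make_ids` is a hypothetical helper introduced here for illustration.

```r
# Hypothetical helper: fill a "{fun} ... {data}" template for every
# combination of the two grouping variables.
make_ids <- function(template, fun, data) {
  grid <- expand.grid(fun = fun, data = data, stringsAsFactors = FALSE)
  ids <- character(nrow(grid))
  for (i in seq_len(nrow(grid))) {
    id <- sub("{fun}", grid$fun[i], template, fixed = TRUE)
    ids[i] <- sub("{data}", grid$data[i], id, fixed = TRUE)
  }
  ids
}

ids <- make_ids("{fun} %on% {data}", c("str", "names"), c("mtcars", "iris"))

# A separator that cannot occur in either name makes the split unambiguous.
pieces <- strsplit(ids, " %on% ", fixed = TRUE)
```

Each element of `pieces` is a two-element character vector holding the {fun} name followed by the {data} name.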

Current work-around

My current work-around is to specify .id = data and then pass .id_chr to the function that collects the results:

drake_plan(
    analysis = target(
        fun(data),
        transform = cross(fun = c(str, names),
                          data = c(mtcars, iris), 
                          .id = data)
    ),
    results = target(collect(list(analysis), .target_name = .id_chr),
                     transform = combine(analysis, .by = fun))
)

Here, collect() extracts the {data} names from its first argument and the {analysis} name from its second argument, .target_name (supplied via .id_chr).

Possible Solution 2

This suggests an alternative solution, which is to expand the plan information that is accessible by commands when building targets. For example, if it were possible to access the (hidden) columns fun and data that show up when the plan is constructed with trace = TRUE, that would also facilitate making "metadata" from the plan visible to commands.

brendanf commented 4 years ago

You can access the value of fun (or any other trace column) from within your target command if you include it in your transform; in this case, combine(analysis, fun, .by = fun):

library(drake)
plan <- drake_plan(
  analysis = target(
    fun(data),
    transform = cross(fun = c(str, names),
                      data = c(mtcars, iris), 
                      .id = data)
  ),
  results = target(list(analysis, fun = fun),
                   transform = combine(analysis, fun, .by = fun))
)
make(plan)
#> target analysis_iris
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> target analysis_mtcars
#> 'data.frame':    32 obs. of  11 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp: num  160 160 108 258 360 ...
#>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec: num  16.5 17 18.6 19.4 17 ...
#>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
#> target analysis_iris_2
#> target analysis_mtcars_2
#> target results_str
#> target results_names

readd(results_names)
#> [[1]]
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"
#> 
#> [[2]]
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
#> 
#> $fun
#> function (x)  .Primitive("names")
readd(results_str)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> $fun
#> function (object, ...) 
#> UseMethod("str")
#> <bytecode: 0x55716920ca78>
#> <environment: namespace:utils>

Created on 2020-03-19 by the reprex package (v0.3.0)

Because your .by variable is a function, the stored value is the function object itself, which isn't very legible. You can wrap it in quote() to store its name instead:

library(drake)
plan <- drake_plan(
  analysis = target(
    fun(data),
    transform = cross(fun = c(str, names),
                      data = c(mtcars, iris), 
                      .id = data)
  ),
  results = target(list(analysis, fun = quote(fun)),
                   transform = combine(analysis, fun, .by = fun))
)
make(plan)
#> In drake, consider r_make() instead of make(). r_make() runs make() in a fresh R session for enhanced robustness and reproducibility.
#> target analysis_iris
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> target analysis_mtcars
#> 'data.frame':    32 obs. of  11 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp: num  160 160 108 258 360 ...
#>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec: num  16.5 17 18.6 19.4 17 ...
#>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
#> target analysis_iris_2
#> target analysis_mtcars_2
#> target results_str
#> target results_names

readd(results_names)
#> [[1]]
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"
#> 
#> [[2]]
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
#> 
#> $fun
#> names
readd(results_str)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> $fun
#> str

Created on 2020-03-19 by the reprex package (v0.3.0)
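
The difference between the two runs above comes from what ends up stored in the list: a function object in the first, a symbol in the second. A minimal base-R sketch, independent of drake:

```r
# Storing the function itself: printing shows the function definition.
by_value <- list(fun = names)

# Storing the quoted symbol: printing shows just `names`.
by_symbol <- list(fun = quote(names))

is.function(by_value$fun)
is.symbol(by_symbol$fun)

# A symbol converts cleanly to a character label.
label <- as.character(by_symbol$fun)
```

The character label is usually the most convenient form for downstream grouping or filtering.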

wlandau commented 4 years ago

I agree with @brendanf that this sort of tracking is best handled with prospective labeling in custom commands and functions. deparse() and quote() work really well together here.

library(drake)
library(tibble)

plan <- drake_plan(
  analysis = target(
    tibble(
      value = fun(data),
      fun = deparse(quote(fun)),
      data = deparse(quote(data))
    ),
    transform = cross(
      fun = c(nrow, ncol),
      data = c(mtcars, iris)
    )
  )
)

drake_plan_source(plan)
#> drake_plan(
#>   analysis_nrow_mtcars = tibble(value = nrow(mtcars), fun = deparse(quote(nrow)), data = deparse(quote(mtcars))),
#>   analysis_ncol_mtcars = tibble(value = ncol(mtcars), fun = deparse(quote(ncol)), data = deparse(quote(mtcars))),
#>   analysis_nrow_iris = tibble(value = nrow(iris), fun = deparse(quote(nrow)), data = deparse(quote(iris))),
#>   analysis_ncol_iris = tibble(value = ncol(iris), fun = deparse(quote(ncol)), data = deparse(quote(iris)))
#> )

make(plan)
#> ▶ target analysis_ncol_mtcars
#> ▶ target analysis_nrow_iris
#> ▶ target analysis_nrow_mtcars
#> ▶ target analysis_ncol_iris

readd(analysis_ncol_mtcars)
#> # A tibble: 1 x 3
#>   value fun   data  
#>   <int> <chr> <chr> 
#> 1    11 ncol  mtcars

Created on 2020-03-20 by the reprex package (v0.3.0)
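
The deparse(quote(fun)) idiom works because the grouping-variable symbols in the command are replaced before the command runs, as the drake_plan_source() output above shows. The sketch below imitates that replacement with base R's substitute(); it is an illustration under that assumption, not drake's internal code, and it uses a plain list instead of a tibble to stay dependency-free.

```r
# The command template, with `fun` and `data` as placeholders.
command <- quote(
  list(
    value = fun(data),
    fun = deparse(quote(fun)),
    data = deparse(quote(data))
  )
)

# Replace the placeholder symbols, including those inside quote().
# Argument names such as `fun =` are tags, not symbols, so they stay.
expanded <- do.call(
  substitute,
  list(command, list(fun = as.name("ncol"), data = as.name("mtcars")))
)

# `expanded` is now the call
# list(value = ncol(mtcars), fun = deparse(quote(ncol)), data = deparse(quote(mtcars)))
result <- eval(expanded)
```

After evaluation, `result$fun` and `result$data` hold the names as character strings alongside the computed value, which is the "prospective labeling" described above.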