ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

What is best practice for using drake with params in an .Rmd? #1348

Closed charlesbaillie closed 3 years ago

charlesbaillie commented 3 years ago

Background

If I wanted to produce a number of reports from one template, without drake I would do something like:

# Create a vector of the elements to iterate over
species <- c("Adelie", "Chinstrap", "Gentoo")

# Render the template to HTML once per parameter value
purrr::map(
  .x = species,  # vector of param values
  .f = ~ rmarkdown::render(
    input = "doc/template.Rmd",  # RMarkdown filepath
    params = list(name = .x),  # iterated parameter value
    output_file = paste0("doc/", .x, ".html")  # iterated output path
  )
)

Using static branching in drake I can get something similar, where each report target is a separate row in the plan, but it feels a little hacky making the paths first (since file_out() and friends only take strings), and in the plan in the below example the target 'files' isn't connected to anything. Also, is it correct to have each report as a separate target, or should it just be one target since all reports will need to be updated when the data is updated anyway? In reality I'm going to have >150 rmarkdown reports and potentially other outputs like slide decks, which is why in doc/ I'd like to keep just the 'templates', and then report/ will have the rendered reports or slides.

Example

This is my hack using static branching:

library(drake)
library(dplyr)
library(rmarkdown)
library(palmerpenguins)

plan <- drake_plan(
  penguin_data = penguins %>%
    group_by(species) %>%
    summarise_if(is.numeric, list(min, max)) %>%
    mutate_at(vars(species), as.character),

  files = data.frame(
    species = penguin_data$species,
    path = paste0("report/report_", penguin_data$species, ".html")
  ),

  report = target(
    render(
      input = knitr_in(input),
      output_file = file_out(output),
      params = list(species = p)
    ),
    # Note: `!!` is evaluated when drake_plan() is called, so `files` must
    # already exist in the calling environment, not just as a target.
    transform = map(
      input = "doc/template.Rmd",
      output = !!files$path,
      p = !!files$species,
      .names = paste0("report_", !!files$species)
    )
  )
)

I have a drake project set up like this:

_drake.R
report.Rmd
packages.R
R/
├── functions.R
└── plan.R
doc/
└── template.Rmd
report/
├── report_gentoo.Rmd
├── report_chinstrap.Rmd
└── report_adelie.Rmd
data/
└── file1.csv


wlandau commented 3 years ago

Using static branching in drake I can get something similar, where each report target is a separate row in the plan, but it feels a little hacky making the paths first (since file_out() and friends only take strings), and in the plan in the below example the target 'files' isn't connected to anything.

It looks like you are trying to define new targets based on the values of previous targets, which means we don't know what the report files will be until the upstream targets run. In that case, I would go with dynamic branching. Dynamic files (format = "file") are the appropriate way to track the outputs because, counterintuitively, file_out() is incompatible with dynamic branching. knitr_in() should still work, as long as none of the dependencies in the reports are dynamic sub-targets. Sketch:

plan <- drake_plan(
  penguin_data = penguins %>% group_by(species) %>% 
    summarise_if(is.numeric,list(min, max)) %>% mutate_at(vars(species),as.character),
  files = data.frame(
    species = penguin_data$species, 
    path = paste0("report/report_", penguin_data$species, ".html")
  ),
  report = target(
    {
      render(
        input = knitr_in("doc/template.Rmd"),
        output_file = files$path,
        params = list(species = files$species)
      )
      # Just underscoring here that the output path should be returned for format = "file".
      # rmarkdown::render() does that anyway.
      files$path
    },
    format = "file", # Track the returned output file path.
    dynamic = map(files) # Maps over the rows and makes one sub-target per row.
  )
)
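To make the row-wise mapping concrete, here is a rough base R approximation of what each dynamic sub-target gets to work with. This is only an illustration: drake does the slicing internally, the hard-coded species vector stands in for the upstream penguin_data target, and the `slices` list is a hypothetical name that does not exist in drake.

```r
# Stand-in for the `files` target; species are hard-coded for illustration.
species <- c("Adelie", "Chinstrap", "Gentoo")
files <- data.frame(
  species = species,
  path = file.path("report", paste0("report_", species, ".html")),
  stringsAsFactors = FALSE
)

# Approximation of dynamic = map(files): one one-row data frame per sub-target.
slices <- lapply(seq_len(nrow(files)), function(i) files[i, , drop = FALSE])

# Inside the command, files$path and files$species then refer to a single
# value, so each sub-target renders exactly one report.
for (slice in slices) {
  cat(slice$species, "->", slice$path, "\n")
}
```

Because each sub-target only sees its own row, adding a species upstream adds one new sub-target without changing the commands of the existing ones.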

Also, is it correct to have each report as a separate target, or should it just be one target since all reports will need to be updated when the data is updated anyway? In reality I'm going to have >150 rmarkdown reports and potentially other outputs like slide decks, which is why in doc/ I'd like to keep just the 'templates', and then report/ will have the rendered reports or slides.

Your choice. It depends on how long the computation takes. If all 150 reports are quick and tend to invalidate together (at any given time, either all are outdated or none are), then you might as well put them all in a single target.
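If you go the single-target route, a minimal sketch could look like the following. The helper render_all_reports() and its render_fn argument are hypothetical names made up for this example, and splitting each path into output_file and output_dir is just one way to control where rmarkdown::render() writes. The plan is only built if drake is installed, and nothing is rendered until make(plan) is called.

```r
# Hypothetical helper: render every report in one go and return the paths.
# `render_fn` defaults to rmarkdown::render and can be swapped out for testing.
render_all_reports <- function(files, template, render_fn = rmarkdown::render) {
  for (i in seq_len(nrow(files))) {
    render_fn(
      input = template,
      output_file = basename(files$path[i]),
      output_dir = dirname(files$path[i]),
      params = list(species = files$species[i]),
      quiet = TRUE
    )
  }
  files$path  # Character vector of output paths for format = "file".
}

# Plan sketch: one target covers all reports, so they rebuild together.
if (requireNamespace("drake", quietly = TRUE)) {
  plan <- drake::drake_plan(
    files = data.frame(
      species = c("Adelie", "Chinstrap", "Gentoo"),
      path = paste0("report/report_", c("Adelie", "Chinstrap", "Gentoo"), ".html"),
      stringsAsFactors = FALSE
    ),
    reports = target(
      render_all_reports(files, knitr_in("doc/template.Rmd")),
      format = "file"  # Track every returned output path.
    )
  )
}
```

The trade-off is granularity: with one target, a change to the template or the data re-renders everything, which is fine when the reports are cheap and always go stale together.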