trouble combining drake and bookdown to generate parameterized reports

robitalec commented 3 years ago

Prework

[X] Read and agree to the code of conduct and contributing guidelines.
[X] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
[X] Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
- [X] Runnable: post enough R code and data so any onlooker can create the error on their own computer.
- [X] Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
- [X] Readable: format your code according to the tidyverse style guide.

Description

I have been working on using drake to generate parameterized reports in rmarkdown/bookdown. Since bookdown is the more recent way of combining multiple rmarkdown files and is a package I'm familiar with, I decided to focus on that route in this issue. I have also looked at rmarkdown::render_site, at setting YAML parameters for each report, and at combining e.g. PDF files after rendering. render_site requires manually updating the navbar, so that is an extra step that I'd rather avoid doing programmatically.

My goal is to use dynamic branching with a directory of input files to generate a chapter in a bookdown book for each dataset. Each chapter would have the same layout; the only difference would be the dataset shown.

A note: I understand this is sort of outside the scope of drake, and I'm only looking for help in the drake relevant parts. Let me know if it's too much though. It feels like a bit of a complicated process, because bookdown doesn't have a native way of using a template with a list of different parameters to generate chapters.

It took me a bit of time to get sorted in terms of what the expectations are for bookdown and how that relates to drake. bookdown expects: an index.Rmd with YAML, separate .Rmd files for each chapter with a first level header (and without YAML) and a _bookdown.yml file with site settings. So I have been working on this, trying to find the least hacky way of doing it and I'm not really satisfied - but it works for now. I just use a _template.Rmd file and copy it + gsub the datasets in for each target of the dynamic branch (a unique dataset).

The challenges I'm hitting are 1) declaring the dependency between the template file and this gsub step and 2) connecting this step to bookdown::render_book.

Here's my visgraph. Note that the bookdown::render_book step fails because the upstream (though not declared in the plan) steps haven't run.

My questions are:

- How can I declare a dependency on the template Rmarkdown file in a dynamic branch?
- How can I point to upstream dependencies of Rmarkdown files when they aren't detected automatically?

Alternatively, have you seen anything that solves this problem in a different way? Thank you!

Reproducible example

_drake.R

source("R/functions.R")
source("R/plan.R")

drake_config(plan)

R/plan.R

library(drake)

files <- dir('data', pattern = '.csv', full.names = TRUE)

plan <- drake_plan(
  reads = target(read_data(files), dynamic = map(files)),
  means = target(mean_data(reads, id_chr()), dynamic = map(reads)),
  fills = target(fill_placeholders('_template.Rmd', data = means),
                 dynamic = map(means)),
  render = bookdown::render_book(knitr_in('index.Rmd'))
)

R/functions.R

read_data <- function(data) {
  DT <- read.csv(data)
  DT$path <- data
  DT
}

mean_data <- function(data, id) {
  data.frame(meanmpg = mean(data$mpg), id = id, path = data$path[[1]])
}

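fill_placeholders <- function(template, data) {
  # id looks like "means_236020fc"; keep the hash after the underscore
  key <- strsplit(data$id, '_')[[1]][[2]]
  lns <- readLines(template)
  # fixed = TRUE matches the leading dot literally rather than as a regex wildcard
  lns <- gsub('.id', data$id, lns, fixed = TRUE)
  lns <- gsub('.title', data$path, lns, fixed = TRUE)
  writeLines(lns, paste0('md/', key, '.Rmd'))
}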

_bookdown.yml

rmd_subdir: ["md/"]
delete_merged_file: true

index.Rmd

---
title: "Test parameterized reports with drake"
author: "Alec L. Robitaille"
date: "`r Sys.Date()`"
site: "bookdown::bookdown_site"
output:
 bookdown::gitbook: default
documentclass: book
---

# Test

_template.Rmd

#  .title

```{r}
drake::readd(.id)
```

There are two datasets used: the result of `write.csv(mtcars[1:5, ], 'data/mtcars_1-5.csv')` and the same for rows 6:10 (`data/mtcars_6-10.csv`).

.
├── data/
│   ├── mtcars_1-5.csv
│   └── mtcars_6-10.csv
├── md/
├── R/
│   ├── functions.R
│   └── plan.R
├── _drake.R
├── _bookdown.yml
├── index.Rmd
└── _template.Rmd
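The two input files can be recreated along these lines. This is a sketch that assumes default `write.csv()` settings (row names included) and writes under a temporary `root` directory; `root` is a name introduced here for illustration and would be the project directory in practice.

```r
# Assumption: a throwaway project root; replace with the real project directory.
root <- file.path(tempdir(), "drake-bookdown-example")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)

# Rows 1-5 and 6-10 of the built-in mtcars data, one CSV per chapter.
write.csv(mtcars[1:5, ], file.path(root, "data", "mtcars_1-5.csv"))
write.csv(mtcars[6:10, ], file.path(root, "data", "mtcars_6-10.csv"))

# Same listing call as in R/plan.R, with the dot escaped and the end anchored.
files <- dir(file.path(root, "data"), pattern = "\\.csv$", full.names = TRUE)
print(basename(files))
```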


Desired result

An output bookdown book that has a chapter for each dataset in an input folder, generated using a drake plan. 

Session info

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libopenblasp-r0.3.10.so
LAPACK: /usr/lib/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] compiler_4.0.3  htmltools_0.5.0 tools_4.0.3     yaml_2.2.1
 [5] rmarkdown_2.3   knitr_1.30      xfun_0.17       digest_0.6.25
 [9] rlang_0.4.7     evaluate_0.14

robitalec commented 3 years ago

There is a shorter way to describe this problem:

Because I am generating the Rmarkdown files from a dynamic branch, and rendering all of them with bookdown::render_book, the dependencies to drake (sub)targets can't be detected automatically with knitr_in and I'm not sure how to explicitly declare them.

For example, one of the Rmarkdown files generated by the "fills" step:

236020fc.Rmd

#  data/mtcars_6-10.csv

```{r}
drake::readd(means_236020fc)
```

wlandau commented 3 years ago

My general advice in times like this is to have drake do what it's good at and R Markdown do what it's good at. drake is designed to reduce repetitive computation, and you see the greatest degree of benefit in situations where that computation is long. R Markdown is good at mixing prose and results, but it is not a true pipeline tool and it struggles with cumbersome code and long runtimes. So in the drake plan, the upstream targets should do the intense work - preprocessing large data, fitting complicated models, etc. - and then have downstream targets to produce nice figures and tables. In the R Markdown, you can call them explicitly with loadd() and/or readd(). That way, each chapter looks as clean as possible and runs as fast as possible. If all you are doing is reading targets, the whole bookdown report should take next to no time at all, so you can throw all those reports into a single target. Sketch:

plan <- drake_plan(
  data = get_data(),
  model = fit_model(data),
  book = your_custom_book_function(
    sources = knitr_in(
      "index.Rmd",
      "chapter1.Rmd",
      "chapter2.Rmd"
    ),
    output = file_out("final_book_file.html")
  )
)

where your_custom_book_function() is a function that runs the entire bookdown book.

If those reports are parameterized, you could reference parameters rather than loadd()'ing targets, and declare the targets as dependencies through the plan.

plan <- drake_plan(
  data = get_data(),
  model = fit_model(data),
  book = your_custom_book_function(
    sources = knitr_in(
      "index.Rmd",
      "chapter1.Rmd",
      "chapter2.Rmd"
    ),
    output = file_out("final_book_file.html"),
    deps = c(data, model) # supplied to R Markdown parameters using custom code
  )
)

targets has better support for parameterized reports through tarchetypes::tar_render(), but that might not be much help for bookdown. bookdown + parameterized reports does not seem like a very common use case.

robitalec commented 3 years ago

Thanks @wlandau. I just noticed targets, really excited to give it a try.

In your chunk you list "index.Rmd", "chapter1.Rmd", "chapter2.Rmd" in knitr_in. If I am using a dynamic target for a list of input files, and let's say either I have thousands of input files or some are added/deleted... How can I use knitr_in?

knitr_in(dir('md'))

Detected knitr_in(dir("md/")). File paths in file_in(), file_out(), and knitr_in() must be literal strings, not variables. For example, file_in("file1.csv", "file2.csv") is legal, but file_in(paste0(filename_variable, ".csv")) is not. Details: https://books.ropensci.org/drake/plans.html#static-files

I understand this limitation of knitr_in is described in the manual, so alternatively I used the dynamic files (where fill_placeholders returns a path list to the Rmarkdown files):

# updated part of R/plan.R from above
  fills = target(fill_placeholders(file_in('_template.Rmd'), data = means),
                 dynamic = map(means), format = 'file'),
  render = render_with_deps(knitr_in('index.Rmd'),
                            file_in('_bookdown.yml'),
                            fills)

render_with_deps <- function(index, config, deps) {
  bookdown::render_book(
    index,
    config_file = config
  )
}
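For a `format = 'file'` target, the command has to return the path(s) it wrote. A minimal sketch of a `fill_placeholders()` that does so is below; the `out_dir` argument and the temporary-directory demo are assumptions added for illustration, not the exact code from the project.

```r
# Sketch: fill the template for one dataset and return the written path.
# `out_dir` is a hypothetical argument; the thread's version hard-codes 'md/'.
fill_placeholders <- function(template, data, out_dir = 'md') {
  # id looks like "means_236020fc"; the hash after "_" names the chapter file
  key <- strsplit(data$id, '_', fixed = TRUE)[[1]][[2]]
  lns <- readLines(template)
  # fixed = TRUE treats '.id' and '.title' as literal text, not regex
  lns <- gsub('.id', data$id, lns, fixed = TRUE)
  lns <- gsub('.title', data$path, lns, fixed = TRUE)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  out <- file.path(out_dir, paste0(key, '.Rmd'))
  writeLines(lns, out)
  out  # returned so a dynamic file target can track the file
}

# Demo in a temporary directory with a copy of the template from the issue.
tmp <- file.path(tempdir(), 'fill-demo')
dir.create(tmp, showWarnings = FALSE)
template <- file.path(tmp, '_template.Rmd')
fence <- strrep('`', 3)
writeLines(c('# .title', '', paste0(fence, '{r}'), 'drake::readd(.id)', fence),
           template)

means <- data.frame(meanmpg = mean(mtcars$mpg[6:10]), id = 'means_236020fc',
                    path = 'data/mtcars_6-10.csv', stringsAsFactors = FALSE)
out <- fill_placeholders(template, means, out_dir = file.path(tmp, 'md'))
print(readLines(out))
```

Returning the path rather than the result of `writeLines()` (which is `NULL`) is the only behavioural difference from the original function.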

r_make()

Stack trace:

 Process 1793:
 1. (function (r_args = list())  ...
 2. drake:::r_make(r_args = r_args)
 3. drake:::r_drake(source, drake::make_impl, list(), r_fn, r_args)
 4. base:::do.call(r_fn, r_args)
 5. (function (func, args = list(), libpath = .libPaths(), repos = default_rep ...
 6. callr:::get_result(output = out, options)
 7. throw(newerr, parent = remerr[[2]])

 x callr subprocess failed: target render failed.
diagnose(render)$error$message:
  comparison (2) is possible only for atomic and list types
diagnose(render)$error$calls:
  global::render_with_deps(knitr_in("index.Rmd"), file_in("_bookdown.yml"), 
    fills)
  bookdown::render_book(index, config_file = config) 

 Process 9574:
 19. (function (source, d_fn, d_args)  ...
 20. base:::do.call(d_fn, d_args)
 21. (function (config)  ...
 22. drake:::process_targets(config)
 23. drake:::run_backend(config)
 24. drake:::drake_backend(config)
 25. drake:::drake_backend_loop(config)
 26. drake:::loop_check(config)
 27. drake:::local_build(target = targets[1], config = config, downstream = tar ...
 28. drake:::conclude_build(build, config)
 29. drake:::conclude_build_impl(value, target, meta, config)
 30. drake:::conclude_build_impl.default(value, target, meta, config)
 31. drake:::handle_build_exceptions(target = target, meta = meta,  ...
 32. drake:::handle_build_error(target, meta, config)
 33. drake:::log_failure(target, meta, config)
 34. drake:::stop0(msg)
 35. base:::stop(..., call. = FALSE)
 36. base:::.handleSimpleError(function (e)  ...
 37. h(simpleError(msg, call))

 x target render failed.
diagnose(render)$error$message:
  comparison (2) is possible only for atomic and list types
diagnose(render)$error$calls:
  global::render_with_deps(knitr_in("index.Rmd"), file_in("_bookdown.yml"), 
    fills)
  bookdown::render_book(index, config_file = config)

wlandau commented 3 years ago

> In your chunk you list "index.Rmd", "chapter1.Rmd", "chapter2.Rmd" in knitr_in. If I am using a dynamic target for a list of input files, and let's say either I have thousands of input files or some are added/deleted... How can I use knitr_in?

knitr_in() is incompatible with dynamic branching, and in use cases like this, I do not think it is necessary. To insert a bunch of R Markdown source files programmatically, you can use tidy evaluation with !!.

source_files <- dir("your_directory")
plan <- drake_plan(reports = process_reports(knitr_in(!!source_files)))

And like I said before, R Markdown should render quickly, so you can already do a lot within a single target. You can avoid dynamic branching by condensing more work into a smaller number of targets. That's usually a better way to go anyway because each target takes a bigger bite out of runtime and overhead gets lower. (But if a target is too big, then it becomes an all-or-nothing situation where invalidating one target invalidates most of the pipeline.) For optimal performance, I would usually aim for 10-100 targets and try to distribute the runtime about evenly.

robitalec commented 3 years ago

Right but this dir("your_directory") can't be run outside of the plan, since those intermediate Rmd files are generated within the plan by the preceding target. Hmm.

I'm currently doing a fairly unsatisfying, hacky thing where I pass drake parameters to set up the dependencies, but then don't even use them in the render_with_deps function. Whenever I do pass the upstream targets properly to the render_book command (e.g. bookdown::render_book(index, config)), I get an object not found "config" error.

render_with_deps <- function(index, config, deps) {
  bookdown::render_book('index.Rmd', config_file = '_bookdown.yml')
}

# ...

# (plan)
  fills = target(fill_placeholders(file_in('_template.Rmd'), data = means),
                 dynamic = map(means), format = 'file'),
  render = render_with_deps(knitr_in('index.Rmd'), '_bookdown.yml', fills)
)

I set this example up with targets, and interestingly I get the same object not found error once I get to tar_target(render, render_with_deps(index, config, fills)). Thanks for the help, think I'll sit on this for a bit and see if I can figure out another work around.

wlandau commented 3 years ago

> Right but this dir("your_directory") can't be run outside of the plan, since those intermediate Rmd files are generated within the plan by the preceding target.

Yeah, in that case, knitr_in() is not usable here.

> I'm currently doing a fairly unsatisfying, hacky thing where I pass drake parameters to set up the dependencies, but then don't even use them in the render_with_deps function. Whenever I do pass the upstream targets properly to the render_book command (e.g. bookdown::render_book(index, config)), I get an object not found "config" error.

An unused deps argument or even ... does allow you to declare deps just for the sake of forcing dependency relationships. That seems fine. The "object config not found" error seems solvable. Maybe set debug(render_with_deps) and then run regular make()?
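The unused-argument idea might look like the sketch below. `render_fun` is a hypothetical argument added here only so the wrapper can be exercised without bookdown installed (its default is evaluated lazily, so bookdown is touched only when no replacement is supplied), and `fake_render` is a stand-in, not a real renderer.

```r
# Sketch: `...` is deliberately unused. Naming upstream targets there in the
# plan command is what declares the dependency; the function ignores them.
render_with_deps <- function(index, config, ...,
                             render_fun = bookdown::render_book) {
  render_fun(index, config_file = config)
}

# Stand-in renderer for a dry run (assumption: illustration only).
fake_render <- function(input, config_file) {
  paste('would render', input, 'using', config_file)
}

msg <- render_with_deps(
  'index.Rmd',
  '_bookdown.yml',
  c('md/236020fc.Rmd', 'md/781a152c.Rmd'),  # e.g. the paths returned by fills
  render_fun = fake_render
)
print(msg)
```

In the plan, the call would stay `render_with_deps(knitr_in('index.Rmd'), file_in('_bookdown.yml'), fills)`, with `fills` landing in `...`.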

robitalec commented 3 years ago

It seems to be different depending on if I use make or r_make:

In a fresh session

library(drake)

clean()

source('R/functions.R')

source('R/plan.R')

make(plan)

▶ dynamic reads
> subtarget reads_61b17d5f
> subtarget reads_f931b100
■ finalize reads
▶ dynamic means
> subtarget means_781a152c
> subtarget means_236020fc
■ finalize means
▶ dynamic fills
> subtarget fills_8ea5afed
> subtarget fills_848c3d6b
■ finalize fills
▶ target render

processing file: _main.Rmd
(... bookdown = success)

clean()

r_make()

▶ dynamic reads
> subtarget reads_61b17d5f
> subtarget reads_f931b100
■ finalize reads
▶ dynamic means
> subtarget means_781a152c
> subtarget means_236020fc
■ finalize means
▶ dynamic fills
> subtarget fills_8ea5afed
> subtarget fills_848c3d6b
■ finalize fills
▶ target render
x fail render
Error : target render failed.
diagnose(render)$error$message:
  comparison (2) is possible only for atomic and list types
diagnose(render)$error$calls:
  global::render_with_deps(knitr_in("index.Rmd"), file_in("_bookdown.yml"), 
    fills)
  bookdown::render_book(input = input, config_file = config)

This isn't the object not found error, it's the one I mentioned above. I'm having a bit of trouble reproducing the object not found error at the moment because I have been fiddling with these and didn't make a commit when that error was occurring. But here is the error trace from the targets version, if it's of any interest/use:

Stack trace:

 Process 11101:
 1. targets::tar_make()
 2. targets:::callr_outer(targets_function = tar_make_inner, target ...
 3. targets:::trn(is.null(callr_function), callr_inner(target_scrip ...
 4. base:::do.call(callr_function, prepare_callr_arguments(callr_fu ...
 5. (function (func, args = list(), libpath = .libPaths(), repos =  ...
 6. callr:::get_result(output = out, options)
 7. throw(newerr, parent = remerr[[2]])

 x callr subprocess failed: object 'config_file' not found . 

 Process 11342:
 19. (function (targets_script, targets_function, targets_arguments) ...
 20. base:::do.call(targets_function, targets_arguments)
 21. (function (pipeline, names_quosure, reporter)  ...
 22. local_init(pipeline = pipeline, names = names, queue = "sequent ...
 23. self$process_next()
 24. self$process_target(self$scheduler$queue$dequeue())
 25. targets:::trn(target_should_run(target, self$meta), self$run_ta ...
 26. self$run_target(name)
 27. targets:::target_conclude(target, self$pipeline, self$scheduler ...
 28. targets:::target_conclude.tar_builder(target, self$pipeline,  ...
 29. targets:::builder_error(target, pipeline, scheduler, meta)
 30. targets:::builder_handle_error(target, pipeline, scheduler, meta)
 31. targets:::throw_run(target$metrics$error)
 32. base:::stop(condition_run(...))
 33. (function (e)  ...

 x object 'config_file' not found .

wlandau commented 3 years ago

It's hard to know without a full end-to-end reprex. Not sure where exactly config gets defined. What if you restart your R session to empty out the global environment? That might cause make() to error out.

robitalec commented 3 years ago

make() was run in a fresh, empty global environment when it wasn't giving the same error.

Here's the full toy project as a zip: drake-parameterized-reports.zip

wlandau commented 3 years ago

Thanks for the reprex. When I downloaded the zip and ran drake::r_make() inside, it actually ran without error. Could your main project be tangled up with old versions of functions you thought you changed?

robitalec commented 3 years ago

My apologies, I really should have checked this before. This has been a bit of an iterative process of hitting errors and figuring out how to do this. I updated my packages and the comparison (2) error disappeared.

For your interest (maybe), these are the first level dependencies (imports+suggests) that I updated:

R: updated[updated %in% unlist(pacman::p_depends(drake))]
[1] "future"    "rmarkdown" "storr"    
[4] "usethis"  
R: updated[updated %in% unlist(pacman::p_depends(bookdown))]
[1] "rmarkdown" "tinytex"   "xfun"

I checked the targets version of this example, and it seems to still be returning the object not found error. I'm going to check again to see if I can sort it out first and open an issue there if necessary. Thanks again.
