rstudio / rmarkdown

Dynamic Documents for R
https://rmarkdown.rstudio.com
GNU General Public License v3.0
2.86k stars 969 forks source link

[FR] `*.knit.md` should have a unique name to allow parallelization #2454

Open salim-b opened 1 year ago

salim-b commented 1 year ago

rmarkdown / knitr / Pandoc can be slow when generating 1000s of PDFs, especially when using xelatex instead of the default pdflatex renderer. An easy and straightforward way to speed things up on multicore systems is to rely on the means provided by the awesome future framework, especially the furrr package, a drop-in replacement for purrr.

Now if one tries to combine rmarkdown::render() with say future_pwalk() and the multisession backend, the rendering will most likely fail due to the individual workers interfering each other by trying to read and write from/to the same intermediary *.knit.md file.

This can be worked around by giving the input filename a unique name in the map/walk function before feeding it to rmarkdown::render() (or setting a different intermediates_dir; but varying the intermediates dir is discouraged as far as I understand and introduced additional bugs in my tests).

A solution at the core of rmarkdown / knitr for this issue would be to give the intermediate *.knit.md a unique filename from the beginning. Ideally one that is based on its actual content (a fast hash function like the ones from the xxHash family would be suitable I guess; rlang::hash_file() provides XXH128). add a sufficiently large random string to all intermediate output filenames (per rmarkdown::render() invocation).

What do you think?

As I realized after submission, this issue could be considered a duplicate of https://github.com/rstudio/rmarkdown/issues/499.

cderv commented 1 year ago

This seems like a good idea. I wonder what it could break in other tools pipeline doing this 🤔 (also having Quarto in mind)

But is renaming the *.knit.md ? There will still be the problem of shared intermediate_dir, isn't it ? It is build upon the original input file name.

In your testing does changing the knit output name was only thing to do ?

Current possible small hack if you: the knit output extension can be changed for now using rmarkdown.knit.ext option. but not the meta part.

if hash is enough, we could do something like rmarkdown.knit.meta_ext options, where as a user (or external tool) you could provide the function you want ``

diff --git a/R/render.R b/R/render.R
index 9d73610d..60045dee 100644
--- a/R/render.R
+++ b/R/render.R
@@ -395,8 +395,12 @@ render <- function(input,
   # as `.md~` will be ignored.
   input <- basename(input)
   knit_input <- input
+  # input_based meta
+  meta_ext_fun <- getOption("rmarkdown.knit.meta_ext_fun", function(input) "knit")
+  if (!is.function(meta_ext_fun))
+    stop2("'rmarkdown.knit.meta_ext_fun' has been provided but it should be a function with 'input' as argument")
   knit_output <- intermediates_loc(
-    file_with_meta_ext(input, "knit", getOption("rmarkdown.knit.ext", "md"))
+    file_with_meta_ext(input, meta_ext_fun(input), getOption("rmarkdown.knit.ext", "md"))
   )
   intermediates <- c(intermediates, knit_output)

This way you could do options(rmarkdown.knit.meta_ext_fun = rlang::hash_file) to get an knit output file with a hash in the name (e.g mydoc.Rmd -> myDoc.61675d71e5e752429bd53f0824874f3d.md)

Just an idea if really changing the knit output filename helps.

salim-b commented 1 year ago

But is renaming the *.knit.md ? There will still be the problem of shared intermediate_dir, isn't it ? It is build upon the original input file name.

I actually didn't test only renaming the *.knit.md. Instead I gave the input file a unique name which worked. You're probably right about the rest of the intermediate files.

In your testing does changing the knit output name was only thing to do ?

No (didn't test that as stated above).

if hash is enough (...)

Actually, I don't think using a hash that is based solely on the input file content is a good idea (sorry for my initial suggestion). If we choose to render the exact same input file multiple times in parallel (e.g. with different params), we would run into the same interference again. So instead I'd recommend to use a sufficiently large random string derived from R's PRNG state, so there won't be any clashes in the first place.


Thanks about the code suggestion with a new rmarkdown.knit.meta_ext_fun option. But as written above, changing only the knit output filename will most likely not be enough. Instead all intermediate output files should contain the random part. I've update my initial post accordingly.

cderv commented 1 year ago

If we choose to render the exact same input file multiple times in parallel (e.g. with different params), we would run into the same interference again.

Ok - I agree with that and was suprised with the suggestion. Thanks for confirming it is not enough.

So instead I'd recommend to use a sufficiently large random string derived from R's PRNG state, so there won't be any clashes in the first place.

We could try that, but I believe all the intermediate directories generation should be changed. This is what #499 is about and not easy to do - this part is really sensible to breaking change. And also there is the specific case of LaTeX which expect a lot of its intermediates to be local to the input file.

Though I agree this would be very interesting to solve issues with parallel rendering. Currently, rmarkdown way of working implies that the parallelization usage requires to create wrapper around render that copy assets, creates some directories etc...

Quite a challenge that requires some dedicated time on this specifically, and probably lots of testings.

salim-b commented 1 year ago

We could try that, but I believe all the intermediate directories generation should be changed. This is what https://github.com/rstudio/rmarkdown/issues/499 is about and not easy to do - this part is really sensible to breaking change. And also there is the specific case of LaTeX which expect a lot of its intermediates to be local to the input file.

Well, instead of using a unique/isolated intermediate_dir per invocation of rmarkdown::render(), we could stick to the default one (the parent directory of the input file) but still append a random part to all intermediate files.

I.e. rendering my_file.Rmd would produce the intermediate files my_file-XXXXXXXX.md, my_file-XXXXXXXX.tex,my_file-XXXXXXXX.log, my_file-XXXXXXXX.aux and whatnot, where XXXXXXXX would correspond to the random string generated for that particular invocation of rmarkdown::render().

No?

cderv commented 1 year ago

Probably. I wonder how much we rely on the stem for the input file. Like myFile.Rmd -> using myFile for a lot of intermediate resource.

but we could probably track every occurence and change this. I just don't know how LaTeX would behave with this. Probably ok as the .tex file would be with the id in the name...

this worth trying for sure. We just think this is no an easy change - @yihui has more experiece about path handling over the years in R Markdown