rstudio / rmarkdown

Dynamic Documents for R
https://rmarkdown.rstudio.com
GNU General Public License v3.0
2.84k stars 966 forks source link

Differing output when providing `output_file` argument to `render()` #2264

Open grimbough opened 2 years ago

grimbough commented 2 years ago

Checklist

When filing a bug report, please check the boxes below to confirm that you have provided us with the information we need. Have you:

This issue concerns the BiocStyle package and producing a PDF vignette from R Markdown. This issue has initially been tracked in https://github.com/Bioconductor/BiocStyle/issues/93 I'm happy to incorporate necessary changes into BiocStyle, but wanted to report the unexpected behaviour here.

Consider the following minimal Rmd file (referred to as test.Rmd below) using the BiocStyle::pdf_document output format:

---
title: "Vignette Title"
output:
  BiocStyle::pdf_document:
    keep_tex: true
---

Using the `images = "lowres"` is fine.
rmarkdown::render("test.Rmd")
#> ...
#> Output created: test.pdf

However it fails when supply an output_file argument. (In the example below I've used test1.pdf to distinguish between files, but it fails with the same error even if you use is the same as the default i.e. test.pdf).

rmarkdown::render("test.Rmd", output_file = "test1.pdf")
#> ....
#> ! Package soul Error: Reconstruction failed.

Examining the two TeX files produced, the line containing the inline code differs, and seems to have some escaped spaces in the version that fails to compile. The two lines are show below for comparison.

Using the \texttt{images = "lowres"} is fine.   %Compiles
Using the \texttt{images\ =\ "lowres"} is fine. %Does not compile

Normally BiocStyle is aware of this issue and attempts to remove the escaped spaces before Latex is called. However this patch is not applied when we specify the name of an output file with ".pdf" extension.

I believe the difference in output starts with the following code block:

https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L487-L491

In our example where output_file is unspecified then it will be assigned test.tex (the value of output_auto) with the file extension reflecting the definition in output_format$pandoc$ext. When we provide output_file = "test1.pdf" as an argument it obviously has ".pdf" as the extension.

This then affects the following code block, which is executed only in the case where we've specified output_file = "test1.pdf":

https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L974-L982

In the case of output_file = "test1.pdf" the latex compilation is carried out immediately at this step via latexmk() and fails due to something in the BiocStyle template. However when output_file has a ".tex" extension compilation is deferred and carried out as part of output_format$post_processor, which also patches the template issues. This is based on the technique used in bookdown.


Hopefully the above makes sense! I'm not sure if we're doing something unsupported in BiocStyle but it seems the crux of this issue is that specifying the output file name will override the setting defined by our output_format$pandoc, and cause latex compilation to happen earlier than it otherwise would. I wanted to check whether this is the expected behaviour. I believe the issue is also present (but silently works) if you use output_format = bookdown::pdf_document2. There latexmk() will be called twice - once in render() and once in the output_format$post_processor. with the second overwriting the first.


> xfun::session_info(c('rmarkdown', 'BiocStyle'))
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20.2, RStudio 2021.9.0.351

Locale:
  LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
  LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
  LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
  LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

Package version:
  base64enc_0.1.3     BiocManager_1.30.16 BiocStyle_2.23.2    bookdown_0.24      
  bslib_0.3.1         digest_0.6.29       evaluate_0.14       fastmap_1.1.0      
  fs_1.5.2            glue_1.6.0          graphics_4.1.1      grDevices_4.1.1    
  highr_0.9           htmltools_0.5.2     jquerylib_0.1.4     jsonlite_1.7.2     
  knitr_1.37          magrittr_2.0.1      methods_4.1.1       R6_2.5.1           
  rappdirs_0.3.3      rlang_0.4.12        rmarkdown_2.11.3    sass_0.4.0         
  stats_4.1.1         stringi_1.7.6       stringr_1.4.0       tinytex_0.36       
  tools_4.1.1         utils_4.1.1         xfun_0.29           yaml_2.2.1         

Pandoc version: 2.14.0.3
cderv commented 2 years ago

@grimbough thanks for the report.

Can you explain the hack done for the soul package ?

having backslash seems the way Pandoc transforms it

❯ pandoc -t latex
Using the `images = "lowres"` is fine.
^Z
Using the \texttt{images\ =\ "lowres"} is fine.

Also using rmarkdown::pdf_document will have the escaped spaces in the intermediates .tex output.

I would like to understand what exactly is needed for this soul package.

That being said, I don't yet understand why a post processor would not apply if an extension is set to output_file (https://github.com/Bioconductor/BiocStyle/issues/93#issuecomment-998838743)

grimbough commented 2 years ago

The addition of the soul package to the BiocStyle .sty predates my time contributing to the package, so I can't comment on why we choose to use it, but the initial reports of problems with it can be found in https://github.com/Bioconductor/BiocStyle/issues/14 That seems to acknowledge that the pandoc output is incompatible with the soul package, hence the need for some sort of patch.

The comment on the soul patch reads "# substitute all control spaces "\ " in \texttt by regular spaces" and the actual patch code can be found in the post_processor function at: https://github.com/Bioconductor/BiocStyle/blob/96ad6f7cc4d87f9705f51bacb5c62fc4c7e4f6de/R/pdf_document.R#L146-L163

Also using rmarkdown::pdf_document will have the escaped spaces in the intermediates .tex output.

I think this isn't a problem because rmarkdown::pdf_document doesn't use the soul package in it's style file.

That being said, I don't yet understand why a post processor would not apply if an extension is set to output_file (Bioconductor/BiocStyle#93 (comment))

The problem in this case is not that the post processor doesn't get applied when we set a file extension, it's more that unless the file extension is .tex the function latexmk() gets called before the post processor is applied. For BiocStyle::pdf_document this results in an error because that post processor patch is necessary for the latex file to be compiled, and for something like bookdown::pdf_document2 it means the output PDF file is generated twice (although you'll only notice that if stepping through the code).

I the case where we don't set a file extension to the output_file argument the code will use whatever is set in output_format$pandoc$ext which for BiocStyle::pdf_document or bookdown::pdf_document2 is .tex rather than .pdf. This is behaviour is performed by: https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L487-L491

cderv commented 2 years ago

Thanks for your feedback. I got some hint yesterday in my bug hunt but did not share because it was not ended. I think you helped us unravel a odd behavior and potentially important hidden bug.

I found the part of the code that make the post processing not run correctly https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L974-L981

The code you shared above is trying to determine what should be the output file extension when none is asked (most of the time output_file is auto determined), and you're right with the difference between bookdown::pdf_document2() and pdf_document().

bookdown::pdf_document2() seems to enforce .tex extension because it will handle the PDF conversion in its post processor after tweaking the tex file. When extension is not .tex, output PDF is assumed and the conversion to PDF happens directly inside rmarkdown::render() before any post processing of the .tex file.

The post processor will be run after on the output file being a pdf. Until recently, pdf_document() had no post processor and it seemed to work. But there is one today and I don't think it is working correctly. Oddly, it throws a warning only in debug mode.

bookdpwn::pdf_document2() doesn't have an issue because

.tex to .pdf conversion is run twice ! First in https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L974-L981 and then in https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L1001-L1006

From what I understand, BiocStyle::pdf_document() requires to have a processing of the .tex file so that the PDF can be built. With current behavior, the file extension needs to be .tex, otherwise a first conversion will happen before any post processing. When output_file is set with a .tex extension, rmarkdown::render() will only convert .md to .tex, and the post processor bookdown::pdf_document2() will run. It will run your custom post processor, then its post processor, including a .tex to .pdf conversion.

With the current way rmarkdown is working, BiocStyle::pdf_document() won't work when output_file is set using an extension different than .tex.

In a way, output_format$post processor is supposed to run on the output file from the conversion. So when output is supposed to be PDF, it runs after pdf conversion, but I don't think it makes sense - usually you want to post process the .tex file from after pandoc conversion.

I don't think all of this is expected and we will look to improve that. I need to discuss with @yihui to understand the historical behavior and see how we can change that.

Personally, I did not have any idea of this issue - thanks for reporting ! That is an odd bug - it took me time to understand.

cderv commented 2 years ago

I think this isn't a problem because rmarkdown::pdf_document doesn't use the soul package in it's style file.

About that, I tried to reproduce the initial BiocStyle is trying to solve with it patch by using soul CTAN package in a regular PDF document. At first, I did not manage. By looking at the .sty file, I understand that the issue comes from

https://github.com/Bioconductor/BiocStyle/blob/932ae2eacb8766dc1bb552be0ecfd1a71e6a2723/inst/resources/tex/Bioconductor.sty#L274-L279

% local redefinitions in \texttt to allow use of \`{}  and \^{} in \hl
\renewcommand{\texttt}[2][codecolor]{{%
 \def\`{\textasciigrave}%
 \def\^{\textasciicircum}%
 \textcolor{#1}{\hl{\ttfamily#2}}%
}}

This will redefine the the \texttt command, and pass its content to \hl which is a soul command it seems. That is where control space \ is not supported and you need the tweak.

I was just curious to understand. I don't know if there is anything else that can be done to be compatible with control spaces.

grimbough commented 2 years ago

Thanks for the feedback. I agree it's an odd bug, and also took me quite some time to pin down the behaviour. I was reluctant to report anything until I was sure it wasn't BiocStyle doing something unsupported. Reading through I think we've come to exactly the same conclusions about what's happening, so that's encouraging!

From my point of view there no problem if it takes a while to fix - that soul patch has been in BiocStyle for 5 years and it's only been reported now, so I guess providing a non-default output file name doesn't occur very often for our users, and we have the "no extension" work around anyway.


I guess there's something philosophical about the choice of output file if someone does something weird like output_format = "BiocStyle::pdf_document()" and output_file = "test.html" - should they get a PDF, an HTML, or a PDF called "test.html"? At the moment the render() code takes slightly more cues from the output file name than the output format options. I'm not sure what's "right" but it definitely seems weird that output_format = BiocStyle::pdf_document(), output_file = "test.tex" generates a PDF and output_format = BiocStyle::pdf_document(), output_file = "test.pdf" throws an error!

cderv commented 2 years ago

but it definitely seems weird that output_format = BiocStyle::pdf_document(), output_file = "test.tex" generates a PDF and output_format = BiocStyle::pdf_document(), output_file = "test.pdf" throws an error!

I can understand - it is not like that with all formats. Somehow in all relies on what BiocStyle::pdf_document() is using internally.

output_format = BiocStyle::pdf_document(), output_file = "test.pdf" should work but BiocStyle requires a tex post processing before rendering the PDF, with does not happen unless render() is tricked to produce a tex file only and PDF conversion happens in the post processor. bookdown takes that into account but the double latexmk() rendering does not seem right. I'll look into that.

This output_format() mechanism is not really documented and a lot of different things happens in render() depending on arguments. One day we'll document and it may unravel more than this odd behavior 😅

Thanks a lot for this discussion ! We'll work on rmarkdown early next year so we'll probably modify this.