Open grimbough opened 2 years ago
@grimbough thanks for the report.
Can you explain the hack done for the soul package ?
having backslash seems the way Pandoc transforms it
❯ pandoc -t latex
Using the `images = "lowres"` is fine.
^Z
Using the \texttt{images\ =\ "lowres"} is fine.
Also using rmarkdown::pdf_document
will have the escaped spaces in the intermediates .tex
output.
I would like to understand what exactly is needed for this soul package.
That being said, I don't yet understand why a post processor would not apply if an extension is set to output_file
(https://github.com/Bioconductor/BiocStyle/issues/93#issuecomment-998838743)
The addition of the soul package to the BiocStyle .sty
predates my time contributing to the package, so I can't comment on why we choose to use it, but the initial reports of problems with it can be found in https://github.com/Bioconductor/BiocStyle/issues/14 That seems to acknowledge that the pandoc output is incompatible with the soul package, hence the need for some sort of patch.
The comment on the soul patch reads "# substitute all control spaces "\ " in \texttt by regular spaces" and the actual patch code can be found in the post_processor
function at:
https://github.com/Bioconductor/BiocStyle/blob/96ad6f7cc4d87f9705f51bacb5c62fc4c7e4f6de/R/pdf_document.R#L146-L163
Also using rmarkdown::pdf_document will have the escaped spaces in the intermediates .tex output.
I think this isn't a problem because rmarkdown::pdf_document
doesn't use the soul package in it's style file.
That being said, I don't yet understand why a post processor would not apply if an extension is set to output_file (Bioconductor/BiocStyle#93 (comment))
The problem in this case is not that the post processor doesn't get applied when we set a file extension, it's more that unless the file extension is .tex
the function latexmk()
gets called before the post processor is applied. For BiocStyle::pdf_document
this results in an error because that post processor patch is necessary for the latex file to be compiled, and for something like bookdown::pdf_document2
it means the output PDF file is generated twice (although you'll only notice that if stepping through the code).
I the case where we don't set a file extension to the output_file
argument the code will use whatever is set in output_format$pandoc$ext
which for BiocStyle::pdf_document
or bookdown::pdf_document2
is .tex
rather than .pdf
. This is behaviour is performed by: https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L487-L491
Thanks for your feedback. I got some hint yesterday in my bug hunt but did not share because it was not ended. I think you helped us unravel a odd behavior and potentially important hidden bug.
I found the part of the code that make the post processing not run correctly https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L974-L981
The code you shared above is trying to determine what should be the output file extension when none is asked (most of the time output_file
is auto determined), and you're right with the difference between bookdown::pdf_document2()
and pdf_document()
.
bookdown::pdf_document2()
seems to enforce .tex
extension because it will handle the PDF conversion in its post processor after tweaking the tex
file. When extension is not .tex
, output PDF is assumed and the conversion to PDF happens directly inside rmarkdown::render()
before any post processing of the .tex
file.
The post processor will be run after on the output file being a pdf. Until recently, pdf_document()
had no post processor and it seemed to work. But there is one today and I don't think it is working correctly. Oddly, it throws a warning only in debug mode.
bookdpwn::pdf_document2()
doesn't have an issue because
.tex
output: https://github.com/rstudio/bookdown/blob/4e859c0edede084643d27e13060dc5f1275d2433/R/latex.R#L47tinytex::latexmk()
for PDF conversion: https://github.com/rstudio/bookdown/blob/4e859c0edede084643d27e13060dc5f1275d2433/R/latex.R#L65-L68.tex
to .pdf
conversion is run twice !
First in
https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L974-L981
and then in
https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L1001-L1006
From what I understand, BiocStyle::pdf_document()
requires to have a processing of the .tex
file so that the PDF can be built. With current behavior, the file extension needs to be .tex
, otherwise a first conversion will happen before any post processing. When output_file
is set with a .tex
extension, rmarkdown::render()
will only convert .md
to .tex
, and the post processor bookdown::pdf_document2()
will run. It will run your custom post processor, then its post processor, including a .tex
to .pdf
conversion.
With the current way rmarkdown is working, BiocStyle::pdf_document()
won't work when output_file
is set using an extension different than .tex
.
In a way, output_format$post processor
is supposed to run on the output file from the conversion. So when output is supposed to be PDF, it runs after pdf conversion, but I don't think it makes sense - usually you want to post process the .tex
file from after pandoc conversion.
I don't think all of this is expected and we will look to improve that. I need to discuss with @yihui to understand the historical behavior and see how we can change that.
Personally, I did not have any idea of this issue - thanks for reporting ! That is an odd bug - it took me time to understand.
I think this isn't a problem because rmarkdown::pdf_document doesn't use the soul package in it's style file.
About that, I tried to reproduce the initial BiocStyle is trying to solve with it patch by using soul
CTAN package in a regular PDF document. At first, I did not manage. By looking at the .sty
file, I understand that the issue comes from
% local redefinitions in \texttt to allow use of \`{} and \^{} in \hl
\renewcommand{\texttt}[2][codecolor]{{%
\def\`{\textasciigrave}%
\def\^{\textasciicircum}%
\textcolor{#1}{\hl{\ttfamily#2}}%
}}
This will redefine the the \texttt
command, and pass its content to \hl
which is a soul
command it seems. That is where control space \
is not supported and you need the tweak.
I was just curious to understand. I don't know if there is anything else that can be done to be compatible with control spaces.
Thanks for the feedback. I agree it's an odd bug, and also took me quite some time to pin down the behaviour. I was reluctant to report anything until I was sure it wasn't BiocStyle doing something unsupported. Reading through I think we've come to exactly the same conclusions about what's happening, so that's encouraging!
From my point of view there no problem if it takes a while to fix - that soul patch has been in BiocStyle for 5 years and it's only been reported now, so I guess providing a non-default output file name doesn't occur very often for our users, and we have the "no extension" work around anyway.
I guess there's something philosophical about the choice of output file if someone does something weird like output_format = "BiocStyle::pdf_document()"
and output_file = "test.html"
- should they get a PDF, an HTML, or a PDF called "test.html"? At the moment the render()
code takes slightly more cues from the output file name than the output format options. I'm not sure what's "right" but it definitely seems weird that output_format = BiocStyle::pdf_document(), output_file = "test.tex"
generates a PDF and output_format = BiocStyle::pdf_document(), output_file = "test.pdf"
throws an error!
but it definitely seems weird that output_format = BiocStyle::pdf_document(), output_file = "test.tex" generates a PDF and output_format = BiocStyle::pdf_document(), output_file = "test.pdf" throws an error!
I can understand - it is not like that with all formats. Somehow in all relies on what BiocStyle::pdf_document()
is using internally.
output_format = BiocStyle::pdf_document(), output_file = "test.pdf"
should work but BiocStyle requires a tex post processing before rendering the PDF, with does not happen unless render()
is tricked to produce a tex
file only and PDF conversion happens in the post processor. bookdown takes that into account but the double latexmk()
rendering does not seem right. I'll look into that.
This output_format()
mechanism is not really documented and a lot of different things happens in render()
depending on arguments. One day we'll document and it may unravel more than this odd behavior 😅
Thanks a lot for this discussion ! We'll work on rmarkdown early next year so we'll probably modify this.
Checklist
When filing a bug report, please check the boxes below to confirm that you have provided us with the information we need. Have you:
[x] formatted your issue so it is easier for us to read?
[x] included a minimal, self-contained, and reproducible example?
[x] pasted the output from
xfun::session_info('rmarkdown')
in your issue?[x] upgraded all your packages to their latest versions (including your versions of R, the RStudio IDE, and relevant R packages)?
[x] installed and tested your bug with the development version of the rmarkdown package using
remotes::install_github("rstudio/rmarkdown")
?This issue concerns the BiocStyle package and producing a PDF vignette from R Markdown. This issue has initially been tracked in https://github.com/Bioconductor/BiocStyle/issues/93 I'm happy to incorporate necessary changes into BiocStyle, but wanted to report the unexpected behaviour here.
Consider the following minimal Rmd file (referred to as
test.Rmd
below) using theBiocStyle::pdf_document
output format:However it fails when supply an
output_file
argument. (In the example below I've usedtest1.pdf
to distinguish between files, but it fails with the same error even if you use is the same as the default i.e.test.pdf
).Examining the two TeX files produced, the line containing the inline code differs, and seems to have some escaped spaces in the version that fails to compile. The two lines are show below for comparison.
Normally BiocStyle is aware of this issue and attempts to remove the escaped spaces before Latex is called. However this patch is not applied when we specify the name of an output file with ".pdf" extension.
I believe the difference in output starts with the following code block:
https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L487-L491
In our example where
output_file
is unspecified then it will be assignedtest.tex
(the value ofoutput_auto
) with the file extension reflecting the definition inoutput_format$pandoc$ext
. When we provideoutput_file = "test1.pdf"
as an argument it obviously has ".pdf" as the extension.This then affects the following code block, which is executed only in the case where we've specified
output_file = "test1.pdf"
:https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L974-L982
In the case of
output_file = "test1.pdf"
the latex compilation is carried out immediately at this step vialatexmk()
and fails due to something in the BiocStyle template. However whenoutput_file
has a ".tex" extension compilation is deferred and carried out as part ofoutput_format$post_processor
, which also patches the template issues. This is based on the technique used in bookdown.Hopefully the above makes sense! I'm not sure if we're doing something unsupported in BiocStyle but it seems the crux of this issue is that specifying the output file name will override the setting defined by our
output_format$pandoc
, and cause latex compilation to happen earlier than it otherwise would. I wanted to check whether this is the expected behaviour. I believe the issue is also present (but silently works) if you useoutput_format = bookdown::pdf_document2
. Therelatexmk()
will be called twice - once inrender()
and once in theoutput_format$post_processor.
with the second overwriting the first.