rstudio / rmarkdown

Dynamic Documents for R
https://rmarkdown.rstudio.com
GNU General Public License v3.0
2.84k stars 970 forks source link

Wrong post processing pipeline for pdf_document() conversion #2282

Open cderv opened 2 years ago

cderv commented 2 years ago

Found while working #2264,

The post processor will be run after on the output file being a pdf. Until recently, pdf_document() had no post processor and it seemed to work. But there is one today and I don't think it is working correctly. Oddly, it throws a warning only in debug mode.

A recent fix introduced a post processor in the pdf_document() output. https://github.com/rstudio/rmarkdown/commit/5a3e941bbebf14c1a1a797125ffc414f61edc664

This creates some issues I believe but it does not show in CI oddly.

output_format$post_processor() is supposed to run on the output of the Pandoc conversion. While looking into the other issue https://github.com/rstudio/rmarkdown/issues/2264#issuecomment-999476710, it happens that the conversion step (before post processing) will generate a PDF file using latexmk(). when output file is pdf for example. https://github.com/rstudio/rmarkdown/blob/0af6b3556adf6e393b2da23c66c695724ea7bd2d/R/render.R#L974-L981

In that case the post_processor() will try to post process the PDF file https://github.com/rstudio/rmarkdown/blob/5a3e941bbebf14c1a1a797125ffc414f61edc664/R/pdf_document.R#L191-L200

This does not work obviously. Reprex:

dir.create(tmp_dir <- tempfile())
owd <- setwd(tmp_dir)
xfun::download_file("https://raw.githubusercontent.com/rstudio/rmarkdown/main/tests/testthat/test-formats.Rmd")
rmarkdown::render("test-formats.Rmd", rmarkdown::pdf_document(), quiet = TRUE)
#> Warning message:
#> In readLines(con, warn = FALSE) :  invalid input found on input connection 'test-formats.pdf'

The post processing step should happen before latexmk() after Pandoc conversion from .md to .tex. I don't think it is working right now.

Or maybe the latexmk() rendering should happen in the pdf_document() post processing 🤔 But that would mean for user creating custom output that they need to use pdf_document() as base format or do the conversion themself. Maybe room for a pdf_document_base() format that does essential processing ?

I get all those warning message when running tests locally, I am not sure why we don't see them in CI. 🤔

atusy commented 2 years ago

the latexmk() rendering should happen in the pdf_document() post processing

I think this makes the logic very clear. Also, this may fix #2267 as well by using output formats without ` the post processing.

I had an idea to apply latexmk() after post_processor() in render(), however, I threw it away because it breaks output formats that intentionally apply post_processor() to PDF files.

cderv commented 2 years ago

Thanks for the feedback !

however, I threw it away because it breaks output formats that intentionally apply post_processor() to PDF files.

Which format are you thinking off ? I am curious about post processing of PDF files (instead of tex file before PDF conversion)

atusy commented 2 years ago

Not on a public repository, but I used to apply post processing on PDF to outline fonts to ensure it prints safely by a command like

gs -o output.pdf -dNoOutputFonts -sDEVICE=pdfwrite input.pdf
cderv commented 2 years ago

Thanks for sharing. Good to know this could be useful.

Probably for PDF conversion we need to consider current pre_processor / post_processor as hooks for Pandoc conversion (.md to .tex) and we would need to add post processing hook for the PDF conversion by tinytex::latexmk() 🤔

Seems like a piece is missing here.

cderv commented 2 years ago

2320 fixed the issue with the warnings due to wrong post processing. I'll leave this issue open to deal in a later version to fix the way post processing is happening for PDF.