pandoc / lua-filters

A collection of lua filters for pandoc
MIT License

pagebreak filter doesn't work with Commonmark #255

Closed dmurdoch closed 1 year ago

dmurdoch commented 1 year ago

The pagebreak.lua filter depends on the raw_tex extension on the markdown reader, but that extension is not supported by commonmark or commonmark_x. This results in \pagebreak or \newpage being written to the output file with the backslash escaped, so the macro is visible instead of being translated into a page break.

Example: working in the lua-filters/pagebreak directory, this command

  pandoc --from commonmark --to pdf sample.md -o sample.pdf --lua-filter pagebreak.lua

produces this output:

[Screenshot: PDF output in which the macro appears as literal text]

The solution is to look for the macros in the Para() function of the filter. A complication is that commonmark+sourcepos splits the macros into two parts and wraps them in Span elements, so the Para() function needs to handle that case too.

tarleb commented 1 year ago

You can make this work in CommonMark with

```{=latex}
\pagebreak
```

Requires the `raw_attribute` extension, which is enabled by default in `commonmark_x`.

dmurdoch commented 1 year ago

Sure, but my thinking went as follows:

In favour of the change:

Against the change:

But CommonMark doesn't provide a way to enter a page break, so it needs to be some kind of extension, and this seems like a fairly harmless one. People who really want paragraphs containing nothing but \newpage or \pagebreak should just avoid using the filter.

tarleb commented 1 year ago

I think my preferred solution here would be to create a new filter that converts the special paragraphs into LaTeX, e.g.,

function Para (p)
  if is_pagebreak(p) then
    return pandoc.RawBlock('latex', pandoc.utils.stringify(p))
  end
end

Users would run the filter before pagebreak.lua.
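
A fuller sketch of such a pre-filter might look like the following. The helper names `is_pagebreak_text` and `is_pagebreak` are hypothetical (the snippet above leaves `is_pagebreak` undefined), and the matching rule is an assumption: a paragraph counts as a page break only if its text is exactly `\newpage` or `\pagebreak`. Because `pandoc.utils.stringify` flattens nested inlines, a macro that the reader split across Span elements should be rejoined before matching.

```lua
-- Sketch of a pre-filter meant to run before pagebreak.lua.
-- All helper names below are hypothetical, not part of any existing filter.

-- True if `text` consists solely of a LaTeX page-break macro,
-- optionally surrounded by whitespace.
function is_pagebreak_text (text)
  local macro = text:match('^%s*\\(%a+)%s*$')
  return macro == 'newpage' or macro == 'pagebreak'
end

-- True if paragraph `p` contains nothing but such a macro.
-- stringify concatenates the text of all nested inlines, including
-- the contents of Span wrappers.
function is_pagebreak (p)
  return is_pagebreak_text(pandoc.utils.stringify(p))
end

-- Replace a matching paragraph with a raw LaTeX block, so that a
-- later filter can handle it like any other raw page-break command.
function Para (p)
  if is_pagebreak(p) then
    return pandoc.RawBlock('latex', pandoc.utils.stringify(p))
  end
end
```

Saved under a hypothetical name such as `para-pagebreak.lua`, it would be passed with its own `--lua-filter` option ahead of `--lua-filter pagebreak.lua`, since pandoc applies filters in the order given on the command line.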

There are two reasons for that:

  1. It's cleaner.
  2. Making the filter act on Para elements has a significant performance impact; most users should not have to pay that.

I'd be more open to adding support for special divs, so commonmark_x users could write

::: pagebreak
:::

or

{.pagebreak}
---

For plain CommonMark, an HTML-based syntax could be acceptable:

<hr class="pagebreak"/>
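
As a rough illustration of how a filter could recognize these forms, here is a minimal sketch. It rests on assumptions: that the `{.pagebreak}` attribute on a horizontal rule reaches the filter as a wrapping Div with that class, and that the `<hr>` tag arrives as a raw HTML block. The helper names `has_class` and `is_pagebreak_hr` are hypothetical. The sketch emits raw LaTeX `\newpage` so that a subsequent filter run can treat it like any other page-break command.

```lua
-- Hypothetical helper: true if the class list contains `name`.
function has_class (classes, name)
  for _, class in ipairs(classes) do
    if class == name then return true end
  end
  return false
end

-- Hypothetical helper: true if `html` is a lone <hr> tag whose class
-- attribute includes "pagebreak", e.g. <hr class="pagebreak"/>.
function is_pagebreak_hr (html)
  local classes = html:match('^%s*<hr%s+class="([^"]*)"%s*/?>%s*$')
  if not classes then return false end
  for class in classes:gmatch('%S+') do
    if class == 'pagebreak' then return true end
  end
  return false
end

-- A div (or, by assumption, an attribute-wrapped rule) with the class.
function Div (div)
  if has_class(div.classes, 'pagebreak') then
    return pandoc.RawBlock('latex', '\\newpage')
  end
end

-- The HTML-based syntax for plain CommonMark.
function RawBlock (raw)
  if raw.format == 'html' and is_pagebreak_hr(raw.text) then
    return pandoc.RawBlock('latex', '\\newpage')
  end
end
```

Unlike a check on every Para, these handlers only inspect Div and RawBlock elements, which are typically far less common in a document.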

dmurdoch commented 1 year ago

The existing filter already works on Para elements: it looks for a single FF character there. The proposed change makes that test more complicated, so it will be slower, but is it really enough of a difference to be noticeable? (In the context where I'm using it I think the answer is almost certainly no: I run knitr, then Pandoc, then pdflatex. The Pandoc step is almost always very quick compared to the others.)

tarleb commented 1 year ago

You're right. I forgot about that. I'm still hesitant to add this kind of special case here.

dmurdoch commented 1 year ago

Regarding your proposed syntax choices: I think the one using ::: is the most readable, so it's the one I'd choose if new syntax is needed. But the backward compatibility of \pagebreak and its familiarity to people who know LaTeX are still positives for it.

tarleb commented 1 year ago

I've moved the code for the pagebreak filter to pandoc-ext/pagebreak. The code has been updated to be more configurable; it would now be easier to implement the suggested changes without the mentioned drawbacks. PRs welcome.

Closing this here.