pagebreak filter doesn't work with Commonmark

dmurdoch commented 1 year ago

The pagebreak.lua filter depends on the raw_tex extension of the Markdown reader, but that extension is not supported by commonmark or commonmark_x. As a result, \pagebreak or \newpage is written to the output file with the backslash escaped, so the macro is visible instead of being translated into a page break.

Example: working in the lua-filters/pagebreak directory, this command

  pandoc --from commonmark --to pdf sample.md -o sample.pdf --lua-filter pagebreak.lua

produces this output:

[Screenshot: PDF output in which the literal macro text is visible instead of a page break]

The solution is to look for the macros in the Para() function of the filter. A complication is that commonmark+sourcepos splits each macro into two parts and wraps them in Span elements, so the Para() function needs to handle that case too.

tarleb commented 1 year ago

You can make this work in CommonMark with

```{=latex}
\pagebreak
```

Requires the `raw_attribute` extension which is enabled by default in `commonmark_x`.

dmurdoch commented 1 year ago

Sure, but my thinking went as follows:

In favour of the change:

There are a lot of existing documents using the simpler syntax, and they'll all be broken if Pandoc transitions to CommonMark without this change. It was one of the first issues I saw when I tried to use the sourcepos extension in R Markdown documents.
Markdown is supposed to be readable, and a bare \pagebreak is more readable than the fenced solution.

Against the change:

It doesn't fit the CommonMark design very well, which is the reason the raw_tex extension is incompatible with the commonmark reader. The spec says "Backslashes before other characters are treated as literal backslashes".

But CommonMark doesn't provide a way to enter a page break, so it needs to be some kind of extension, and this seems like a fairly harmless one. People who really want paragraphs containing nothing but \newpage or \pagebreak should just avoid using the filter.

tarleb commented 1 year ago

I think my preferred solution here would be to create a new filter that converts the special paragraphs into LaTeX, e.g.,

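-- is_pagebreak is a placeholder for a predicate that recognizes the
-- special paragraphs; it is not defined here.
function Para (p)
  if is_pagebreak(p) then
    return pandoc.RawBlock('latex', pandoc.utils.stringify(p))
  end
end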

Users would run the filter before pagebreak.lua.
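
For illustration only, here is a minimal sketch of what such a pre-filter might look like. It is not code from this thread. `macro_text` is a hypothetical helper, and the sketch assumes that with `+sourcepos` the macro arrives as `Str` pieces wrapped in `Span` elements, as described in the first comment.

```lua
-- Hypothetical pre-filter; meant to run before pagebreak.lua.
-- Paragraph texts that should be treated as page-break macros.
local PAGEBREAK_MACROS = { ['\\pagebreak'] = true, ['\\newpage'] = true }

--- Concatenate the text of a list of inlines, descending into Span
--- wrappers (assumed to be what +sourcepos adds). Returns nil as soon
--- as anything other than Str or Span is found.
function macro_text (inlines)
  local parts = {}
  for _, inline in ipairs(inlines) do
    if inline.t == 'Str' then
      parts[#parts + 1] = inline.text
    elseif inline.t == 'Span' then
      local inner = macro_text(inline.content)
      if not inner then return nil end
      parts[#parts + 1] = inner
    else
      return nil
    end
  end
  return table.concat(parts)
end

--- True if the paragraph consists of nothing but \pagebreak or \newpage.
function is_pagebreak (para)
  local text = macro_text(para.content)
  return text ~= nil and PAGEBREAK_MACROS[text] == true
end

-- Replace matching paragraphs with raw LaTeX so that pagebreak.lua
-- can pick them up afterwards.
function Para (para)
  if is_pagebreak(para) then
    return pandoc.RawBlock('latex', macro_text(para.content))
  end
end
```

Assuming the sketch is saved as `pagebreak-para.lua` (a made-up file name), it would be passed with its own `--lua-filter` option ahead of the one for `pagebreak.lua`, since filters run in the order given on the command line.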

There are two reasons for that:

It's cleaner.
Making the filter act on Para elements has a significant performance impact; most users should not have to pay that.

I'd be more open to adding support for special divs, so commonmark_x users could write

::: pagebreak
:::

or

{.pagebreak}
---

For plain CommonMark, an HTML-based syntax could be acceptable:

<hr class="pagebreak"/>
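
A rough sketch of how the div form and the HTML form could be recognized, again as a hypothetical pre-filter rather than anything that exists in pagebreak.lua. `has_class` and `is_pagebreak_hr` are made-up helper names. The sketch assumes the `<hr>` line reaches the filter as a raw HTML block, and that any content inside a `pagebreak` div can be discarded.

```lua
-- Hypothetical pre-filter for the class-based syntaxes; meant to run
-- before pagebreak.lua, which then handles the raw LaTeX block.

--- True if the list of class names contains `name`.
function has_class (classes, name)
  for _, class in ipairs(classes) do
    if class == name then return true end
  end
  return false
end

--- True if the raw HTML text is a lone <hr> tag with class="pagebreak".
function is_pagebreak_hr (html)
  return html:match('^%s*<hr%s+class="pagebreak"%s*/?>%s*$') ~= nil
end

-- ::: pagebreak ... ::: becomes a raw LaTeX page break.
-- Assumption: whatever the div contains is dropped.
function Div (div)
  if has_class(div.classes, 'pagebreak') then
    return pandoc.RawBlock('latex', '\\newpage')
  end
end

-- <hr class="pagebreak"/> becomes a raw LaTeX page break.
function RawBlock (raw)
  if raw.format == 'html' and is_pagebreak_hr(raw.text) then
    return pandoc.RawBlock('latex', '\\newpage')
  end
end
```

The pattern is deliberately strict: it only accepts the exact attribute spelling shown above, so an `<hr>` with extra attributes or single quotes would not match.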

dmurdoch commented 1 year ago

The existing filter already works on Para elements: it looks for a single form feed (FF) character there. The proposed change makes that test more complicated, so it will be slower, but is it really enough of a difference to be noticeable? (In the context where I'm using it I think the answer is almost certainly no: I run knitr, then Pandoc, then pdflatex. The Pandoc step is almost always very quick compared to the others.)

tarleb commented 1 year ago

You're right. I forgot about that. I'm still hesitant to add this kind of special case here.

dmurdoch commented 1 year ago

Regarding your proposed syntax choices: I think the one using ::: is the most readable, so it's the one I'd choose if new syntax is needed. But the backward compatibility of \pagebreak (and its familiarity to people who know LaTeX) are still positives for it.

tarleb commented 1 year ago

I've moved the code for the pagebreak filter to pandoc-ext/pagebreak. The code has been updated to be more configurable; it would now be easier to implement the suggested changes without the mentioned drawbacks. PRs welcome.

Closing this here.

pandoc / lua-filters

pagebreak filter doesn't work with Commonmark #255