Page-break in other output formats than LaTeX

todd-a-jacobs commented 9 years ago

Pagebreaks Don't Work for Most Output Formats

I have a Markdown file that is supposed to have pagebreaks between certain sections. However, Pandoc 1.10.1 isn't honoring the \newpage or \pagebreak commands when rendering RTF, DOCX, or ODT formatted files. The commands I'm using to invoke pandoc are:

for format in rtf docx odt; do
    pandoc \
        --smart \
        --normalize \
        --standalone \
        --self-contained \
        -f markdown \
        -t $format \
        --output="${FILE/markdown/$format}" \
        "$FILE"
    echo "Created ${FILE/markdown/$format}"
done

PDF Seems to Work

However, the PDF format (which requires a slightly different invocation because it doesn't respect the -t flag) seems to respect the pagebreak requests. For example:

pandoc \
    --standalone \
    --normalize \
    --smart \
    --self-contained \
    --from=markdown \
    --output="${FILE/markdown/pdf}" \
    "$FILE"
echo "Created ${FILE/markdown/pdf}"

jgm commented 9 years ago

Correct, pandoc's internal document model does not currently contain anything corresponding to a page break, so there is no way to convert these. In principle a PageBreak element could be added. It's also possible to work around this deficiency using pandoc filters.

todd-a-jacobs commented 9 years ago

A PageBreak element would be great, but I'd be happy to use a filter in the meantime. However, I'm not sure what's entailed in doing so. How would I generate a DOCX with forced page breaks using a filtering mechanism?

jkr commented 9 years ago

@CodeGnome : see this thread for some hints on setting up a filter for pagebreaks in docx output:

https://groups.google.com/forum/#!searchin/pandoc-discuss/pagebreak/pandoc-discuss/FzLrhk0vVbU/GtSHaI0jddAJ

s7726 commented 9 years ago

@CodeGnome If your page breaks always fall before a given heading level, you can just set the "page break before" property for that heading style.

Hi-Angel commented 8 years ago

I am also voting for this feature to be added; many formats have something corresponding to a page break (even CSS has properties like page-break-*).

Hi-Angel commented 8 years ago

Hi, I'm just looking through the code in the hope of adding the pagebreak and some other features, and I found, well… Has @jgm noticed the two-year-old pull request?

jgm commented 8 years ago

Adding a NewPage element to the definition and builder is trivial. But then you need to support it in every reader and writer; that's a lot more work.

oadam commented 7 years ago

If a pull request adding support for NewPage were submitted (including support in every reader and writer), would it be accepted? I really need this feature and I'm ready to spend time on this.

jgm commented 7 years ago

Yes, I'd accept it if it's of good quality.

Note, it requires a breaking change in pandoc-types. I'd like to make a new release soon of pandoc-types (which already has breaking changes) and pandoc. If you plan to do this soon I could wait a bit.

How do you propose to treat output formats with nothing corresponding to a page break?

Would it make sense, perhaps, to render it as a

Div ("",["pagebreak"],[]) []

which could at least be intercepted in filters? This could even be a native pandoc way of creating it.

oadam commented 7 years ago

I'll follow whatever recommendation you give :-)

If your code snippet means empty div with a pagebreak css class then yes that might be a good idea (it could be parsed as well by the html reader).

Maybe the writer could even add an inline style attribute with page-break-after: always?

No need to wait for this before pushing your breaking change. To be honest, I won't look into it before at least a few weeks but it's definitely something that is on my business' road-map.

s7726 commented 7 years ago

Putting a class on an empty div won't work (or at least be portable).

http://www.w3schools.com/cssref/pr_print_pageba.asp

Note: You cannot use this property on an empty <div> or on absolutely positioned elements.

I recently found the page-break-avoid property. I applied it to the elements that contained figures that needed to stay with that particular step in a procedure.

tarleb commented 7 years ago

MDN states on page-break-before (emphasis mine):

It won't apply on an empty <div> that won't generate a box.

I guess with a little bit of CSS hackery, the div could still be made to generate a box.

jgm commented 7 years ago

OK, that's good to know. So implementing a page break in the HTML writer might be nontrivial...but it's also not really essential -- I think it would be okay if we just supported formats that typically produce paginated output (latex, docx, etc.).


Jmuccigr commented 7 years ago

Would definitely like to see this.

And really would like to see printed html handle this too, but that's probably out of scope for pandoc.

mb21 commented 7 years ago

Some observations on how different formats handle page breaks:

From the perspective of HTML/CSS, page breaking is about layout, not structure, and is thus implemented in CSS (with the page-break-before and page-break-after properties, as supported by wkhtmltopdf – note that they might be superseded by break-before and break-after but browser support is not forthcoming). As has been noted, these can only be applied to block level elements and the intended usage is to apply them to headers or section divs.

In some reStructuredText processors, a pagebreak can apparently also be achieved by a block-level directive.

On the other hand, in more imperative document models (ODT, docx, etc.), pagebreak usually seems to be an inline element. The pandoc AST already has inline LineBreak and SoftBreak elements, and one possible implementation would be to replace them with an inline Break element that has an attribute type=line, type=soft, type=page, type=column, etc. Note that implementing a native pandoc pagebreak element as inline is more general than a block element, since the block element can always be simulated by wrapping an inline in an otherwise empty paragraph.

Finally, from the perspective of markdown, I would probably use something like this:

------- {.pagebreak}

fabtho commented 7 years ago

I would like to see this implemented. I just tried to write a pandoc filter to get page breaks when converting Markdown to ODT, but had no success. (I used the source from Google Groups, as mentioned above.)

link2xt commented 7 years ago

Muse format also has pagebreaks: http://amusewiki.org/library/manual#toc7

mb21 commented 6 years ago

btw, iA Writer pagebreak syntax is:

+++

which produces:

<div style="page-break-before: always;"></div>

which webkit-based browsers seem to understand.

autotel commented 5 years ago

another nice workaround:

- insert a horizontal line (-----------------)
- format the "horizontal line" style to break a page and be invisible, using the text editor (LibreOffice in my case)

grenade commented 5 years ago

@CodeGnome : see this thread for some hints on setting up a filter for pagebreaks in docx output:

https://groups.google.com/forum/#!searchin/pandoc-discuss/pagebreak/pandoc-discuss/FzLrhk0vVbU/GtSHaI0jddAJ

thanks for this! i went down this rabbit hole today. it was my first foray into haskell and i'm pleased to say that i am now standing next to a completely bald yak¹. here's what happened:

the problem:

i have a github gist containing markdown files. i have a react app that transforms these markdown files into an html web page. i wanted a way to transform the same markdown files into a hosted google doc that has built in docx and pdf output formats.

the solution:

write some bash that combines all of the gist's markdown files into a single markdown file and use pandoc to transform the markdown into docx format that can be uploaded as a google doc.

the implementation:

use jq and the github gist api to produce a file containing the combined markdown
- the trick here is to insert a separator (\n\n\\newpage\n\n) between the individual markdown files that pandoc can interpret as a block paragraph containing only a page-break.
run pandoc against the combined markdown file to convert it into docx format
- here the trick is to correctly interpret the page-break separator tokens and use a filter to replace them with the correct docx xml separator syntax (<w:p><w:r><w:br w:type="page"/></w:r></w:p>).
- create a haskell code file (docx-page-filter.hs) containing the filter (thank you Joel Allen and John MacFarlane):

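{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings lets the string literals below type-check whether
-- pandoc-types represents text as String (older releases) or Text (newer ones).
import Text.Pandoc.JSON

-- A raw OpenXML paragraph containing only a page break.
pagebreakBlock :: Block
pagebreakBlock = RawBlock (Format "openxml") "<w:p><w:r><w:br w:type=\"page\"/></w:r></w:p>"

-- Replace a paragraph consisting solely of \newpage with the page break.
blockSwapper :: Block -> Block
blockSwapper (Para [Str "\\newpage"]) = pagebreakBlock
blockSwapper blk = blk

main :: IO ()
main = toJSONFilter blockSwapper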

the code above requires compiling but ghc --make -v docx-page-filter.hs throws an error about not being able to import Text.Pandoc.JSON. i don't know what version of ghc was already installed on my fedora-30 system or where it came from.
- download and install the distro build tools, the package manager and the pandoc dependencies:
  
  sudo dnf install ghc
  sudo dnf install cabal-install
  cabal update
  cabal install pandoc
- go have a coffee now. maybe even go for a run or mow the lawn. you have some time...

if everything compiles, you can run a command like this to perform the conversion:

pandoc combined.md --from gfm --filter docx-page-filter --to docx --output converted.docx
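
A minimal sketch of the combining step described above, assuming the individual markdown files have already been downloaded to disk (the jq and gist API part is omitted). The function name combine_markdown and the file names in the usage line are hypothetical:

```shell
# combine_markdown: print the given markdown files one after another,
# placing a paragraph that contains only \newpage between consecutive files.
combine_markdown() {
    first=1
    for f in "$@"; do
        if [ "$first" -eq 0 ]; then
            # Blank lines on both sides make \newpage its own paragraph.
            printf '\n\n\\newpage\n\n'
        fi
        cat "$f"
        first=0
    done
}
```

It could then be used as `combine_markdown intro.md usage.md > combined.md` before running the pandoc command above.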

tarleb commented 4 years ago

The Lua filters repository has a pagebreak filter which converts raw \newpage commands into page breaks for most formats.

ghost commented 4 years ago

I wanted to note that Epub3 supports page breaks as well, although for possibly different use cases.

A page list and page break indicators allow users in mixed print-digital environments to coordinate their positions.

This is nice for preserving information about page numbers (e.g. for citations, printing, or accessibility such as audio cues) without interfering with the document layout.

It supports both in-line and block page breaks.

An empty span element identifies a page break inside a block element. It is identified as a page break using the role attribute with the value doc-pagebreak. The aria-label attribute provides an announceable value.
<p>
…
<span role="doc-pagebreak" id="pg24" aria-label="24"/>
…
</p>
A div element identifies a page break where inline elements are not allowed. This example shows an example of a page number that is intended to be visible in the content.
<div role="doc-pagebreak" id="pg24">24</div>

Some notes:

- would need to keep a counter to mark the page numbers
- intended to be placed at page beginnings, rather than endings
- cannot be placed inside lists

My personal preference is for formfeed chars to be interpreted as page breaks, at least in markdown. I use the pdftotext CLI to produce formfeed-delimited text files that can be turned into markdown for pandoc, and it would be great if those could be preserved.

jeffmcneill commented 4 years ago

This might be somewhat related. Pagebreaks seem to be automatically supported in markdown->pdf in terms of H1s being recognized as new section headers, using:

  \usepackage{titlesec} 
  \newcommand{\sectionbreak}{\clearpage}

Also, when markdown->epub the same section headers H1 are recognized and page breaks are implemented. All fine and dandy.

I'm wondering if it is possible somehow to have H2s recognized as section breaks as well. The main reason is because I need to have both H1 and H2 act as section breaks (page breaks).

Ok, I've worked through these issues, and here is how I've dealt with them, so far: I've added \pagebreak before each new H2, that takes care of the latex/pdf side. For epub, I added the style:

h2 {display: block;
    page-break-before: always; /* CSS 2 */
    break-before: page;   /* CSS 3+ */ }

That seems to take care of the epub side.

If anyone has additional suggestions/options especially for the latex/pdf side, that would be great, but otherwise I've got it working.

jgm commented 4 years ago

Try the same thing with \subsectionbreak?

jeffmcneill commented 4 years ago

@jgm Excellent! It also suppresses a page break if an H2 directly follows an H1, which is what I want. I can't seem to do that with Epub/CSS, but an extra page is less of an issue in an ebook, whereas one has to pay for each page in print.

  \usepackage{titlesec} 
  \newcommand{\sectionbreak}{\clearpage} 
  \newcommand{\subsectionbreak}{\clearpage}

Here is documentation of the various section commands that can be used with package titlesec. http://tug.ctan.org/tex-archive/macros/latex/contrib/titlesec/titlesec.pdf
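
One way to apply these titlesec lines is to put them in a header file and pass it to pandoc with --include-in-header. This is a minimal sketch, not necessarily how it was done here; the file names pagebreak-header.tex, input.md and output.pdf are assumptions:

```shell
# Write the titlesec snippet to a header file (the file name is an assumption).
cat > pagebreak-header.tex <<'EOF'
\usepackage{titlesec}
\newcommand{\sectionbreak}{\clearpage}
\newcommand{\subsectionbreak}{\clearpage}
EOF

# Then include it when producing the PDF, for example:
#   pandoc input.md --include-in-header=pagebreak-header.tex --output=output.pdf
```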

SandeepNaidu commented 4 years ago

This still does not work for pandoc export to docx!

gmile commented 4 years ago

Had to introduce page breaks to html files that are being converted to .docx, ended up with this script in Lua:

function Para (el)
  if #el.content == 1 and el.content[1].text == "Pagebreak" then
    return pandoc.RawBlock('openxml', '<w:p><w:r><w:br w:type="page"/></w:r></w:p>')
  end
end

return {
  {Para = Para}
}

Given the following input:

<html>
  <body>
    <p>Page 1</p>
    <p>Pagebreak</p>
    <p>Page 2</p>
    <p>Pagebreak</p>
    <p>Page 3</p>
  </body>
</html>

It can be used like this:

pandoc input.html \
  --standalone \
  --lua-filter pagebreak.lua \
  --reference-doc my_styles.docx \
  --output output.docx

tarikgraba commented 3 years ago

Hi there,

Can support for <?asciidoc-pagebreak?> be added to the XML DocBook reader? This tag is generated by asciidoctor/asciidoc when inserting a page break.

It would be great to be able to convert DocBook to LaTeX without losing this info.

dwojtas commented 2 years ago

Hi, I see no response to the <?asciidoc-pagebreak?> support request for the DocBook reader; I would also benefit from this. I am processing documents

- from asciidoc to docbook using asciidoctor
- from docbook to docx using pandoc with a custom docx template.

The results are beautiful, but I must always post-process them by hand with Ctrl+Return to insert page breaks at new chapters.

jgm commented 2 years ago

Can support for <?asciidoc-pagebreak?> be added to the XML DocBook reader?

There's no native AST element corresponding to a page break.

leogama commented 2 years ago

The R package rmarkdown has a good page break filter: https://github.com/rstudio/rmarkdown/blob/main/inst/rmarkdown/lua/pagebreak.lua

jgm / pandoc

Page-break in other output formats than LaTeX #1934
