Open todd-a-jacobs opened 9 years ago
Correct, pandoc's internal document model does not currently contain anything corresponding to a page break, so there is no way to convert these. In principle a PageBreak element could be added. It's also possible to work around this deficiency using pandoc filters.
A PageBreak element would be great, but I'd be happy to use a filter in the meantime. However, I'm not sure what's entailed in doing so. How would I generate a DOCX with forced page breaks using a filtering mechanism?
@CodeGnome : see this thread for some hints on setting up a filter for pagebreaks in docx output:
@CodeGnome If your page breaks happen to be prior to a given heading level, you can just set the page break before property for that heading style.
I am also voting for the feature to be added; many formats have something corresponding to a page break (even CSS has properties like page-break-*).
Hi, I'm just looking through the code in the hope of adding the pagebreak and some other features, and I found, well… Has @jgm noticed the two-year-old pull request?
Adding a NewPage element to the definition and builder is trivial. But then you need to support it in every reader and writer; that's a lot more work.
If a pull request adding support for NewPage was submitted (including support in every reader and writer), would it be accepted ? I really need this feature and I'm ready to spend time on this.
Yes, I'd accept it if it's of good quality.
Note, it requires a breaking change in pandoc-types. I'd like to make a new release soon of pandoc-types (which already has breaking changes) and pandoc. If you plan to do this soon I could wait a bit.
How do you propose to treat output formats with nothing corresponding to a page break?
Would it make sense, perhaps, to render it as a
Div ("",["pagebreak"],[]) []
which could at least be intercepted in filters? This could even be a native pandoc way of creating it.
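To illustrate the "intercepted in filters" idea, here is a minimal Python sketch that works directly on pandoc's JSON AST. It assumes the page break arrives as the empty Div proposed above (class pagebreak); the RAW_PAGEBREAKS table and the function names are hypothetical, and only top-level blocks are handled.

```python
# Sketch only: swap an empty Div with class "pagebreak" for a raw block
# that the target format understands. Names below are illustrative.

# Raw snippets per output format (an assumption of this sketch, not a pandoc table).
RAW_PAGEBREAKS = {
    "docx": ("openxml", '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'),
    "latex": ("latex", "\\newpage"),
}


def is_pagebreak_div(block):
    """True for an empty Div whose class list contains "pagebreak"."""
    if block.get("t") != "Div":
        return False
    (_ident, classes, _attrs), content = block["c"]
    return "pagebreak" in classes and content == []


def replace_pagebreaks(blocks, target):
    """Replace top-level pagebreak Divs; leave everything else untouched."""
    raw = RAW_PAGEBREAKS.get(target)
    if raw is None:
        # No known page break for this format: keep the Div as-is.
        return blocks
    fmt, text = raw
    return [
        {"t": "RawBlock", "c": [fmt, text]} if is_pagebreak_div(b) else b
        for b in blocks
    ]
```

Used as a JSON filter, a script would load the document from stdin with json.load, apply replace_pagebreaks to doc["blocks"] with the target format that pandoc passes as the first command-line argument, and write the result to stdout with json.dump.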
I'll follow whatever recommendation you give :-)
If your code snippet means an empty div with a pagebreak CSS class, then yes, that might be a good idea (it could be parsed by the HTML reader as well).
Maybe the writer could even add an inline style attribute with page-break-after: always?
No need to wait for this before pushing your breaking change. To be honest, I won't look into it before at least a few weeks but it's definitely something that is on my business' road-map.
Putting a class on an empty div won't work (or at least be portable).
http://www.w3schools.com/cssref/pr_print_pageba.asp
Note: You cannot use this property on an empty <div> or on absolutely positioned elements.
I recently found the page-break-avoid property. I applied it to <div>'s that contained figures that needed to stay with that particular step in a procedure.
MDN states on page-break-before (emphasis mine):
It won't apply on an empty <div> that won't generate a box.
I guess with a little bit of CSS hackery, the div could still be made to generate a box.
OK, that's good to know. So implementing a page break in the HTML writer might be nontrivial...but it's also not really essential -- I think it would be okay if we just supported formats that typically produce paginated output (latex, docx, etc.).
Would definitely like to see this.
And really would like to see printed html handle this too, but that's probably out of scope for pandoc.
Some observations on how different formats handle page breaks:
From the perspective of HTML/CSS, page breaking is about layout, not structure, and is thus implemented in CSS (with the page-break-before and page-break-after properties, as supported by wkhtmltopdf; note that they might be superseded by break-before and break-after, but browser support is not forthcoming). As has been noted, these can only be applied to block-level elements, and the intended usage is to apply them to headers or section divs.
In some restructured-text processors, a pagebreak can apparently also be achieved by a block level directive.
On the other hand, in more imperative document models (ODT, docx, etc.), a pagebreak usually seems to be an inline element. The pandoc AST already has inline LineBreak and SoftBreak elements, and one possible implementation would be to replace them with an inline Break element that has an attribute type=line, type=soft, type=page, type=column, etc. Note that implementing a native pandoc pagebreak element as inline is more general than a block element, since the block element can always be simulated by wrapping an inline in an otherwise empty paragraph.
Finally, from the perspective of markdown, I would probably use something like this:
------- {.pagebreak}
I would like to see this implemented. I just tried to write a filter for pandoc to get page breaks when converting md to ODT, but had no success. (I used the source on Google Groups, as mentioned above.)
Muse format also has pagebreaks: http://amusewiki.org/library/manual#toc7
btw, iA Writer pagebreak syntax is:
+++
which produces:
<div style="page-break-before: always;"></div>
which webkit-based browsers seem to understand.
another nice workaround:
-----------------
@CodeGnome : see this thread for some hints on setting up a filter for pagebreaks in docx output:
thanks for this! i went down this rabbit hole today. it was my first foray into haskell and i'm pleased to say that i am now standing next to a completely bald yak¹. here's what happened:
the problem:
i have a github gist containing markdown files. i have a react app that transforms these markdown files into an html web page. i wanted a way to transform the same markdown files into a hosted google doc that has built in docx and pdf output formats.
the solution:
write some bash that combines all of the gist's markdown files into a single markdown file and use pandoc to transform the markdown into docx format that can be uploaded as a google doc.
the implementation:
… (\n\n\\newpage\n\n) between the individual markdown files that pandoc can interpret as a block paragraph containing only a page-break.
… (<w:p><w:r><w:br w:type=\"page\"/></w:r></w:p>).
… (docx-page-filter.hs) containing the filter (thank you Joel Allen and John MacFarlane):
import Text.Pandoc.JSON
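-- Raw OpenXML for a paragraph that contains only a page break.
pagebreakXml :: String
pagebreakXml = "<w:p><w:r><w:br w:type=\"page\"/></w:r></w:p>"
pagebreakBlock :: Block
pagebreakBlock = RawBlock (Format "openxml") pagebreakXml
-- Swap a paragraph consisting solely of \newpage; pass every other block through.
blockSwapper :: Block -> Block
blockSwapper (Para [Str "\\newpage"]) = pagebreakBlock
blockSwapper blk = blk
main :: IO ()
main = toJSONFilter blockSwapper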
the code above requires compiling but ghc --make -v docx-page-filter.hs throws an error about not being able to import Text.Pandoc.JSON. i don't know what version of ghc was already installed on my fedora-30 system or where it came from.
download and install the distro build tools, the package manager and the pandoc dependencies:
sudo dnf install ghc
sudo dnf install cabal-install
cabal update
cabal install pandoc
go have a coffee now. maybe even go for a run or mow the lawn. you have some time...
if everything compiles, you can run a command like this to perform the conversion:
pandoc combined.md --from gfm --filter docx-page-filter --to docx --output converted.docx
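The combining step described above (joining the gist's markdown files with a \newpage paragraph between them) was done in bash in the original comment. As a rough Python equivalent, assuming UTF-8 files and hypothetical function names:

```python
from pathlib import Path

# Blank lines on both sides make \newpage a paragraph of its own,
# which is what the filter above pattern-matches on.
PAGE_BREAK_MARKER = "\n\n\\newpage\n\n"


def combine_markdown(texts):
    """Join markdown documents, separated by a standalone \\newpage paragraph."""
    return PAGE_BREAK_MARKER.join(t.strip("\n") for t in texts)


def combine_files(paths, output):
    """Read each file in order and write the combined document to `output`."""
    texts = [Path(p).read_text(encoding="utf-8") for p in paths]
    Path(output).write_text(combine_markdown(texts) + "\n", encoding="utf-8")
```

For instance, combine_files(["one.md", "two.md"], "combined.md") (file names are placeholders) would produce the combined.md that the pandoc command above consumes.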
The Lua filters repository has a pagebreak filter which converts raw \newpage commands into page breaks for most formats.
I wanted to note that Epub3 supports page breaks as well, although for possibly different use cases.
A page list and page break indicators allow users in mixed print-digital environments to coordinate their positions.
This is nice for preserving information about page numbers (e.g. for citations, printing, or accessibility such as audio cues) without interfering with the document layout.
It supports both in-line and block page breaks.
An empty span element identifies a page break inside a block element. It is identified as a page break using the role attribute with the value doc-pagebreak. The aria-label attribute provides an announceable value.
<p> … <span role="doc-pagebreak" id="pg24" aria-label="24"/> … </p>
A div element identifies a page break where inline elements are not allowed. This example shows an example of a page number that is intended to be visible in the content.
<div role="doc-pagebreak" id="pg24">24</div>
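As a small illustration of the two EPUB 3 forms quoted above, the following Python sketch builds the inline and block markers for a given page label. The function names and the pg id prefix convention are taken from the examples or invented for illustration; this is not something pandoc emits.

```python
from html import escape


def inline_pagebreak(page):
    """Empty span marking a page break inside a block element."""
    label = escape(str(page), quote=True)
    return f'<span role="doc-pagebreak" id="pg{label}" aria-label="{label}"/>'


def block_pagebreak(page):
    """Div marking a page break where inline elements are not allowed;
    the page number stays visible in the content."""
    label = escape(str(page), quote=True)
    return f'<div role="doc-pagebreak" id="pg{label}">{label}</div>'
```

The sketch assumes simple page labels such as 24 or xii; labels containing spaces would need a separate, sanitized id.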
Some notes:
My personal preference is for formfeed chars to be interpreted as page breaks, at least in markdown. I use the pdftotext CLI to produce formfeed-delimited text files that can be turned into markdown for pandoc, and it would be great if those could be preserved.
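Until form feeds are interpreted natively, one possible workaround is to rewrite them before the text reaches pandoc. The Python sketch below turns each form feed into a standalone \newpage paragraph, which the page break filters mentioned in this thread can then pick up. The function name is hypothetical, and dropping a trailing form feed is an assumption about how the input file ends.

```python
MARKER = "\n\n\\newpage\n\n"


def formfeeds_to_newpage(text):
    """Replace form-feed page delimiters with standalone \\newpage paragraphs."""
    pages = [page.strip("\n") for page in text.split("\f")]
    # A form feed at the very end would otherwise yield an empty last page.
    if pages and pages[-1] == "":
        pages.pop()
    return MARKER.join(pages) + "\n"
```

The result is plain markdown, so it can be fed to pandoc together with any filter that matches a paragraph containing only \newpage.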
This might be somewhat related. Pagebreaks seem to be automatically supported in markdown->pdf in terms of H1s being recognized as new section headers, using:
\usepackage{titlesec}
\newcommand{\sectionbreak}{\clearpage}
Also, when markdown->epub the same section headers H1 are recognized and page breaks are implemented. All fine and dandy.
I'm wondering if it is possible somehow to have H2s recognized as section breaks as well. The main reason is because I need to have both H1 and H2 act as section breaks (page breaks).
Ok, I've worked through these issues, and here is how I've dealt with them, so far: I've added \pagebreak before each new H2, that takes care of the latex/pdf side. For epub, I added the style:
h2 {
  display: block;
  page-break-before: always; /* CSS 2 */
  break-before: page; /* CSS 3+ */
}
That seems to take care of the epub side.
If anyone has additional suggestions/options especially for the latex/pdf side, that would be great, but otherwise I've got it working.
Try the same thing with \subsectionbreak?
@jgm Excellent! It also suppresses a page break if an H2 directly follows an H1, which is what I want. I can't seem to do that with Epub/CSS, but an extra page is less of an issue in an ebook, whereas one has to pay for each page in print.
\usepackage{titlesec}
\newcommand{\sectionbreak}{\clearpage}
\newcommand{\subsectionbreak}{\clearpage}
Here is documentation of the various section commands that can be used with package titlesec. http://tug.ctan.org/tex-archive/macros/latex/contrib/titlesec/titlesec.pdf
This still does not work for pandoc export to docx!
Had to introduce page breaks to html files that are being converted to .docx, ended up with this script in Lua:
function Para (el)
  if #el.content == 1 and el.content[1].text == "Pagebreak" then
    return pandoc.RawBlock('openxml', '<w:p><w:r><w:br w:type="page"/></w:r></w:p>')
  end
end
return {
  {Para = Para}
}
Given the following input:
<html>
<body>
<p>Page 1</p>
<p>Pagebreak</p>
<p>Page 2</p>
<p>Pagebreak</p>
<p>Page 3</p>
</body>
</html>
It can be used like this:
pandoc input.html \
--standalone \
--lua-filter pagebreak.lua \
--reference-doc my_styles.docx \
--output output.docx
Hi there,
Can support for <?asciidoc-pagebreak?> be added to the XML DocBook reader?
This tag is generated by asciidoctor/asciidoc when inserting a page break.
It would be great to be able to convert DocBook to LaTeX without losing this info.
Hi,
I see no response to the <?asciidoc-pagebreak?> support request for the DocBook reader; I would also benefit from this.
I am processing documents
The results are beautiful, but I must always post-process them by hand with Ctrl+Return to page-break on new chapters.
Can support for <?asciidoc-pagebreak?> be added to the XML DocBook reader?
There's no native AST element corresponding to a page break.
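Since there is no AST element to map the processing instruction to, one possible stopgap is to rewrite it before the DocBook file reaches pandoc. The Python sketch below replaces <?asciidoc-pagebreak?> with a paragraph containing only \newpage, on the assumption that the reader turns this into a paragraph a page break filter like the ones earlier in this thread can match. The function name is hypothetical, and the simpara wrapper handled here is an assumption about the generated DocBook.

```python
import re

# Match the processing instruction, optionally wrapped in its own <simpara>
# (the wrapped form is an assumption about what the converter emits).
PAGEBREAK_PI = re.compile(
    r"<simpara>\s*<\?asciidoc-pagebreak\s*\?>\s*</simpara>"
    r"|<\?asciidoc-pagebreak\s*\?>"
)

MARKER_PARA = "<para>\\newpage</para>"


def mark_pagebreaks(docbook_xml):
    """Replace each page-break processing instruction with a marker paragraph."""
    # A function is used as the replacement so that the backslash in
    # \newpage is not interpreted as a regex escape sequence.
    return PAGEBREAK_PI.sub(lambda _match: MARKER_PARA, docbook_xml)
```

The rewritten XML can then be converted with a filter that swaps the marker paragraph for the target format's page break.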
The R package rmarkdown has a good page break filter: https://github.com/rstudio/rmarkdown/blob/main/inst/rmarkdown/lua/pagebreak.lua
Pagebreaks Don't Work for Most Output Formats
I have a Markdown file that is supposed to have pagebreaks between certain sections. However, Pandoc 1.10.1 isn't honoring the \newpage or \pagebreak commands when rendering RTF, DOCX, or ODT formatted files. The commands I'm using to invoke pandoc are:
PDF Seems to Work
However, the PDF format (which requires a slightly different invocation because it doesn't respect the -t flag) seems to respect the pagebreak requests. For example: