jgm / pandoc

Universal markup converter
https://pandoc.org

Page-break in other output formats than LaTeX #1934

Open todd-a-jacobs opened 9 years ago

todd-a-jacobs commented 9 years ago

Pagebreaks Don't Work for Most Output Formats

I have a Markdown file that is supposed to have pagebreaks between certain sections. However, Pandoc 1.10.1 isn't honoring the \newpage or \pagebreak commands when rendering RTF, DOCX, or ODT formatted files. The commands I'm using to invoke pandoc are:

for format in rtf docx odt; do
    pandoc \
        --smart \
        --normalize \
        --standalone \
        --self-contained \
        -f markdown \
        -t $format \
        --output="${FILE/markdown/$format}" \
        "$FILE"
    echo "Created ${FILE/markdown/$format}"
done

PDF Seems to Work

However, the PDF format (which requires a slightly different invocation because it doesn't respect the -t flag) seems to respect the pagebreak requests. For example:

pandoc \
    --standalone \
    --normalize \
    --smart \
    --self-contained \
    --from=markdown \
    --output="${FILE/markdown/pdf}" \
    "$FILE"
echo "Created ${FILE/markdown/pdf}"

jgm commented 9 years ago

Correct, pandoc's internal document model does not currently contain anything corresponding to a page break, so there is no way to convert these. In principle a PageBreak element could be added. It's also possible to work around this deficiency using pandoc filters.

todd-a-jacobs commented 9 years ago

A PageBreak element would be great, but I'd be happy to use a filter in the meantime. However, I'm not sure what's entailed in doing so. How would I generate a DOCX with forced page breaks using a filtering mechanism?

jkr commented 9 years ago

@CodeGnome : see this thread for some hints on setting up a filter for pagebreaks in docx output:

https://groups.google.com/forum/#!searchin/pandoc-discuss/pagebreak/pandoc-discuss/FzLrhk0vVbU/GtSHaI0jddAJ

s7726 commented 9 years ago

@CodeGnome If your page breaks happen to be prior to a given heading level, you can just set the page break before property for that heading style.

Hi-Angel commented 8 years ago

I am also voting for this feature to be added: many formats have something corresponding to a page break (even CSS has properties like page-break-*).

Hi-Angel commented 8 years ago

Hi, I'm just looking through the code in the hope of adding the pagebreak and some other features, and I found, well… Has @jgm noticed the two-year-old pull request?

jgm commented 8 years ago

Adding a NewPage element to the definition and builder is trivial. But then you need to support it in every reader and writer; that's a lot more work.

oadam commented 7 years ago

If a pull request adding support for NewPage were submitted (including support in every reader and writer), would it be accepted? I really need this feature and I'm ready to spend time on this.

jgm commented 7 years ago

Yes, I'd accept it if it's of good quality.

Note, it requires a breaking change in pandoc-types. I'd like to make a new release soon of pandoc-types (which already has breaking changes) and pandoc. If you plan to do this soon I could wait a bit.

How do you propose to treat output formats with nothing corresponding to a page break?

Would it make sense, perhaps, to render it as a

Div ("",["pagebreak"],[]) []

which could at least be intercepted in filters? This could even be a native pandoc way of creating it.
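
A minimal sketch of how a filter might intercept such a placeholder, assuming it arrives as an empty Div with the class pagebreak as proposed above. It works on pandoc's JSON representation using only the Python standard library; the function names and the format table are illustrative assumptions, not part of pandoc, and only top-level blocks are handled.

```python
import json

# Raw markup to emit per output format. This table is an illustrative
# assumption; for formats not listed, the placeholder Div is left untouched.
PAGEBREAKS = {
    "docx": ("openxml", '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'),
    "latex": ("latex", "\\newpage"),
}


def is_pagebreak_div(block):
    """True for an empty Div whose classes include "pagebreak"."""
    if block.get("t") != "Div":
        return False
    (_ident, classes, _attrs), content = block["c"]
    return "pagebreak" in classes and content == []


def replace_pagebreaks(blocks, target):
    """Swap top-level placeholder Divs for raw page breaks in `target` format."""
    if target not in PAGEBREAKS:
        return list(blocks)
    raw_format, markup = PAGEBREAKS[target]
    raw = {"t": "RawBlock", "c": [raw_format, markup]}
    return [raw if is_pagebreak_div(b) else b for b in blocks]


def run_filter(instream, outstream, target):
    """Read a pandoc JSON document, rewrite its blocks, and write it back."""
    doc = json.load(instream)
    doc["blocks"] = replace_pagebreaks(doc["blocks"], target)
    json.dump(doc, outstream)
```

Called as `run_filter(sys.stdin, sys.stdout, sys.argv[1])`, this could be passed to pandoc with `--filter`, since pandoc hands JSON filters the target format as their first argument.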

oadam commented 7 years ago

I'll follow whatever recommendation you give :-)

If your code snippet means an empty div with a pagebreak CSS class, then yes, that might be a good idea (it could be parsed by the HTML reader as well).

Maybe the writer could even add an inline style attribute with page-break-after: always?

No need to wait for this before pushing your breaking change. To be honest, I won't look into it before at least a few weeks but it's definitely something that is on my business' road-map.

s7726 commented 7 years ago

Putting a class on an empty div won't work (or at least be portable).

http://www.w3schools.com/cssref/pr_print_pageba.asp

Note: You cannot use this property on an empty <div> or on absolutely positioned elements.

I recently found the page-break-inside: avoid declaration. I applied it to <li>'s that contained figures that needed to stay with that particular step in a procedure.

    tarleb commented 7 years ago

    MDN states on page-break-before (emphasis mine):

    It won't apply on an empty <div> that won't generate a box.

    I guess with a little bit of CSS hackery, the div could still be made to generate a box.

    jgm commented 7 years ago

    OK, that's good to know. So implementing a page break in the HTML writer might be nontrivial...but it's also not really essential -- I think it would be okay if we just supported formats that typically produce paginated output (latex, docx, etc.).

    Jmuccigr commented 7 years ago

    Would definitely like to see this.

    And really would like to see printed html handle this too, but that's probably out of scope for pandoc.

    mb21 commented 7 years ago

    Some observations on how different formats handle page breaks:

    From the perspective of HTML/CSS, page breaking is about layout, not structure, and is thus implemented in CSS (with the page-break-before and page-break-after properties, as supported by wkhtmltopdf – note that they might be superseded by break-before and break-after but browser support is not forthcoming). As has been noted, these can only be applied to block level elements and the intended usage is to apply them to headers or section divs.

    In some reStructuredText processors, a pagebreak can apparently also be achieved by a block level directive.

    On the other hand, in more imperative document models (ODT, docx, etc), pagebreak usually seems to be an inline element. The pandoc AST already has inline LineBreak and SoftBreak elements and one possible implementation would be to replace them with an inline Break element that has an attribute type=line, type=soft, type=page, type=column etc. Note that implementing a native pandoc pagebreak element as inline is more general than a block element, since the block element can always be simulated by wrapping an inline in an otherwise empty paragraph.

    Finally, from the perspective of markdown, I would probably use something like this:

    ------- {.pagebreak}

    fabtho commented 7 years ago

    I would like to see this implemented. I just tried to write a filter for pandoc to add page breaks when converting Markdown to ODT, but had no success. (I used the source on Google Groups, as mentioned above.)

    link2xt commented 7 years ago

    Muse format also has pagebreaks: http://amusewiki.org/library/manual#toc7

    mb21 commented 6 years ago

    btw, iA Writer pagebreak syntax is:

    +++

    which produces:

    <div style="page-break-before: always;"></div>

    which webkit-based browsers seem to understand.

    autotel commented 5 years ago

    another nice workaround:

    grenade commented 5 years ago

    @CodeGnome : see this thread for some hints on setting up a filter for pagebreaks in docx output:

    https://groups.google.com/forum/#!searchin/pandoc-discuss/pagebreak/pandoc-discuss/FzLrhk0vVbU/GtSHaI0jddAJ

    thanks for this! i went down this rabbit hole today. it was my first foray into haskell and i'm pleased to say that i am now standing next to a completely bald yak¹. here's what happened:

    the problem:

    i have a github gist containing markdown files. i have a react app that transforms these markdown files into an html web page. i wanted a way to transform the same markdown files into a hosted google doc that has built in docx and pdf output formats.

    the solution:

    write some bash that combines all of the gist's markdown files into a single markdown file and use pandoc to transform the markdown into docx format that can be uploaded as a google doc.

    the implementation:

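    {-# LANGUAGE OverloadedStrings #-}
    -- Pandoc JSON filter: turn \newpage into a docx page break.
    import Text.Pandoc.JSON

    -- A paragraph containing only a page break, as raw WordprocessingML.
    pagebreakBlock :: Block
    pagebreakBlock =
      RawBlock (Format "openxml") "<w:p><w:r><w:br w:type=\"page\"/></w:r></w:p>"

    blockSwapper :: Block -> Block
    -- \newpage kept as literal text (e.g. with the raw_tex extension disabled)
    blockSwapper (Para [Str "\\newpage"]) = pagebreakBlock
    -- \newpage parsed as a raw TeX block (format name assumed to be tex or latex)
    blockSwapper (RawBlock (Format f) "\\newpage")
      | f `elem` ["tex", "latex"] = pagebreakBlock
    blockSwapper blk = blk

    main :: IO ()
    main = toJSONFilter blockSwapper

    tarleb commented 4 years ago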

    The Lua filters repository has a pagebreak filter which converts raw \newpage commands into page breaks for most formats.

    ghost commented 4 years ago

    I wanted to note that Epub3 supports page breaks as well, although for possibly different use cases.

    A page list and page break indicators allow users in mixed print-digital environments to coordinate their positions.

    This is nice for preserving information about page numbers (e.g. for citations, printing, or accessibility features such as audio cues) without interfering with the document layout.

    It supports both in-line and block page breaks.

    An empty span element identifies a page break inside a block element. It is identified as a page break using the role attribute with the value doc-pagebreak. The aria-label attribute provides an announceable value.

    <p>
    …
    <span role="doc-pagebreak" id="pg24" aria-label="24"/>
    …
    </p>

    A div element identifies a page break where inline elements are not allowed. This example shows an example of a page number that is intended to be visible in the content.

    <div role="doc-pagebreak" id="pg24">24</div>

    Some notes:

    My personal preference is for formfeed chars to be interpreted as page breaks, at least in markdown. I use the pdftotext CLI to produce formfeed-delimited text files that can be turned into markdown for pandoc, and it would be great if those could be preserved.
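
Until form feeds are interpreted natively, one option is to preprocess the text before handing it to pandoc. The sketch below assumes each form feed (U+000C) marks a page boundary and replaces it with a \newpage paragraph, which LaTeX output or the pagebreak Lua filter mentioned earlier in the thread could then pick up. The function name and the default marker are illustrative choices.

```python
def formfeeds_to_newpage(text, marker="\\newpage"):
    """Replace each form feed (U+000C) with a page-break marker paragraph."""
    # Split into pages and trim the newlines that surround each boundary.
    pages = [page.strip("\n") for page in text.split("\f")]
    # Drop the empty trailing page that appears when the input ends
    # with a form feed.
    if pages and pages[-1] == "":
        pages.pop()
    # Blank lines around the marker keep it in a paragraph of its own.
    return ("\n\n" + marker + "\n\n").join(pages) + "\n"
```

Passing a different `marker`, for example the `------- {.pagebreak}` rule suggested above, would target whichever convention the downstream filter expects.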

    jeffmcneill commented 4 years ago

    This might be somewhat related. Pagebreaks seem to be automatically supported in markdown->pdf in terms of H1s being recognized as new section headers, using:

      \usepackage{titlesec} 
      \newcommand{\sectionbreak}{\clearpage} 

    Also, when markdown->epub the same section headers H1 are recognized and page breaks are implemented. All fine and dandy.

    I'm wondering if it is possible somehow to have H2s recognized as section breaks as well. The main reason is because I need to have both H1 and H2 act as section breaks (page breaks).


    Ok, I've worked through these issues, and here is how I've dealt with them, so far: I've added \pagebreak before each new H2, that takes care of the latex/pdf side. For epub, I added the style:

    h2 {display: block;
        page-break-before: always; /* CSS 2 */
        break-before: page;   /* CSS 3+ */ }

    That seems to take care of the epub side.

    If anyone has additional suggestions/options especially for the latex/pdf side, that would be great, but otherwise I've got it working.

    jgm commented 4 years ago

    Try the same thing with \subsectionbreak?

    jeffmcneill commented 4 years ago

    @jgm Excellent! It also suppresses the page break if an H2 directly follows an H1, which is what I want. I can't seem to do that with EPUB/CSS, but an extra page in an ebook is less of an issue, whereas one has to pay for each page in print.

      \usepackage{titlesec} 
      \newcommand{\sectionbreak}{\clearpage} 
      \newcommand{\subsectionbreak}{\clearpage} 

    Here is documentation of the various section commands that can be used with package titlesec. http://tug.ctan.org/tex-archive/macros/latex/contrib/titlesec/titlesec.pdf
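
For docx output, where titlesec does not apply, the same behaviour could be approximated in a filter. This is a sketch under stated assumptions: it operates on the top-level block list of pandoc's JSON AST, inserts a raw OpenXML page break before each H1 and H2, and skips the break when the header directly follows another header or is the first block. The names are hypothetical, and the function would still need to be wired into a JSON filter that reads the document from stdin and writes it back.

```python
# Raw WordprocessingML paragraph containing only a page break.
OPENXML_PAGEBREAK = {
    "t": "RawBlock",
    "c": ["openxml", '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'],
}


def header_level(block):
    """Return the level of a Header block, or None for any other block."""
    return block["c"][0] if block.get("t") == "Header" else None


def break_before_headers(blocks, levels=(1, 2)):
    """Insert a page break before each top-level header whose level is in `levels`.

    No break is added before the first block, or when the header directly
    follows another header (so an H2 right after an H1 stays on its page).
    """
    out = []
    for block in blocks:
        prev = out[-1] if out else None
        follows_header = prev is not None and header_level(prev) is not None
        if header_level(block) in levels and prev is not None and not follows_header:
            out.append(OPENXML_PAGEBREAK)
        out.append(block)
    return out
```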

    SandeepNaidu commented 4 years ago

    This still does not work for pandoc export to docx!

    gmile commented 4 years ago

    Had to introduce page breaks to html files that are being converted to .docx, ended up with this script in Lua:

    function Para (el)
      if #el.content == 1 and el.content[1].text == "Pagebreak" then
        return pandoc.RawBlock('openxml', '<w:p><w:r><w:br w:type="page"/></w:r></w:p>')
      end
    end
    
    return {
      {Para = Para}
    }

    Given the following input:

    <html>
      <body>
        <p>Page 1</p>
        <p>Pagebreak</p>
        <p>Page 2</p>
        <p>Pagebreak</p>
        <p>Page 3</p>
      </body>
    </html>

    It can be used like this:

    pandoc input.html \
      --standalone \
      --lua-filter pagebreak.lua \
      --reference-doc my_styles.docx \
      --output output.docx

    tarikgraba commented 3 years ago

    Hi there,

    Can support for <?asciidoc-pagebreak?> be added to the DocBook XML reader? This processing instruction is generated by asciidoctor/asciidoc when inserting a page break.

    It would be great to be able to convert DocBook to LaTeX without losing this info.

    dwojtas commented 2 years ago

    Hi, I see no response to the <?asciidoc-pagebreak?> support request for the docbook reader, I would also benefit from this. I am processing documents

    The results are beautiful, but I always have to post-process them by hand, using Ctrl+Return to insert a page break at each new chapter.

    jgm commented 2 years ago

    Can support for <?asciidoc-pagebreak?> be added to the DocBook XML reader?

    There's no native AST element corresponding to a page break.

    leogama commented 2 years ago

    The R package rmarkdown has a good page break filter: https://github.com/rstudio/rmarkdown/blob/main/inst/rmarkdown/lua/pagebreak.lua