mwilliamson / mammoth.js

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Support page breaks #7

Open JohnMcLear opened 10 years ago

JohnMcLear commented 10 years ago

Any docs for how to support page breaks?

mwilliamson commented 10 years ago

What sort of behaviour would you expect? Page breaks strike me as being an artefact of printing on paper, which doesn't really apply when translating to HTML. Open to suggestions though.

JohnMcLear commented 10 years ago

To be honest just whacking in a <span class='pageBreak'></span> would be fine for me.

I'd expect you would want me to use a custom style rule for this; if that's the case that's fine, just lemme know which style map key to use :)

I use page-break-after:always;page-break-inside:avoid;-webkit-region-break-inside: avoid; to generate the actual page breaks in Etherpad.

MCTaylor17 commented 8 years ago

I don't use it, but I happen to know that Dreamweaver would wrap content in <div> tags to mark section-breaks (Page Layout > Page Setup > Breaks). I wouldn't be a fan of this approach as I sometimes use parent-child selectors in my CSS.

I'm not sure that I like the idea of adding classes to the output, @JohnMcLear.

How about adding a simple <hr/> tag?

mwilliamson commented 8 years ago

The suggestion of using a custom style mapping is the approach that seems best to me. That way, by default we do nothing, but the user can customise the behaviour to whatever HTML they want.

swapnil-bawkar commented 7 years ago

Can you give me an example of how to write a style map that turns page breaks into hr tags?

mwilliamson commented 7 years ago

Page breaks aren't supported at the moment. There's some code to handle them, but that likely requires some more work.

For the technical detail: one way that Word encodes page breaks is as an element within a paragraph. As it works right now, that would result in hr tags within p elements, which likely isn't the desired behaviour. Lifting the breaks up to the top level is likely to give better results.

jkorff commented 5 years ago

I have a use case where customers – wrongly – insert page breaks at the end of pages, and I need to replace them with a space. For that reason it would be good to have a style mapping available that captures (manual) page breaks.

pirtlj commented 1 year ago

Having the page breaks would be nice for translating into other formats or processing the output HTML.

jerefrer commented 8 months ago

Hi there, and thank you for this awesome lib :)

I'm using mammoth to turn a structured (with specific styles) .docx file into HTML, do some tweaks on it and then use PagedJS to turn it into a PDF to be printed.

In this case the output is in fact paper again, so page breaks do matter.

Could you please consider supporting page breaks?

If you have never stumbled upon this, there is a whole open-source movement (the Coko Foundation) advocating for using HTML as the single source for publishing books and journal papers, using the CSS Paged Media standard to define the layout of the PDF output. This standard hasn't been implemented yet by any of the major browsers, so they built PagedJS, which is in essence a glorified polyfill for this standard. It is already used in production by many publishing houses, and was recently used to produce both a book and a webapp for the Louvre in Paris from the same HTML source.

mwilliamson commented 8 months ago

As above, the problem is that it's not obvious (to me, at least!) what the expected behaviour would be, given a page break can occur in the middle of a paragraph.

If you can provide a minimal example document and the expected HTML (especially with mid-paragraph page breaks), then that would help.

jerefrer commented 8 months ago

Here I meant only manual page breaks; it didn't even occur to me that one would want to know about automatic page breaks, when text naturally overflows a page and continues on the next one :)

In the case of manual page breaks, is that already possible? For me it could be either a separate tag or a way to apply a specific CSS class to the first element after the page break. If there is already a way to do this, maybe adding it to the doc wouldn't hurt :)

mwilliamson commented 8 months ago

There's some support for breaks, but it is intentionally undocumented since it's still subject to change.

Could you provide a minimal example document and the expected HTML?

jerefrer commented 8 months ago

Alright so here's a very simple example .docx file: example.docx

What I'd like to get back would be either this:

<p>This content is on page one.</p>
<hr>
<p>This one on page two.</p>
<p><em>And it has</em></p>
<h1>Some more content to it</h1>
<h2>With a few styles.</h2>
<hr>
<p>This is page three.</p>

or something like that:

<p>This content is on page one.</p>
<p class="break-before">This one on page two.</p>
<p><em>And it has</em></p>
<h1>Some more content to it</h1>
<h2>With a few styles.</h2>
<hr>
<p class="break-before">This is page three.</p>

mwilliamson commented 8 months ago

I think you can already use a style map along the lines of:

br[type='page'] => hr

to get what you want, but be warned that the exact syntax and behaviour might change in the future!
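
A minimal sketch of how that style map might be passed to mammoth, assuming the library's documented `convertToHtml` call and its `styleMap` option. The function name `convertWithPageBreaks` and the `pageBreakOptions` object are illustrative, not part of mammoth, and the `br[type='page']` matcher itself is undocumented, as noted above.

```javascript
// Style map taken from the comment above. The br[type='page'] matcher is
// undocumented, so its syntax and behaviour may change between releases.
const pageBreakOptions = {
  styleMap: ["br[type='page'] => hr"],
};

// Hypothetical wrapper: `mammoth` is passed in so the sketch does not assume
// how the library is loaded, e.g. const mammoth = require("mammoth");
async function convertWithPageBreaks(mammoth, docxPath) {
  const result = await mammoth.convertToHtml({ path: docxPath }, pageBreakOptions);
  // result.value is the generated HTML; result.messages holds any warnings.
  return { html: result.value, messages: result.messages };
}
```

It could then be called as `convertWithPageBreaks(require("mammoth"), "example.docx")`, with the returned promise resolving to the HTML and any conversion messages.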

jerefrer commented 8 months ago

It's working 🎉 If it starts breaking one day I'll know where to look :) Thanks!
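
The `hr` style map yields the first of the two outputs requested earlier in the thread. For the second variant (a `break-before` class on the first element after each break), one option is to post-process the generated HTML. The following is only a sketch: `hrToBreakBefore` is a hypothetical helper, not part of mammoth, and it assumes every `hr` in the output comes from a page break and that the following element has no `class` attribute of its own.

```javascript
// Hypothetical helper (not part of mammoth): removes each <hr> that is directly
// followed by an opening tag and adds class="break-before" to that tag instead.
// An <hr> followed by a closing tag (e.g. a break left inside a paragraph)
// is left untouched.
function hrToBreakBefore(html) {
  return html.replace(
    /<hr\s*\/?>\s*<([a-z][a-z0-9]*)(\s|>)/gi,
    '<$1 class="break-before"$2'
  );
}
```

Because this is a plain string replacement, a real horizontal rule in the source document would be converted as well. A proper HTML parser would be the safer choice for anything beyond simple output.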