mwilliamson / mammoth.js

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Finding page breaks in generated html #256

Closed cksachdev closed 3 years ago

cksachdev commented 4 years ago

I need to extract the document content page by page. In the generated HTML I cannot find any tag that marks where the content of page N ends. To demonstrate, please refer to the attached document. Structure of the document:

Generated HTML:

<h1>Heading on Page 1</h1>
<p>Sample content on Page 1</p>
<h1>Heading on Page 2</h1>
<p>Sample content on page 2</p>
<h1>Heading on Page 3</h1>

With the HTML generated above, it's difficult to tell where the content of page 1 ends. Is there a configuration setting that adds a separator or custom tag at each page boundary?

Update: While debugging mammoth, I found this function in document-to-html.js:

  function htmlPathForBreak(element) {
    var style = findStyle(element);
    if (style) {
      return style.to;
    } else if (element.breakType === "line") {
      return htmlPaths.topLevelElement("br");
    } else {
      return htmlPaths.empty;
    }
  }

It looks like I need to supply a style map when running mammoth, but it isn't clear what to put in it.
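
The precedence in that function can be seen in isolation with a small sketch. Here `findStyle`, `htmlPaths` and the `styleMap` table are hypothetical stand-ins written for illustration, not mammoth's real internals: the stub returns a mapping only for break types listed in the made-up table, and paths are plain strings.

```javascript
// Hypothetical stand-ins for mammoth internals, for illustration only.
var htmlPaths = {
  empty: "",
  topLevelElement: function (tagName) { return "<" + tagName + ">"; }
};

// Made-up style table: maps a break type to an output path.
var styleMap = { page: '<hr class="page-break">' };

function findStyle(element) {
  var to = styleMap[element.breakType];
  return to ? { to: to } : null;
}

// Same branching as the function quoted above.
function htmlPathForBreak(element) {
  var style = findStyle(element);
  if (style) {
    return style.to;                        // a matching style always wins
  } else if (element.breakType === "line") {
    return htmlPaths.topLevelElement("br"); // default for line breaks
  } else {
    return htmlPaths.empty;                 // other breaks produce nothing
  }
}

var results = ["page", "line", "column"].map(function (breakType) {
  return htmlPathForBreak({ breakType: breakType });
});
console.log(results);
```

In this sketch a break only produces output if a style matches it or if it is a line break; every other break falls through to the empty path, which would explain why nothing appears in the generated HTML.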

Update 2: I found that I can change the generated HTML with a style map, but this may not work if there are multiple headings on the same page.

custom-style-map

p[style-name^='Heading'] => h1.fresh

Running with custom-style-map

--> mammoth worddoc2.docx --style-map=custom-style-map
<h1 class="fresh">Heading on Page 1</h1><p>Sample content on Page 1</p><h1 class="fresh">Heading on Page 2</h1><p>Sample content on page 2</p><h1 class="fresh">Heading on Page 3</h1>
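
A minimal sketch of how the class added by this style map could be used to cut the output into sections. `splitOnFreshHeadings` is a hypothetical helper, not part of mammoth, and it splits on the literal `<h1 class="fresh">` string rather than parsing the HTML. A section matches a page only if every page starts with exactly one heading.

```javascript
// Hypothetical helper: split the HTML output at each <h1 class="fresh">.
function splitOnFreshHeadings(html) {
  var marker = '<h1 class="fresh">';
  var parts = html.split(marker);
  var sections = [];
  // Anything before the first heading is kept as its own section.
  if (parts[0].length > 0) {
    sections.push(parts[0]);
  }
  // Each remaining part starts right after a heading marker, so restore it.
  for (var i = 1; i < parts.length; i++) {
    sections.push(marker + parts[i]);
  }
  return sections;
}

// The output shown above, used as sample input.
var html =
  '<h1 class="fresh">Heading on Page 1</h1><p>Sample content on Page 1</p>' +
  '<h1 class="fresh">Heading on Page 2</h1><p>Sample content on page 2</p>' +
  '<h1 class="fresh">Heading on Page 3</h1>';

var sections = splitOnFreshHeadings(html);
sections.forEach(function (section, index) {
  console.log("Section " + (index + 1) + ": " + section);
});
```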

OS: macOS Catalina 10.15.6
Node: v12.18.4
Attachment: worddoc2.docx

mwilliamson commented 3 years ago

Page breaks are not currently supported, I'm afraid. Closing as a duplicate of #7.