I need to extract the document content page by page. In the generated HTML I cannot find any tag that tells me where the content of page N ends. To demonstrate, please refer to the attached document.
Structure of the document:
Heading with Heading 1 style
Text below it (the same structure is repeated on all 3 pages; page 3 intentionally has no text)
Generated HTML:
<h1>Heading on Page 1</h1>
<p>Sample content on Page 1</p>
<h1>Heading on Page 2</h1>
<p>Sample content on page 2</p>
<h1>Heading on Page 3</h1>
With the HTML generated above, it is difficult to tell where the content of page 1 ends. Is there a configuration setting that adds a separator or custom tag to mark page boundaries?
Update
While debugging mammoth, I found this function in document-to-html.js:
function htmlPathForBreak(element) {
    var style = findStyle(element);
    if (style) {
        return style.to;
    } else if (element.breakType === "line") {
        return htmlPaths.topLevelElement("br");
    } else {
        return htmlPaths.empty;
    }
}
It looks like I need to supply a style map when running mammoth, but it isn't clear what the map should contain.
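A minimal sketch of one style map entry that might help here, assuming the break matchers documented in mammoth's README: breaks can be matched by type, so a page break could be mapped to a marker element instead of being dropped by the `htmlPaths.empty` branch. The class name `page-break` is made up for illustration. This can only match explicit page breaks stored in the .docx, not page boundaries that Word computes at layout time, so it may not fire for every page.

```
br[type='page'] => hr.page-break
```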
Update 2
I found that I can change the generated HTML with a style map, but this approach may not work if there are multiple headings on the same page.
custom-style-map
p[style-name^='Heading'] => h1.fresh
Running with custom-style-map
--> mammoth worddoc2.docx --style-map=custom-style-map
<h1 class="fresh">Heading on Page 1</h1><p>Sample content on Page 1</p><h1 class="fresh">Heading on Page 2</h1><p>Sample content on page 2</p><h1 class="fresh">Heading on Page 3</h1>
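The heading-based approach can be sketched as follows, with the caveat already noted: it assumes each page starts with exactly one `h1.fresh`, so it breaks down as soon as a page has several headings. The function name `splitOnFreshHeadings` is hypothetical, and the input string is simply the output shown above.

```javascript
// Split mammoth's HTML output into chunks, one per <h1 class="fresh"> marker.
// Assumption: every page begins with exactly one such heading.
function splitOnFreshHeadings(html) {
  // The lookahead splits *before* each marker, so the heading tag
  // stays at the start of its chunk instead of being consumed.
  return html
    .split(/(?=<h1 class="fresh">)/)
    .filter((chunk) => chunk.trim() !== "");
}

const html =
  '<h1 class="fresh">Heading on Page 1</h1><p>Sample content on Page 1</p>' +
  '<h1 class="fresh">Heading on Page 2</h1><p>Sample content on page 2</p>' +
  '<h1 class="fresh">Heading on Page 3</h1>';

const pages = splitOnFreshHeadings(html);
pages.forEach((page, i) => console.log(`Page ${i + 1}: ${page}`));
```

Each element of `pages` is a raw HTML fragment; the third one contains only the heading, matching the intentionally empty page 3.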
OS: macOS Catalina 10.15.6
Node: v12.18.4
Attached document: worddoc2.docx