Pagination of big docx files

mwilliamson / mammoth.js

Convert Word documents (.docx files) to HTML

BSD 2-Clause "Simplified" License

Open stalniy opened 5 years ago

stalniy commented 5 years ago

It would be good to be able to convert a big docx file in chunks (a few pages at a time).

kennylbj commented 5 years ago

+1. For now, the convertToHtml function returns no page information.

pboysen commented 4 years ago

I too have a need to treat each page of a Word document as an HTML page.

After reading your code, I wonder whether this could be solved with a style rule of

"br[type='page'] => div.page:fresh"

and then split the output with

<div class="page"></div>

or whatever element you choose.

It would need an option like ignorePageBreak to change the value of ignoreElements in docx/body-reader.js. Of course, it may be more complicated than that.
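
To make the splitting step concrete, here is a minimal sketch in plain JavaScript. It assumes the conversion really does emit an empty `<div class="page"></div>` marker at each explicit page break, and that the marker sits at the top level of the output rather than inside a paragraph. `splitIntoPages` is a hypothetical helper, not part of mammoth.js.

```javascript
// Hypothetical helper (not a mammoth.js API): split converted HTML into
// per-page chunks wherever the page marker element appears.
// Assumption: the marker is emitted verbatim and never nested inside
// another element, otherwise the chunks would contain unbalanced tags.
function splitIntoPages(html, marker = '<div class="page"></div>') {
  // Empty chunks are kept on purpose so that index + 1 still matches
  // the page number when a page is blank.
  return html.split(marker).map((chunk) => chunk.trim());
}

// Sample input standing in for the HTML string produced by the conversion.
const sampleHtml =
  '<p>First page</p><div class="page"></div>' +
  '<p>Second page</p><div class="page"></div>' +
  '<p>Third page</p>';

const pages = splitIntoPages(sampleHtml);
pages.forEach((page, index) => {
  console.log(`page ${index + 1}: ${page}`);
});
```

In real use the input would be the `value` field of the result returned by `mammoth.convertToHtml`. Note that this can only follow explicit page breaks stored in the document; pages that Word creates by flowing text at layout time would not produce a marker.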

theZappr commented 1 year ago

This is definitely needed for my team :+1:

motivatedclay commented 1 year ago

This would be extremely useful for our team: having page number metadata would help GPT parse our documents properly.