opengovsg / pdf2md

A PDF to Markdown converter
https://www.npmjs.com/package/@opendocsg/pdf2md
MIT License
210 stars, 40 forks

Detect and ignore page numbers #10

Closed LoneRifle closed 5 years ago

LoneRifle commented 5 years ago

Self-explanatory: page numbers are not needed in Markdown, since such files are meant to be read online rather than as formal, paginated documents. Detect such numbers and strip them from the final output.

jenlky commented 5 years ago

To be fixed by #37

jenlky commented 5 years ago

Let me get this right. The idea is to filter textContent.items and drop any item that looks like a page number, matching patterns such as "page 2 of 5", "- 5 -", or a bare "5"? Would that suffice?
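
For illustration, the content-based filter proposed above might look roughly like the sketch below. It is not the code from any of the PRs in this thread. It assumes each entry of textContent.items exposes its text as a `str` property (as pdf.js text items do), and the names `PAGE_NUMBER_PATTERNS`, `looksLikePageNumber` and `removePageNumbers` are hypothetical.

```javascript
// Hypothetical patterns for the three shapes mentioned above.
// These are illustrative and are not taken from pdf2md itself.
const PAGE_NUMBER_PATTERNS = [
  /^page\s+\d+(\s+of\s+\d+)?$/i, // "page 2" or "page 2 of 5"
  /^-\s*\d+\s*-$/,               // "- 5 -"
  /^\d+$/,                       // a bare "5"
];

// True when the whole (trimmed) string matches one of the patterns.
function looksLikePageNumber(str) {
  const text = str.trim();
  return PAGE_NUMBER_PATTERNS.some((pattern) => pattern.test(text));
}

// Keep only the items whose text does not look like a page number.
function removePageNumbers(items) {
  return items.filter((item) => !looksLikePageNumber(item.str));
}

// Usage with stand-in items (real items carry more fields than `str`).
const items = [
  { str: "Introduction" },
  { str: "page 2 of 5" },
  { str: "- 5 -" },
  { str: "5" },
  { str: "5 apples" },
];
console.log(removePageNumbers(items).map((item) => item.str));
// prints [ 'Introduction', '5 apples' ]
```

Note that the bare-digit rule also drops legitimate content, such as a table cell that contains only a number, so text matching alone is likely to be too aggressive.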

LoneRifle commented 5 years ago

It would also be helpful to consider not just the content of the item, but also its position on the page.
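
One way to read this suggestion, as a rough sketch only: treat an item as a page number when its text looks numeric and it sits in the top or bottom margin of the page. The sketch assumes pdf.js-style items, where `transform[5]` is the item's vertical position in PDF units measured from the bottom of an unrotated page. The page height is passed in by the caller, and the 10% margin, `NUMBER_LIKE`, `isInMargin` and `removeMarginPageNumbers` are all assumptions made for this example.

```javascript
// Hypothetical pattern covering "page 2 of 5", "- 5 -" and a bare "5".
const NUMBER_LIKE = /^(page\s+\d+(\s+of\s+\d+)?|-\s*\d+\s*-|\d+)$/i;

// True when the item's baseline falls inside the top or bottom margin.
// Assumes an unrotated page with the origin at the bottom-left corner.
function isInMargin(item, pageHeight, marginRatio = 0.1) {
  const y = item.transform[5];
  const margin = pageHeight * marginRatio;
  return y <= margin || y >= pageHeight - margin;
}

// Drop an item only when BOTH its text and its position suggest a page number.
function removeMarginPageNumbers(items, pageHeight, marginRatio = 0.1) {
  return items.filter(
    (item) =>
      !(NUMBER_LIKE.test(item.str.trim()) &&
        isInMargin(item, pageHeight, marginRatio))
  );
}

// Usage with stand-in items on a 792-unit-tall page (US Letter).
const pageHeight = 792;
const sample = [
  { str: "5", transform: [1, 0, 0, 1, 300, 30] },     // footer: dropped
  { str: "5", transform: [1, 0, 0, 1, 200, 400] },    // body: kept
  { str: "Total", transform: [1, 0, 0, 1, 100, 400] } // body text: kept
];
console.log(removeMarginPageNumbers(sample, pageHeight).map((i) => i.str));
// prints [ '5', 'Total' ]
```

The margin ratio is a tunable guess. Documents with tall headers or footers may need a larger value, and rotated pages would need a different position check.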

LoneRifle commented 5 years ago

#38 resulted in a regression. If someone could rework the PR based on the feedback on #39, we could reinstate this feature.

jenlky commented 5 years ago

I will do it. Let me take a look.

LoneRifle commented 5 years ago

Fixed by #41