[Question] Extract pages content excluding header and footer

maxpmaxp / pdfreader

Python API for PDF documents

MIT License


There is no such thing as a header or footer in the PDF specification. You can read more about text objects in sec. 9.4 of the specification.

Nevertheless, with pdfreader you can access the text "markdown", which contains strings, positioning, device, and other commands. You can then cut off the header and footer and extract the strings, for example with regular expressions (strings are enclosed in parentheses).

from pdfreader import SimplePDFViewer, PageDoesNotExist

with open("my_document.pdf", "rb") as fd:
    viewer = SimplePDFViewer(fd)

    try:
        while True:
            # Render the current page to populate viewer.canvas
            viewer.render()
            content = viewer.canvas.text_content
            # Process page content here
            # - cut off header/footer
            # - extract strings
            viewer.next()
    except PageDoesNotExist:
        # Raised by viewer.next() once the last page has been passed
        pass
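
The "Process page content here" step could look roughly like the sketch below. It is only an illustration, not part of pdfreader: the helper names `extract_strings` and `drop_header_footer` are hypothetical, the sample markup is made up, and it assumes the header is the first string on the page and the footer the last, which a real document does not guarantee. The regular expression covers literal strings in parentheses with backslash escapes; unescaped nested parentheses and hex strings are not handled.

```python
import re

# Matches a PDF literal string: text in parentheses, allowing backslash escapes.
# Simplification: unescaped nested parentheses and hex strings (<...>) are ignored.
LITERAL_STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def extract_strings(text_content):
    """Return the contents of every literal string found in the page markup."""
    return [match.group(1) for match in LITERAL_STRING.finditer(text_content)]


def drop_header_footer(strings, header_count=1, footer_count=1):
    """Hypothetical helper: drop the first and last N strings.

    Assumes the header strings come first and the footer strings come last
    in the content stream, which is not true for every document.
    """
    end = max(header_count, len(strings) - footer_count)
    return strings[header_count:end]


# Made-up sample in the shape of viewer.canvas.text_content
sample = (
    "BT /F1 9 Tf (Company Report) Tj ET\n"
    "BT /F1 12 Tf (First paragraph.) Tj ET\n"
    "BT /F1 12 Tf [(Second) -250 (paragraph.)] TJ ET\n"
    "BT /F1 9 Tf (Page 1) Tj ET"
)

body = drop_header_footer(extract_strings(sample))
print(body)
```

Inside the loop you would pass `content` instead of `sample`. If the header or footer spans several strings, or its position in the content stream varies, a cut based on the positioning commands in the markup is likely to be more reliable than counting strings.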

[Question] Extract pages content excluding header and footer #38