maxpmaxp / pdfreader

Python API for PDF documents
MIT License

[Question] Extract pages content excluding header and footer #38

Closed FedericoNembrini closed 4 years ago

FedericoNembrini commented 4 years ago

Is it possible to extract the content of a page without the document's header and footer text?

In my case, I want to extract all of the document's text, but when I append the second page's text to the first page's text, the result is interrupted by the header text. Because the header can differ from file to file, I can't add a rule to the code to remove it.

As far as you know, is this possible?

maxpmaxp commented 4 years ago

There is no such thing as a header or footer in the PDF specification. You can read more about text objects in sec. 9.4.

Nevertheless, with pdfreader you can access the text "markdown", which contains strings, positioning, device and other commands. You can then cut off the header and footer and extract the strings, for example with regular expressions (strings are enclosed in parentheses).

from pdfreader import SimplePDFViewer, PageDoesNotExist

# Open in binary mode; the context manager closes the file when done.
with open("my_document.pdf", "rb") as fd:
    viewer = SimplePDFViewer(fd)
    try:
        while True:
            viewer.render()
            content = viewer.canvas.text_content
            # Process page content here:
            # - cut off header/footer
            # - extract strings
            viewer.next()
    except PageDoesNotExist:
        # No more pages left to visit
        pass
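
The "cut off header/footer" and "extract strings" steps could look roughly like the sketch below. It is only an illustration under several assumptions: that the text content consists of BT ... ET text objects, that each object is placed with an absolute `Tm` text matrix (relative `Td`/`TD` positioning is not handled), and that the header and footer sit outside a vertical band you choose yourself. The names `body_strings`, `y_min`, `y_max` and the `sample` text are hypothetical and not part of pdfreader.

```python
import re

# One BT ... ET text object (non-greedy, spanning lines).
# Assumption: "BT"/"ET" never occur as words inside a string literal.
TEXT_OBJECT = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# The six numeric operands of a Tm operator; the sixth is the y offset.
TEXT_MATRIX = re.compile(r"((?:[-+]?(?:\d+\.?\d*|\.\d+)\s+){6})Tm\b")
# A literal string in parentheses, allowing backslash escapes.
# Unescaped nested parentheses are not handled in this sketch.
LITERAL_STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def body_strings(text_content, y_min, y_max):
    """Return strings from text objects whose Tm y offset is in [y_min, y_max]."""
    kept = []
    for match in TEXT_OBJECT.finditer(text_content):
        block = match.group(1)
        matrix = TEXT_MATRIX.search(block)
        if matrix is not None:
            y = float(matrix.group(1).split()[5])
            if not (y_min <= y <= y_max):
                continue  # outside the band: treated as header or footer
        # Objects without a Tm are kept, so no body text is lost silently.
        kept.extend(LITERAL_STRING.findall(block))
    return kept


# Hypothetical text content for one page, for illustration only.
sample = """BT
/F1 10 Tf
1 0 0 1 72 770 Tm
(Company Confidential) Tj
ET
BT
/F1 12 Tf
1 0 0 1 72 700 Tm
(First paragraph of the body.) Tj
ET
BT
/F1 12 Tf
1 0 0 1 72 680 Tm
[(Second) -250 (line.)] TJ
ET
BT
/F1 10 Tf
1 0 0 1 300 30 Tm
(Page 1) Tj
ET
"""

print(body_strings(sample, y_min=50, y_max=750))
```

Inside the page loop you would pass `viewer.canvas.text_content` instead of `sample`. The band limits are in page units and have to be tuned per document family, since the header position is not something the PDF itself marks.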