FedericoNembrini closed 4 years ago
The PDF specification has no such thing as a header or footer. You can read more about text objects in sec. 9.4 of the specification.
Nevertheless, with pdfreader you can access the page's text "markdown", which contains strings, positioning, device and other commands. You can then cut off the header and footer yourself and extract the strings, for example with regular expressions (strings are enclosed in parentheses).
from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open("my_document.pdf", "rb")
viewer = SimplePDFViewer(fd)

try:
    while True:
        viewer.render()
        content = viewer.canvas.text_content
        # Process page content here
        # - cut off header/footer
        # - extract strings
        viewer.next()
except PageDoesNotExist:
    # No more pages to render
    pass
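The "process page content" step above could look roughly like the sketch below. It is only an illustration, not part of pdfreader: `extract_strings` and `strip_repeated_edges` are hypothetical helper names, the regular expression assumes literal strings without unescaped nested parentheses (hex strings in `<...>` are ignored), and the header/footer heuristic assumes the header and footer strings are identical on every page of the same document.

```python
import re

# Matches a PDF literal string "( ... )", allowing backslash escapes such as \( and \).
# Assumption: no unescaped nested parentheses; hex strings (<...>) are not handled.
LITERAL_STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def extract_strings(text_content):
    """Return the literal strings found in one page's text content, in order.

    Escape sequences are returned as-is (not unescaped).
    """
    return [m.group(1) for m in LITERAL_STRING.finditer(text_content)]


def strip_repeated_edges(pages):
    """Drop leading and trailing strings that are identical on every page.

    `pages` is a list of string lists, one list per page. This is a heuristic:
    it treats whatever repeats at the start or end of every page as
    header or footer text.
    """
    if len(pages) < 2:
        return [list(p) for p in pages]
    shortest = min(len(p) for p in pages)
    head = 0
    while head < shortest and all(p[head] == pages[0][head] for p in pages):
        head += 1
    tail = 0
    while tail < shortest - head and all(
        p[-1 - tail] == pages[0][-1 - tail] for p in pages
    ):
        tail += 1
    return [p[head:len(p) - tail] for p in pages]
```

In the loop above you would collect `extract_strings(viewer.canvas.text_content)` for each page into a list, then pass that list to `strip_repeated_edges` once all pages have been read. Because the comparison is done per document, it does not need to know the header text in advance, but it will miss headers that vary between pages (for example a running page number).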
Is it possible to extract the content of a page without the document header and footer text?
In my case, I want to extract all of the document text, but when I append the second page's text to the first page's text, it is interrupted by the header text. Because the header can change from file to file, I can't add a rule to the code to remove this text.
As far as you know, is this possible?