maxpmaxp / pdfreader

Python API for PDF documents
MIT License

Extract Images and Text. #55

Closed SuperPauly closed 3 years ago

SuperPauly commented 3 years ago

So far I can do both using PDFDocument and SimplePDFViewer. The problem I have is working out whether I should load a PDF into a PDFDocument to get the images or into a SimplePDFViewer to get the text.

What I need to do is load it once to see whether I need an image extraction or a text extraction tool, and then load the appropriate object. But loading it once only to load it again seems like extra work. If the page being viewed has both text and images, it means I'm going to have to load the document 3 times.

Is there a way to get the images and text from a document while also keeping the 'Special PDF parameters' out of the text? I'm aware of the to_Pillow() method, but again I can only use that with PDFDocument and not with SimplePDFViewer, which is what I extract the text with.

maxpmaxp commented 3 years ago

@Red-Fibre-Phoenix First of all, you don't need to load the doc 2 or 3 times. One load is enough; then use either object navigation with PDFDocument or page navigation with SimplePDFViewer.

Then, you can use SimplePDFViewer to extract both images and text. The code below extracts all text and images from a PDF file:

from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open("somecoolfile.pdf", "rb")
viewer = SimplePDFViewer(fd)

images = []
strings = []
try:
    while True:
        # render current page
        viewer.render()
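        # images embedded directly in the page content
        images.extend(viewer.canvas.inline_images)
        # images referenced by the page, stored in a dict keyed by name
        images.extend(viewer.canvas.images.values())
        strings.extend(viewer.canvas.strings)
        # go to the next page
        viewer.next()
except PageDoesNotExist:
    # no more pages: stop the loop
    pass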

text = "".join(strings)
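If some pages hold only text, some only images, and some both, it may help to keep the results per page instead of in two flat lists. The following is a minimal sketch, not part of pdfreader: `PageContent` and `collect_page` are hypothetical names, and the only assumption about the library is that the canvas exposes `strings`, `inline_images` and `images` exactly as the snippet above uses them.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class PageContent:
    """Text and images gathered from one rendered page (hypothetical helper)."""
    number: int
    strings: List[str] = field(default_factory=list)
    images: List[Any] = field(default_factory=list)

    @property
    def kind(self) -> str:
        """Classify the page by what was found on it."""
        if self.strings and self.images:
            return "text+images"
        if self.images:
            return "images"
        if self.strings:
            return "text"
        return "empty"


def collect_page(number: int, canvas: Any) -> PageContent:
    """Copy the canvas attributes read in the snippet above into a PageContent."""
    # inline images come as a sequence, named images as a dict
    images = list(canvas.inline_images) + list(canvas.images.values())
    return PageContent(number, list(canvas.strings), images)
```

Inside the loop, right after `viewer.render()`, something like `pages.append(collect_page(len(pages) + 1, viewer.canvas))` would record each page in the same single pass, and `page.kind` then tells you which kind of handling that page needs.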

So what is the difference between PDFDocument and SimplePDFViewer?

  1. PDFDocument gives you access to the raw document: file structure, document structure, objects, content streams. It knows nothing about how to "render" content on a page. It's a low-level tool.

  2. SimplePDFViewer is a high-level tool that takes care of rendering text and images for a page (decoding them properly and so on).
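Since the viewer takes care of decoding, the image objects collected from its canvas can then be written to disk. Below is a minimal sketch, assuming each collected image exposes the `to_Pillow()` method mentioned in the question and that it returns a Pillow image. `save_images` is a hypothetical helper, not a pdfreader function, and the PNG format and file naming are arbitrary choices.

```python
from pathlib import Path
from typing import Any, Iterable, List


def save_images(images: Iterable[Any], out_dir: str, prefix: str = "image") -> List[Path]:
    """Save each image as a PNG file and return the written paths.

    Assumption: every item has a to_Pillow() method returning a Pillow image.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, image in enumerate(images):
        pil_image = image.to_Pillow()
        # PNG cannot store CMYK data, so convert such images to RGB first
        if pil_image.mode == "CMYK":
            pil_image = pil_image.convert("RGB")
        path = target / f"{prefix}-{index:03d}.png"
        pil_image.save(path)
        written.append(path)
    return written
```

With the `images` list built by the extraction loop, `save_images(images, "extracted")` would write `image-000.png`, `image-001.png`, and so on into an `extracted` directory.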