pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

For optimization of extracting text page by page #533

Closed playgithub closed 4 years ago

playgithub commented 4 years ago

Questions:

  1. StringIO.seek and StringIO.truncate don't seem efficient enough. Is there a better way?
  2. The parser will not be closed on an exception, but with PDFParser(fp) as parser does not work. Is there a good way to handle this?

from io import StringIO
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import TextConverter

path = 'sample.PDF'

with open(path, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)

    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed

    resource_manager = PDFResourceManager()

    with StringIO() as string_io:
        with TextConverter(resource_manager, string_io) as device:
            interpreter = PDFPageInterpreter(resource_manager, device)
            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)
                string_io.seek(0)
                t = string_io.read()
                string_io.seek(0)
                string_io.truncate()
                print(t)

    parser.close()

pietermarsman commented 4 years ago

Hi @playgithub,

Thanks for your question! Unfortunately, I don't think I fully understand what you're asking for. Could you be more specific about what you want, and why you need StringIO?

playgithub commented 4 years ago

camelot is slow when it has to parse a lot of pages to find some table, so I'd like to use pdfminer first to find candidate pages for camelot to parse. The search is based on text matching: when the text on a page matches the condition for a candidate, the page index is saved to be used by camelot.
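The filtering step described above can be sketched independently of pdfminer. This is a minimal illustration, not code from the thread: `candidate_pages` and `to_pages_argument` are hypothetical helper names, the regular expression is only an example condition, and the comma-separated string is assumed to be the form expected by camelot's `pages` argument.

```python
import re


def candidate_pages(page_texts, pattern, flags=re.IGNORECASE):
    """Return the 1-based numbers of the pages whose text matches `pattern`."""
    regex = re.compile(pattern, flags)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if regex.search(text)
    ]


def to_pages_argument(page_numbers):
    """Format page numbers as a comma-separated string such as '2,4'."""
    return ",".join(str(n) for n in page_numbers)


# Stand-in page texts; in practice these would come from pdfminer.
texts = ["Introduction", "Table 1: revenue", "Notes", "table 2: costs"]
pages = candidate_pages(texts, r"table\s+\d+")
print(pages)                     # [2, 4]
print(to_pages_argument(pages))  # 2,4
```

Numbering starts at 1 here on the assumption that the consumer counts pages that way; use `start=0` if zero-based indices are needed.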

pietermarsman commented 4 years ago

So I assume you want to extract all the text from a page, inspect the result and save the page number if the result has some property.

If you don't want to use StringIO.seek and StringIO.truncate, you could reinitialize the StringIO for every page, like this:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.utils import open_filename

laparams = LAParams()

with open_filename('your.pdf', "rb") as fp:
    rsrcmgr = PDFResourceManager(caching=True)

    for page in PDFPage.get_pages(fp, caching=True):
        with StringIO() as output_string:
            device = TextConverter(rsrcmgr, output_string, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            interpreter.process_page(page)
            print(output_string.getvalue())

About the other question:

> parser will not be close on exception

What do you mean by this? As far as I know the parser has no connections that it manages. It gets a file-like object, but that is managed outside the parser.

playgithub commented 4 years ago

Less code, I've tested it, same performance.

From the API's point of view, parser.close() should be called when the parser is no longer used. Maybe I have to use try ... finally to make sure parser.close() is called.
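One way to get the try ... finally behaviour without writing it out is `contextlib.closing` from the standard library, which calls `close()` on exit even when the body raises. The sketch below is an illustration under assumptions: `use_parser` and `count_pages` are hypothetical helper names, and it relies only on the parser exposing a `close()` method, as the code in the original question does.

```python
from contextlib import closing


def use_parser(fp, parser_factory, work):
    """Call work(parser); parser.close() runs even if work() raises."""
    with closing(parser_factory(fp)) as parser:
        return work(parser)


def count_pages(path):
    """Count the pages of a PDF while guaranteeing the parser is closed."""
    # Imported here so use_parser stays usable without pdfminer.six installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    def work(parser):
        doc = PDFDocument(parser)
        return sum(1 for _ in PDFPage.create_pages(doc))

    with open(path, "rb") as fp:
        return use_parser(fp, PDFParser, work)
```

`closing` only wraps the object in a context manager, so it works for any object with a `close()` method, which is why it can stand in where `with PDFParser(fp) as parser` does not.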

pietermarsman commented 4 years ago

> Less code, I've tested it, same performance.

On second thought, I think the time spent in StringIO is negligible compared to the time spent in pdfminer itself.
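A rough way to check the buffer cost in isolation is to compare the two strategies without pdfminer at all. This is a sketch under assumptions: the function names are hypothetical, `buffer.write(chunk)` merely stands in for `interpreter.process_page(page)`, and the timings it prints depend on the machine, so no figures are claimed here.

```python
from io import StringIO
from timeit import timeit


def texts_with_reused_buffer(chunks):
    """Collect each chunk through one StringIO that is reset after every read."""
    results = []
    with StringIO() as buffer:
        for chunk in chunks:
            buffer.write(chunk)  # stands in for interpreter.process_page(page)
            results.append(buffer.getvalue())
            buffer.seek(0)       # move back to the start ...
            buffer.truncate()    # ... and drop everything after it
    return results


def texts_with_fresh_buffers(chunks):
    """Collect each chunk through a new StringIO per iteration."""
    results = []
    for chunk in chunks:
        with StringIO() as buffer:
            buffer.write(chunk)
            results.append(buffer.getvalue())
    return results


# Synthetic "pages" of a few kilobytes each.
chunks = ["page %d text " % i * 200 for i in range(100)]
assert texts_with_reused_buffer(chunks) == texts_with_fresh_buffers(chunks)
print("reused:", timeit(lambda: texts_with_reused_buffer(chunks), number=200))
print("fresh: ", timeit(lambda: texts_with_fresh_buffers(chunks), number=200))
```

Both variants return the same strings, so the choice between them can be made on readability.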

A try ... except ... finally block indeed makes sense.

playgithub commented 4 years ago

> On second thought, I think the speed of StringIO is neglectable compared to the speed of pdfminer itself.

But pdfminer has no API to extract text page by page, which is why StringIO is used. Any better way would be appreciated.

pietermarsman commented 4 years ago

Yes, I get that using StringIO is a bit cumbersome. But using input and output streams is the way it was set up, and that is difficult to change. Anyway, besides being cumbersome, it doesn't incur any performance penalty.
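The StringIO plumbing can at least be hidden behind a small generator, so calling code sees one page of text at a time. This is a sketch built only from the pdfminer.six calls already shown in this thread; `iter_page_texts` is a hypothetical name, not part of the library, and the 1-based numbering is an assumption.

```python
from io import StringIO


def iter_page_texts(path, laparams=None):
    """Yield (page_number, text) for each page, numbering from 1."""
    # Imported lazily; these are the same pdfminer.six names used in the thread.
    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager(caching=True)
    with open(path, "rb") as fp:
        pages = PDFPage.get_pages(fp, caching=True)
        for number, page in enumerate(pages, start=1):
            # A fresh buffer per page, as in the example earlier in the thread.
            with StringIO() as buffer:
                device = TextConverter(rsrcmgr, buffer, laparams=laparams)
                PDFPageInterpreter(rsrcmgr, device).process_page(page)
                yield number, buffer.getvalue()


# Usage, assuming a file named your.pdf exists:
# for number, text in iter_page_texts("your.pdf"):
#     if "Table" in text:
#         print(number)
```

Because it is a generator, a caller that stops after the first matching page does not pay for the remaining pages. Passing `LAParams()` as `laparams` would enable layout analysis, as in the example earlier in the thread.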