Closed: playgithub closed this issue 4 years ago
Hi @playgithub,
Thanks for your question! Unfortunately, I don't think I totally understand what you're asking for. Could you be more specific about what you want, and why you need StringIO?
camelot is slow when parsing a lot of pages to find some table, so I'd like to use pdfminer first to find candidate pages for camelot to parse, based on text matching. When the text on a page matches the condition for a candidate, the page index is saved to be used by camelot.
So I assume you want to extract all the text from a page, inspect the result and save the page number if the result has some property.
If you don't want to use StringIO.seek and StringIO.truncate, maybe reinitialize the StringIO for every page like this:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.utils import open_filename

laparams = LAParams()
with open_filename('your.pdf', "rb") as fp:
    rsrcmgr = PDFResourceManager(caching=True)
    for page in PDFPage.get_pages(fp, caching=True):
        # A fresh StringIO per page, so no seek/truncate is needed
        with StringIO() as output_string:
            device = TextConverter(rsrcmgr, output_string, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            interpreter.process_page(page)
            print(output_string.getvalue())
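Building on the loop above, a minimal sketch of the page-filtering step could look like the following. The helper names `iter_page_texts` and `find_candidate_pages`, the regular-expression matching, and the 1-based page numbering are assumptions made for illustration; they are not part of pdfminer or camelot.

```python
import re
from io import StringIO
from typing import Iterable, Iterator, List


def iter_page_texts(path: str) -> Iterator[str]:
    """Yield the extracted text of each page, using a fresh StringIO per page."""
    # pdfminer imports are kept local so the matching helper below
    # can be used without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager(caching=True)
    laparams = LAParams()
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp, caching=True):
            with StringIO() as output_string:
                device = TextConverter(rsrcmgr, output_string, laparams=laparams)
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                interpreter.process_page(page)
                yield output_string.getvalue()


def find_candidate_pages(page_texts: Iterable[str], pattern: str) -> List[int]:
    """Return the 1-based numbers of the pages whose text matches `pattern`."""
    regex = re.compile(pattern)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if regex.search(text)
    ]
```

A call such as `find_candidate_pages(iter_page_texts('your.pdf'), r"Table \d+")` (the pattern is only an example) would then give the page numbers to hand to the table parser, so only those pages need the slow parsing step.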
About the other question:
parser will not be closed on exception
What do you mean by this? As far as I know the parser has no connections that it manages. It gets a file-like object, but that is managed outside the parser.
Less code, I've tested it, same performance.
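For comparison, the two buffer-handling styles discussed here can be sketched with plain strings standing in for the per-page output that TextConverter would write. The function names are hypothetical, and the sketch only shows that both styles collect the same text; it makes no claim about timing.

```python
from io import StringIO
from typing import Iterable, List


def collect_with_reset(chunks: Iterable[str]) -> List[str]:
    """Reuse one buffer, clearing it with seek(0) + truncate(0) after each chunk."""
    results = []
    buffer = StringIO()
    for chunk in chunks:
        buffer.write(chunk)
        results.append(buffer.getvalue())
        buffer.seek(0)      # move the position back to the start
        buffer.truncate(0)  # drop the old contents
    buffer.close()
    return results


def collect_with_fresh_buffer(chunks: Iterable[str]) -> List[str]:
    """Create a new buffer for every chunk, as in the per-page loop."""
    results = []
    for chunk in chunks:
        with StringIO() as buffer:
            buffer.write(chunk)
            results.append(buffer.getvalue())
    return results
```

The second form needs no explicit reset, which is where the "less code" comes from.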
From the API's point of view, parser.close() should be called when the parser is no longer used. Maybe I have to use try ... finally to make sure parser.close() is called.
On second thought, I think the cost of StringIO is negligible compared to the running time of pdfminer itself.
A try, except, finally indeed makes sense.
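One way to get the try/finally behaviour without writing it out is `contextlib.closing` from the standard library, which calls `close()` on the wrapped object when the block exits, including on an exception. The sketch below uses a hypothetical `StandInParser` class instead of a real parser; that `close()` is the right cleanup hook for PDFParser is taken from the discussion here and is not verified by this example.

```python
from contextlib import closing


class StandInParser:
    """Hypothetical stand-in for a parser object that exposes close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def parse_and_fail(parser):
    """Use the parser inside closing() and raise, to show close() still runs."""
    with closing(parser):
        raise ValueError("simulated parsing error")


parser = StandInParser()
try:
    parse_and_fail(parser)
except ValueError:
    pass  # the error still propagates; close() was called on the way out
```

With a real parser the same pattern would read `with closing(PDFParser(fp)) as parser:`, assuming the parser object has a `close()` method as described in this thread.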
But pdfminer has no API to extract text page by page, so StringIO is used. Any better way would be appreciated.
Yes, I get that using StringIO is a bit cumbersome. But the use of input and output streams is how it was set up, and it's difficult to change. Anyway, besides being cumbersome, it doesn't cause any performance penalty.
Questions:
1. StringIO.seek and StringIO.truncate do not seem efficient enough; is there a better way?
2. parser will not be closed on exception, but with PDFParser(fp) as parser does not work; is there a good way?