gwk / pdfminer3

Python 3 fork of pdfminer/pdfminer.six.
MIT License

While performing Layout analysis, do not collect certain LT elements #5

Open halfghost opened 5 years ago

halfghost commented 5 years ago

The problem I have been facing is that some of the PDFs I scrape for text occasionally contain a figure made up of more than 10,000 individual objects. I want to collect the text data and the LTLine/LTCurve data, along with their locations (for table analysis), but I do not need figures for the analysis. I have been trying to work out whether there is a way to avoid collecting figure data so that my script runs faster.

This is the script I am using to extract PDF data: extract_pdf_layout.docx

The script hangs on the line interpreter.process_page(page), which can take over 10 minutes for a single page.

I just want to know if there is a way to skip over collecting figure data, to speed up my script.

Thank you
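
One possible approach, sketched under assumptions rather than taken from the attached script: the interpreter brackets everything it draws for a figure with begin_figure() and end_figure() calls on the layout device, so a device subclass can count that nesting and discard paths, images, and characters while the count is above zero. The mixin name SkipFigureContentMixin is hypothetical. The method names it overrides (begin_page, begin_figure, end_figure, paint_path, render_image, render_char) are assumed to match the layout analyzer in the installed pdfminer3 version, and the import path in the trailing comment is likewise an assumption.

```python
class SkipFigureContentMixin:
    """Hypothetical mixin: drop everything drawn inside a figure.

    Intended to sit in front of a pdfminer layout device such as
    PDFPageAggregator in the MRO. Arguments are passed through untouched,
    so the exact method signatures of the installed version do not matter.
    """

    _figure_depth = 0  # how many figures we are currently nested inside

    def begin_page(self, *args, **kwargs):
        # Reset per page so an unbalanced figure cannot leak into the next one.
        self._figure_depth = 0
        return super().begin_page(*args, **kwargs)

    def begin_figure(self, *args, **kwargs):
        # Do not call the base class: no LTFigure container is created.
        self._figure_depth += 1

    def end_figure(self, *args, **kwargs):
        self._figure_depth -= 1

    def paint_path(self, *args, **kwargs):
        # Lines, rectangles, and curves are kept only outside figures.
        if self._figure_depth == 0:
            return super().paint_path(*args, **kwargs)
        return None

    def render_image(self, *args, **kwargs):
        if self._figure_depth == 0:
            return super().render_image(*args, **kwargs)
        return None

    def render_char(self, *args, **kwargs):
        # The caller adds the return value to the text position, so a number
        # must come back even when the character is discarded.
        if self._figure_depth > 0:
            return 0
        return super().render_char(*args, **kwargs)


# Assumed usage with pdfminer3 (not verified against the attached script):
#
#   from pdfminer3.converter import PDFPageAggregator
#
#   class FigureFreeAggregator(SkipFigureContentMixin, PDFPageAggregator):
#       pass
#
#   device = FigureFreeAggregator(rsrcmgr, laparams=laparams)
```

This only helps if the expensive drawing is wrapped in a figure (a form XObject or an image); vector graphics written directly into the page's content stream still arrive as ordinary LTLine/LTCurve objects. The interpreter also still parses the figure's content stream, so the sketch avoids building layout objects but not the parsing cost. If parsing is the bottleneck, a subclass of the page interpreter that skips XObjects altogether may be worth trying instead.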