The problem I am facing is that some of the PDFs I am scraping for text occasionally contain a figure made up of more than 10,000 individual objects. I want to collect the text and the LTLine/LTCurve data, along with their locations (for table analysis), but I do not need the figures for the analysis.
I have been trying to figure out whether there is a way to avoid collecting figure data, to speed up my script.
This is the script I am using to extract the PDF data: extract_pdf_layout.docx
The script gets hung up, taking over 10 minutes for a single page, on the line: interpreter.process_page(page)
I just want to know if there is a way to skip over collecting figure data, to speed up my script.
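For reference, here is a rough sketch of the kind of thing I have in mind. It is not taken from my attached script. It assumes that the figures arrive through the PDF "Do" operator (form XObjects and images), which as far as I understand is where pdfminer calls begin_figure and builds an LTFigure. The names make_xobject_skipping_interpreter and iter_page_layouts are hypothetical helpers I made up for illustration. I have not confirmed that this removes the slowdown on my files, and it would also drop any text or lines stored inside a form XObject, while a figure drawn directly in the page content stream would not be affected.

```python
def make_xobject_skipping_interpreter(base_cls):
    """Return a subclass of base_cls whose 'Do' operator is a no-op.

    base_cls is expected to be pdfminer's PDFPageInterpreter; it is passed
    in as an argument so this helper does not import pdfminer itself.
    """
    class XObjectSkippingInterpreter(base_cls):
        def do_Do(self, xobjid_arg):
            # 'Do' paints an external object (a form XObject or an image).
            # Assumption: returning here means the device never sees
            # begin_figure, so no LTFigure or its children are built.
            return None

    return XObjectSkippingInterpreter


def iter_page_layouts(pdf_path):
    """Yield one layout (LTPage) per page, with XObjects skipped."""
    # Imported lazily so the helper above can be used on its own.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter_cls = make_xobject_skipping_interpreter(PDFPageInterpreter)
    interpreter = interpreter_cls(rsrcmgr, device)

    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            yield device.get_result()
```

With something like this, I would loop over iter_page_layouts("some_file.pdf") and keep only the text, LTLine and LTCurve objects from each layout, exactly as before.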
Thank you