I am trying to extract text together with the coordinates of that text. The code below extracts both, but each extracted string includes a trailing \n, which I think is affecting the coordinates. Could you please take a look? Is there a way to avoid the \n and get the exact coordinates of the text, or is there another method I should use?
from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

# `file` is the path to the PDF being processed
with open(file, 'rb') as file_pdf:
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    pages = PDFPage.get_pages(file_pdf)
    txt = []  # text of each line
    box = []  # bounding box of each line
    for page in pages:
        interpreter.process_page(page)
        layout = device.get_result()
        for lobj in layout:
            if isinstance(lobj, LTTextBox):
                # iterating a text box yields its text lines
                for i in lobj:
                    txt.append(i.get_text())
                    box.append(list(i.bbox))

txt[0:2]
This is the output
['Corporate deductions\n', '1. Depreciation and amortization\n']
Hey,
I believe the coordinates are based on where the text boxes are. If a box was originally defined as containing line breaks, that would be why the line break is included in the text of the box.
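If it helps, here is a minimal sketch of one way to get the text without the trailing \n and a box computed only from the actual glyphs. As far as I can tell, the \n is an LTAnno object that pdfminer.six appends to each line during layout analysis, and LTAnno objects carry no coordinates, so stripping it from the string should not change the line's bbox. The helper names `union_bbox` and `iter_clean_lines` are my own and not part of pdfminer.six, and I am assuming horizontal text extracted with the default LAParams.

```python
def union_bbox(boxes):
    """Return the smallest (x0, y0, x1, y1) rectangle covering every box."""
    boxes = list(boxes)
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def iter_clean_lines(pdf_path):
    """Yield (text, bbox) for each text line, without the trailing newline.

    Hypothetical helper: pdfminer.six is imported here so that union_bbox
    stays usable on its own.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextBox, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextBox):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                # Drop only the newline added at the end of the line.
                text = line.get_text().rstrip("\n")
                # Only LTChar objects have coordinates; the inserted
                # "\n" and " " characters are LTAnno objects without a bbox.
                char_boxes = [ch.bbox for ch in line if isinstance(ch, LTChar)]
                bbox = union_bbox(char_boxes) if char_boxes else line.bbox
                yield text, bbox
```

You could then collect the results with something like `txt, box = zip(*iter_clean_lines(file))`. If you need the coordinates of a single word rather than a whole line, the same idea applies: take the union of the bboxes of just the LTChar objects that make up that word.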