pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License
5.94k stars 930 forks

Please remove the \n during text and coordinate extraction. #814

Open Laxmi530 opened 2 years ago

Laxmi530 commented 2 years ago

Feature request

Hi,

Thank you for providing such a beautiful library.

I am trying to extract text together with the coordinates of that text. With the code below I am able to extract both, but the extracted text includes \n, which I think is affecting the coordinates. Could you please look into avoiding the \n so that I get the exact coordinates of the text? If there is already a method for this, please guide me.

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

# `file` is the path to the input PDF
with open(file, 'rb') as file_pdf:
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    txt = []  # text of each line
    box = []  # bounding box (x0, y0, x1, y1) of each line
    for page in PDFPage.get_pages(file_pdf):
        interpreter.process_page(page)
        layout = device.get_result()
        for lobj in layout:
            if isinstance(lobj, LTTextBox):
                # Iterating over a text box yields its individual text lines
                for line in lobj:
                    txt.append(line.get_text())
                    box.append(list(line.bbox))

txt[0:2] gives:

['Corporate deductions\n', '1. Depreciation and amortization\n']

box[0:2] gives:

[[46.8199, 731.6683, 178.4599, 743.6683],
 [46.8199, 714.72472, 185.96764, 724.74472]]

Thank you in advance.

pettzilla1 commented 2 years ago

Hey, I believe the coordinates are based on where the text boxes are. If a box was originally defined as containing line breaks, that would be why the line break is included in the text of that box.
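If the goal is just to get each line's text without the trailing \n, one option is to strip it after extraction. Below is a minimal sketch, not an official pdfminer.six recipe. The helper names `clean_lines` and `extract_lines` are hypothetical; `extract_pages`, `LTTextBox`, `get_text()` and `bbox` are the same pdfminer.six names used in the question. As far as I can tell, the line break is only added to the text of a line and has no box of its own, so stripping it should leave the bbox unchanged, but treat that as an assumption and check it against your own PDFs.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]


def clean_lines(lines: Iterable) -> List[Tuple[str, BBox]]:
    """Pair each line's text (trailing newline removed) with its bbox.

    `lines` can be any iterable of objects exposing get_text() and bbox,
    for example the text lines obtained by iterating over an LTTextBox.
    """
    results = []
    for line in lines:
        text = line.get_text().rstrip("\n")
        if text.strip():  # skip lines that are empty or whitespace only
            results.append((text, tuple(line.bbox)))
    return results


def extract_lines(pdf_path: str) -> List[Tuple[str, BBox]]:
    """Hypothetical wrapper: collect cleaned (text, bbox) pairs for a PDF."""
    # Imported here so clean_lines() can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox

    results = []
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, LTTextBox):
                results.extend(clean_lines(element))
    return results
```

Calling `extract_lines("example.pdf")` (the file name is a placeholder) should return a list of `(text, bbox)` tuples, with the text free of the trailing line break and the coordinates taken directly from each line's `bbox`.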