jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

Annotation coordinates mismatch on landscape-oriented pages #403

Closed joesmith0 closed 2 years ago

joesmith0 commented 3 years ago

Describe the bug

This occurs in PDF documents with a mix of landscape- and portrait-oriented pages. On the landscape-oriented pages, the bounding boxes for highlight annotations (x0, x1, top, bottom) do not match the expected values for the words they should enclose. x0 can be negative in these cases, which suggests the annotation is being positioned as if the page were in portrait orientation.

Code to reproduce the problem

import pdfplumber

fn = "example.pdf"  # placeholder path; no PDF was attached to this report

with pdfplumber.open(fn) as pdf:
    # Bounding box of the first annotation in the document
    annot = pdf.annots[0]
    print({key: annot[key] for key in ('x0', 'x1', 'top', 'bottom')})

    # Words on the annotated page that the highlight should cover
    page = pdf.pages[annot['page_number'] - 1]
    words = [word for word in page.extract_words() if word['text'] in ('December', '2019')]

    for word in words:
        print({key: word[key] for key in ('x0', 'x1', 'top', 'bottom', 'text')})
        print()

Expected behavior

The bounding boxes of the words printed above should be contained within the bounding box of the annotation.
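
The expected containment can be written as a small check. This is a minimal sketch, assuming the annotation and each word are dicts with `x0`, `x1`, `top`, and `bottom` keys, as in the reproduction code. The helper name `bbox_contains` and its `tol` parameter are hypothetical and not part of pdfplumber.

```python
def bbox_contains(outer, inner, tol=0.0):
    """Return True if `inner` lies within `outer`, allowing `tol` units of slack.

    Both arguments are dicts with 'x0', 'x1', 'top', 'bottom' keys
    (hypothetical helper, not a pdfplumber function).
    """
    return (
        outer["x0"] - tol <= inner["x0"]
        and inner["x1"] <= outer["x1"] + tol
        and outer["top"] - tol <= inner["top"]
        and inner["bottom"] <= outer["bottom"] + tol
    )
```

With the reproduction code, `all(bbox_contains(annot, word, tol=1.0) for word in words)` would be expected to hold on portrait pages, and the report is that it does not on landscape pages.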

Actual behavior

The bounding box of the annotation is completely off the target values (sometimes negative).

Environment

samkit-jain commented 3 years ago

Hi @joesmith0, thanks for using the library and filing a bug report. To investigate this further, could you please share a PDF that demonstrates the issue? Please remove any sensitive information from the PDF before sharing it.

SasiAravind commented 3 years ago

Hi @samkit-jain @jsvine, is there any way to search for keywords containing spaces, e.g. ['Team size', 'Company Name'], while using "page.extract_words()"?
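
One possible approach, sketched under assumptions: `page.extract_words()` returns one dict per whitespace-separated word, so a phrase such as 'Team size' can be matched by comparing runs of consecutive words. The helper `find_phrases` below is hypothetical (not a pdfplumber function), matches text exactly, and assumes the words arrive in reading order.

```python
def find_phrases(words, phrases):
    """Return (phrase, bbox) pairs for phrases found in runs of consecutive words.

    `words` is a list of dicts with 'text', 'x0', 'x1', 'top', 'bottom' keys,
    in reading order. Hypothetical helper, not part of pdfplumber.
    """
    matches = []
    for phrase in phrases:
        tokens = phrase.split()
        if not tokens:
            continue  # skip empty phrases
        n = len(tokens)
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            if [w["text"] for w in window] == tokens:
                # Merge the word boxes into one box covering the whole phrase
                bbox = {
                    "x0": min(w["x0"] for w in window),
                    "x1": max(w["x1"] for w in window),
                    "top": min(w["top"] for w in window),
                    "bottom": max(w["bottom"] for w in window),
                }
                matches.append((phrase, bbox))
    return matches
```

Usage would look like `find_phrases(page.extract_words(), ['Team size', 'Company Name'])`. A phrase that wraps across two lines still matches as long as its words are adjacent in the list, and the merged box then spans both lines.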

joesmith0 commented 3 years ago

Hey @samkit-jain, unfortunately I wasn't able to get the redaction tool to work on my PDFs... Do you have any PDFs at hand with a mix of portrait and landscape pages? The characters should all be upright (same direction) despite the orientation.

jsvine commented 2 years ago

Closing this issue due to inactivity, lack of issue-reproducing PDF, and lack of other users expressing similar issues. Feel free to continue the discussion, however, especially if someone comes across a PDF that allows us to reproduce.

jsvine commented 3 months ago

Just leaving a note here that the issue (hard to confirm given the lack of a PDF) is likely fixed in v0.11.1, specifically commit aaa35c9.