jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

hyperlinks have negative height #845

Open bentsi opened 1 year ago

bentsi commented 1 year ago

Describe the bug

The height property of a hyperlink has a negative value.

Code to reproduce the problem

1) Open the PDF with pdfplumber.
2) Inspect pdf_file.pages[61].hyperlinks.

PDF file

https://www.singtel.com/content/dam/singtel/about-us/sustainability/reports/Singtel-Group-Sustainability-Report-2022.pdf

Expected behavior

height should be a positive number

Actual behavior

height has a negative value

Screenshots

image

Environment

Additional context

In addition, the "top" and "bottom" attributes are swapped, which doesn't comply with pdfplumber's bounding box definitions as discussed in https://github.com/jsvine/pdfplumber/issues/198

jsvine commented 1 year ago

Hi @bentsi, thanks for sharing this example. The height, top, and bottom attributes are all calculated from the raw annotation's Rect (bounding box), specified by the PDF in a direct command.

In this particular PDF (as observed by opening it in a text editor), that Rect command is Rect[428.053 634.536 453.041 626.144], which corresponds to exactly what you see for x0, y0, x1, y1 in your screenshot above, suggesting that pdfplumber is collecting the correct information.
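
To make the arithmetic concrete, here is a minimal sketch of how a Rect whose y-values are listed in descending order yields a negative height. It assumes height is derived as y1 - y0 and that top and bottom are measured down from the top of the page. The page height of 842 is a made-up value for illustration, not taken from this PDF.

```python
# Rect values copied from the annotation quoted above.
x0, y0, x1, y1 = 428.053, 634.536, 453.041, 626.144

# Hypothetical page height in points; NOT read from the actual PDF.
page_height = 842.0

# Assumed derivation: height from the raw y-values, top/bottom flipped
# into a coordinate system whose origin is the top-left corner.
height = y1 - y0
top = page_height - y1
bottom = page_height - y0

print(round(height, 3))  # negative, because y1 < y0 in this Rect
print(top > bottom)      # top ends up below bottom, i.e. they are swapped
```

Under these assumptions the negative height and the swapped top/bottom both follow directly from the order of the values in the Rect, not from an extraction error.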

Given that, there would seem to be two main options:

My inclination is toward the first option, because trying to fix PDF-creator's mistakes seems like opening a can of worms. But I'm open to suggestions otherwise.
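
For anyone who needs consistent boxes in the meantime, one possible user-side workaround is to normalize the hyperlink dicts after extraction. The sketch below assumes each hyperlink is a plain dict carrying keys such as x0, x1, y0, y1, top, bottom, width, and height. `normalize_box` is a hypothetical helper, not part of pdfplumber, and the top/bottom values in the example again assume a page height of 842.

```python
def normalize_box(obj):
    """Return a copy of obj with ordered coordinates and non-negative sizes."""
    fixed = dict(obj)  # leave the original dict untouched
    # Swap each coordinate pair that is in descending order.
    for lo, hi in (("x0", "x1"), ("y0", "y1"), ("top", "bottom")):
        if lo in fixed and hi in fixed and fixed[lo] > fixed[hi]:
            fixed[lo], fixed[hi] = fixed[hi], fixed[lo]
    # Recompute the sizes from the now-ordered coordinates.
    if "x0" in fixed and "x1" in fixed:
        fixed["width"] = fixed["x1"] - fixed["x0"]
    if "top" in fixed and "bottom" in fixed:
        fixed["height"] = fixed["bottom"] - fixed["top"]
    return fixed


# Example built from the Rect in this issue (top/bottom are illustrative).
link = {
    "x0": 428.053, "y0": 634.536, "x1": 453.041, "y1": 626.144,
    "top": 215.856, "bottom": 207.464, "height": -8.392,
}
fixed = normalize_box(link)
print(fixed["height"] > 0, fixed["top"] < fixed["bottom"])
```

With pdfplumber this could be applied as `[normalize_box(h) for h in page.hyperlinks]`. It only reorders the reported values and does not change what is stored in the PDF.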