jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Can internal links be extracted? #318

Closed by markfirmware 3 years ago

markfirmware commented 3 years ago

(Asked previously at https://github.com/pdfminer/pdfminer.six/issues/536 which produced a good answer - the list of the destinations of all the links - thanks @pietermarsman - but I still need the text that is seen in the rectangle of the link. In the example below, the text I am expecting is RXD.)

On page 154: https://infocenter.nordicsemi.com/pdf/nRF51_RM_v3.0.1.pdf#page=154

there is a link from RXD to:

https://infocenter.nordicsemi.com/pdf/nRF51_RM_v3.0.1.pdf#%5B%7B%22num%22%3A2105%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C56.692%2C610.554%2Cnull%5D
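For reference, the fragment after the `#` in that URL is a percent-encoded JSON array. A minimal sketch of decoding it with the Python standard library follows. Reading the result as an object reference, a fit mode and coordinates is an assumption based on how the values look, not something confirmed in this thread.

```python
import json
from urllib.parse import unquote, urlsplit

# The link target quoted above, split across lines only for readability.
url = (
    "https://infocenter.nordicsemi.com/pdf/nRF51_RM_v3.0.1.pdf"
    "#%5B%7B%22num%22%3A2105%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D"
    "%2C56.692%2C610.554%2Cnull%5D"
)

# urlsplit() leaves the fragment percent-encoded, so decode it explicitly.
fragment = unquote(urlsplit(url).fragment)

# The decoded fragment is valid JSON and parses into a plain Python list.
dest = json.loads(fragment)
print(dest)
```

The first element looks like a reference to a PDF object (number 2105, generation 0), followed by what appears to be a fit mode name and a position on the target page.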

How can all of these internal links be extracted?

Thanks, Mark

jsvine commented 3 years ago

You can obtain the location of annotations via page.annots, which is a list; each entry should have its bounding box specified by the (x0, top, x1, bottom) properties. You should be able to pass those coordinates to page.crop(...), and then use .extract_text(...) on the cropped page to determine what text lies within the annotation's rectangle.

A toy example:

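import pdfplumber
from pdfplumber.utils import obj_to_bbox

pdf = pdfplumber.open("path.pdf")
page = pdf.pages[0]
# page.annots is a list of annotation dicts; take the first one
a = page.annots[0]
# Crop the page to the annotation's bounding box and read the text inside it
cropped = page.crop(obj_to_bbox(a))
print(cropped.extract_text())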
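To cover every annotation on a page rather than just the first, the same idea could be wrapped in a loop. This is only a sketch: `annotation_texts` is a hypothetical helper name, it assumes each annotation dict carries the `x0`, `top`, `x1` and `bottom` keys mentioned above, and it assumes `page.crop(...)` raises `ValueError` when a box falls outside the page.

```python
def annotation_texts(page):
    """Return (text, annot) pairs for every annotation on a pdfplumber page.

    `page` is expected to behave like a pdfplumber page: it exposes an
    `annots` list of dicts and a `crop(bbox)` method.
    """
    results = []
    for annot in page.annots:
        # Build the bounding box from the annotation's own coordinates.
        bbox = (annot["x0"], annot["top"], annot["x1"], annot["bottom"])
        try:
            cropped = page.crop(bbox)
        except ValueError:
            # Assumption: crop() raises ValueError for a box outside the page.
            continue
        results.append((cropped.extract_text(), annot))
    return results


# Usage (assumes pdfplumber is installed and "path.pdf" exists):
#   import pdfplumber
#   with pdfplumber.open("path.pdf") as pdf:
#       for text, annot in annotation_texts(pdf.pages[0]):
#           print(repr(text), annot.get("uri"))
```

Returning the whole annotation dict alongside the text leaves it to the caller to decide which fields matter. For internal links the `uri` field may well be empty, and the destination may have to be read from the annotation's other fields.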

Does this answer your question?

markfirmware commented 3 years ago

Thank you! I will be able to confirm the solution in the next couple of days. Thanks again!

markfirmware commented 3 years ago

Yes, this works. Thanks!