markfirmware closed 3 years ago
You can obtain the location of annotations via page.annots; each should have its bounding box specified by the (x0, top, x1, bottom) properties. You should be able to pass those coordinates to page.crop(...), and then use .extract_text(...) to determine what text is below the annotation.
A toy example:
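import pdfplumber
from pdfplumber.utils import obj_to_bbox

pdf = pdfplumber.open("path.pdf")
page = pdf.pages[0]
# page.annots is a list of annotation dicts; take the first one
a = page.annots[0]
# Crop the page to the annotation's bounding box and read the text inside it
cropped = page.crop(obj_to_bbox(a))
print(cropped.extract_text())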
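To cover every annotation rather than just the first, the same idea can be put in a loop. The following is only a rough sketch: the helper names clamp_bbox, annotation_texts and print_annotation_texts are made up for illustration, and it assumes each annotation dict carries the x0, top, x1 and bottom keys mentioned above. It clips each box to page.bbox and skips empty boxes, on the assumption that page.crop(...) may reject a box that extends past the page or has no area. It does not filter by annotation type, so non-link annotations are included too.

```python
def clamp_bbox(bbox, bounds):
    """Clip an (x0, top, x1, bottom) box so it lies within `bounds`."""
    x0, top, x1, bottom = bbox
    bx0, btop, bx1, bbottom = bounds
    return (max(x0, bx0), max(top, btop), min(x1, bx1), min(bottom, bbottom))


def annotation_texts(page):
    """Return (bbox, text) pairs for each annotation on a pdfplumber-style page."""
    results = []
    for annot in page.annots:
        raw = (annot["x0"], annot["top"], annot["x1"], annot["bottom"])
        bbox = clamp_bbox(raw, page.bbox)
        # Skip boxes that are empty or fall entirely outside the page.
        if bbox[2] <= bbox[0] or bbox[3] <= bbox[1]:
            continue
        # extract_text() may return None or "" when the crop holds no text.
        text = page.crop(bbox).extract_text() or ""
        results.append((bbox, text.strip()))
    return results


def print_annotation_texts(path):
    """Print the text under every annotation in the PDF at `path`."""
    import pdfplumber  # imported here so the helpers above need no dependency

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for bbox, text in annotation_texts(page):
                print(page.page_number, bbox, repr(text))
```

Calling print_annotation_texts("path.pdf") would then list, page by page, each annotation's box and the text found inside it.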
Does this answer your question?
Thank you! I will be able to confirm the solution in the next couple of days. Thanks again!
Yes, this works. Thanks!
(Asked previously at https://github.com/pdfminer/pdfminer.six/issues/536 which produced a good answer - the list of the destinations of all the links - thanks @pietermarsman - but I still need the text that is seen in the rectangle of the link. In the example below, the text I am expecting is RXD.)
On page 154: https://infocenter.nordicsemi.com/pdf/nRF51_RM_v3.0.1.pdf#page=154
there is a link from RXD to:
https://infocenter.nordicsemi.com/pdf/nRF51_RM_v3.0.1.pdf#%5B%7B%22num%22%3A2105%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C56.692%2C610.554%2Cnull%5D
How can all of these internal links be extracted?
Thanks, Mark