jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.02k stars, 619 forks

Get the Text associated with the hyperlinks - PdfPlumber #940

Closed. mukundhareddy1996 closed this issue 11 months ago

mukundhareddy1996 commented 11 months ago

I am not able to get the text associated with the hyperlinks.

After loading the document, page.annots or page.hyperlinks gives us the links and their rect boxes, but I don't see any text associated with them.

In simple words, take this sentence: I want to extract the hyperlink from here.

I want to rephrase it with the extracted link inserted, as below: I want to extract the hyperlink: https://google.com from here.

To do this, I need the text associated with the hyperlink, but I don't see any option to get it. Getting just the link without the associated text does not make sense.

I have tried to intersect the rects of the hyperlinks (URI) with the chars, but nothing matched:

import pdfplumber

pdf = pdfplumber.open("Any file with hyperlinks in it")
page = pdf.pages[0]
# Look for chars whose x0 equals one of the hyperlink rects' x0 values
for char in page.objects["char"]:
    if char["x0"] in ["258.11","121.35","82.2"]:
        print("identified the character of hyperlink")
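A note on why this attempt most likely finds nothing: `char["x0"]` is a number, while the list holds strings, and even with numbers, exact equality on coordinates is fragile. Below is a minimal sketch of a tolerance-based containment check instead. The helper name `char_in_rect`, the `tol` parameter, and the sample coordinates are all illustrative assumptions. The dicts only mimic the `x0`/`top`/`x1`/`bottom` keys that pdfplumber uses for chars and hyperlinks.

```python
def char_in_rect(char, rect, tol=0.5):
    """Return True if the char's center lies inside rect, padded by tol points."""
    cx = (char["x0"] + char["x1"]) / 2
    cy = (char["top"] + char["bottom"]) / 2
    return (
        rect["x0"] - tol <= cx <= rect["x1"] + tol
        and rect["top"] - tol <= cy <= rect["bottom"] + tol
    )


# Hypothetical data shaped like pdfplumber's hyperlink and char dicts.
link = {"x0": 82.2, "top": 100.0, "x1": 110.0, "bottom": 112.0,
        "uri": "https://google.com"}
chars = [
    {"text": "h", "x0": 82.2004, "x1": 88.0, "top": 101.0, "bottom": 111.0},
    {"text": "e", "x0": 88.0, "x1": 93.0, "top": 101.0, "bottom": 111.0},
    {"text": "r", "x0": 93.0, "x1": 97.0, "top": 101.0, "bottom": 111.0},
    {"text": "e", "x0": 97.0, "x1": 102.0, "top": 101.0, "bottom": 111.0},
    {"text": ".", "x0": 120.0, "x1": 123.0, "top": 101.0, "bottom": 111.0},
]

# A float is never equal to a string, so a membership test like this is False.
string_match = chars[0]["x0"] in ["82.2"]

# Collect the chars whose centers fall inside the link rectangle.
matched = "".join(c["text"] for c in chars if char_in_rect(c, link))
print(matched)
```

Testing the char's center rather than its edges keeps a glyph that slightly overhangs the link rectangle from being dropped.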


jsvine commented 11 months ago

Hi @mukundhareddy1996, and thanks for your interest in this library. Unfortunately, hyperlinks in PDFs do not automatically come associated with text. Your instinct to find the text overlapping with the hyperlink is correct, although I don't quite understand the code example you've shared. But doing something like this (using a PDF available in this repository's tests/pdfs directory) seems to work fine:

import pdfplumber
pdf = pdfplumber.open("tests/pdfs/pdffill-demo.pdf")
page = pdf.pages[0]
link = page.hyperlinks[0]

text = page.crop((
    link["x0"],
    link["top"],
    link["x1"],
    link["bottom"],
)).extract_text()

print(text)

... printing:

Online Help
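
The snippet above handles only the first link. As a rough sketch, the same crop-and-extract step can be repeated for every entry in `page.hyperlinks`, pairing each URI with the text under its rectangle. The function name `hyperlink_texts` is an illustrative assumption, not part of pdfplumber. It relies only on `page.hyperlinks`, `page.crop(...)`, and `.extract_text()` as used in the reply above, plus the `uri` key on each hyperlink.

```python
def hyperlink_texts(page):
    """Pair the text under each hyperlink rectangle with that link's URI.

    `page` is expected to behave like a pdfplumber page: it exposes
    `hyperlinks` (dicts with x0/top/x1/bottom/uri) and `crop(bbox)`.
    """
    results = []
    for link in page.hyperlinks:
        bbox = (link["x0"], link["top"], link["x1"], link["bottom"])
        # Crop to the link rectangle and extract whatever text falls inside it.
        text = page.crop(bbox).extract_text()
        results.append((text, link["uri"]))
    return results
```

With a real file this could be called as `hyperlink_texts(pdfplumber.open(path).pages[0])`, where `path` is your own PDF. Each `(text, uri)` pair can then be formatted however the rephrased output requires.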

mukundhareddy1996 commented 11 months ago

Thanks @jsvine. Your understanding is correct: I need the text associated with the hyperlinks. I ran a basic test with the logic above and it works. Thank you 😊