larytet closed this issue 4 years ago
@larytet According to the PDF specification, hypertext links are not part of the page content. They are annotations (see 12.5 Annotations), which help the user interact with the document. That's why you should access them through the document structure rather than the document view.
Annotations can be defined in Page objects, which are accessible through the document Catalog, SimplePDFViewer.doc.root. URL annotations have a subtype of Link.
Here is the code:
>>> import pdfreader
>>> fd = open("my-link.pdf", "rb")
>>> viewer = pdfreader.SimplePDFViewer(fd)
>>> first_page = viewer.doc.root.Pages.Kids[0]
>>> links = [annot.A.URI for annot in first_page.Annots if annot.Subtype == 'Link']
>>> links
[b'http://mylink.com/']
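One caveat: Kids[0] is a page only when the page tree is flat. The PDF page tree may nest intermediate Pages nodes, and a Link annotation may point to an internal destination instead of a URI. Here is a minimal sketch of a walk that handles both cases. It runs on plain dicts standing in for the parsed document, and the helper names iter_pages and collect_link_uris are hypothetical, not part of pdfreader. Whether pdfreader objects support the same .get() access is an assumption you should check.

```python
def iter_pages(node):
    """Yield leaf Page dicts from a page-tree node, depth first."""
    if node.get("Type") == "Pages":
        # Intermediate node: descend into each child.
        for kid in node.get("Kids", []):
            yield from iter_pages(kid)
    else:
        yield node


def collect_link_uris(pages_root):
    """Return the URI of every Link annotation that carries a URI action."""
    uris = []
    for page in iter_pages(pages_root):
        for annot in page.get("Annots") or []:
            action = annot.get("A") or {}
            if annot.get("Subtype") == "Link" and action.get("S") == "URI":
                uris.append(action["URI"])
    return uris


# Hypothetical stand-in for a parsed document (plain dicts, not pdfreader objects).
link = {"Type": "Annot", "Subtype": "Link",
        "A": {"Type": "Action", "S": "URI", "URI": b"http://mylink.com/"}}
internal = {"Type": "Annot", "Subtype": "Link", "Dest": "chapter1"}  # no URI action
tree = {
    "Type": "Pages",
    "Kids": [
        {"Type": "Pages", "Kids": [{"Type": "Page", "Annots": [link, internal]}]},
        {"Type": "Page"},  # a page without annotations
    ],
}

print(collect_link_uris(tree))  # [b'http://mylink.com/']
```

The internal link is skipped because it has no A entry, and the page without Annots contributes nothing.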
Unlike HTML, PDF defines a link as a part of the viewing area. It is not a text property or attribute. In this specific case it's just a rectangular area:
>>> first_page.Annots
[{'Type': 'Annot', 'Subtype': 'Link', 'Border': [0, 0, 0], 'Rect': [56, Decimal('721.4'), Decimal('123.7'), Decimal('735.2')], 'A': {'Type': 'Action', 'S': 'URI', 'URI': b'http://mylink.com/'}}]
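To make the rectangle easier to work with, here is a small sketch that pairs each URL with its clickable area. It assumes the annotations behave like the plain dicts printed above and that Rect holds the lower-left and upper-right corners in PDF user-space units (origin at the bottom-left of the page). The function name describe_links is hypothetical, not a pdfreader API.

```python
from decimal import Decimal


def describe_links(annots):
    """Return the URL and clickable rectangle of each URI link annotation."""
    results = []
    for annot in annots:
        action = annot.get("A") or {}
        if annot.get("Subtype") != "Link" or action.get("S") != "URI":
            continue
        # Rect mixes ints and Decimals; normalise everything to float.
        x0, y0, x1, y1 = (float(v) for v in annot["Rect"])
        results.append({
            # Assumption: the URI is plain ASCII, as in this example.
            "url": action["URI"].decode("ascii"),
            "rect": (x0, y0, x1, y1),
            "width": round(x1 - x0, 2),
            "height": round(y1 - y0, 2),
        })
    return results


# The annotation list from the example above, as plain dicts.
annots = [{
    "Type": "Annot", "Subtype": "Link", "Border": [0, 0, 0],
    "Rect": [56, Decimal("721.4"), Decimal("123.7"), Decimal("735.2")],
    "A": {"Type": "Action", "S": "URI", "URI": b"http://mylink.com/"},
}]

for info in describe_links(annots):
    print(info["url"], info["rect"], info["width"], info["height"])
```

The link text itself is not stored in the annotation. To recover it you would have to find the text drawn inside this rectangle on the page, which this sketch does not attempt.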
I'm going to add a section about links to the project docs.
Hi,
I have a simple PDF, https://github.com/larytet/YALAS/releases/download/12/my-link.pdf, containing a hyperlink. My goal is to get the reference and the link text.
How do I get the reference itself, http://mylink.com? Thanks