Get hyperlinks from a PDF

@larytet According to the PDF specification, hypertext links are not part of the page content. They are annotations (see 12.5 Annotations), which help the user interact with the document. That's why you should access them through the document structure rather than the document view.

Annotations are defined on Page objects, which are accessible through the document Catalog, exposed as SimplePDFViewer.doc.root. URL annotations have the subtype Link.

Here is the code:

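>>> import pdfreader
>>> fd = open("my-link.pdf", "rb")
>>> viewer = pdfreader.SimplePDFViewer(fd)
>>> # Kids[0] is the first page here; in a nested page tree it may be an intermediate Pages node
>>> first_page = viewer.doc.root.Pages.Kids[0]
>>> # collect the target URI of every Link annotation on the page
>>> links = [annot.A.URI for annot in first_page.Annots if annot.Subtype == 'Link']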
>>> links
[b'http://mylink.com/']
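
The one-liner above assumes every Link annotation carries a URI action. Link annotations can also point inside the document, in which case there is no URI to read. A minimal sketch of a more tolerant version follows. It works on plain dicts shaped like the annotation printed later in this comment; `extract_uris` and the sample data are hypothetical and not part of pdfreader, and whether pdfreader objects support `.get()` the same way is an assumption.

```python
def extract_uris(annotations, encoding="utf-8"):
    """Return URI strings from Link annotations, skipping everything else."""
    uris = []
    for annot in annotations or []:  # a page may have no annotations at all
        if annot.get("Subtype") != "Link":
            continue
        action = annot.get("A") or {}
        uri = action.get("URI")
        if uri is None:
            continue  # e.g. an internal link with a destination instead of a URI
        if isinstance(uri, bytes):
            # URIs come back as bytes in the session above; decode for convenience
            uri = uri.decode(encoding, errors="replace")
        uris.append(uri)
    return uris


# Hypothetical sample data mirroring the structure shown in this comment
sample = [
    {"Type": "Annot", "Subtype": "Link",
     "A": {"Type": "Action", "S": "URI", "URI": b"http://mylink.com/"}},
    {"Type": "Annot", "Subtype": "Link", "Dest": "chapter1"},  # internal link
    {"Type": "Annot", "Subtype": "Text"},                      # not a link
]

print(extract_uris(sample))  # ['http://mylink.com/']
```

With real pdfreader objects you may need attribute access (annot.A.URI) as in the session above instead of `.get()`.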

Unlike HTML, PDF defines a link as a part of the viewing area, not as a text property or attribute. In this specific case it's just a rectangular area:

>>> first_page.Annots
[{'Type': 'Annot', 'Subtype': 'Link', 'Border': [0, 0, 0], 'Rect': [56, Decimal('721.4'), Decimal('123.7'), Decimal('735.2')], 'A': {'Type': 'Action', 'S': 'URI', 'URI': b'http://mylink.com/'}}]
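
Since the link is only a rectangle, finding "the link under a point" comes down to a bounds check on Rect. Here is a rough sketch of that idea, assuming Rect holds two diagonally opposite corners in default user space units, as the PDF specification describes rectangles. `normalize_rect` and `link_at` are hypothetical helpers written against plain dicts, not pdfreader functions.

```python
from decimal import Decimal


def normalize_rect(rect):
    """Return (left, bottom, right, top) regardless of corner order."""
    x0, y0, x1, y1 = (Decimal(str(v)) for v in rect)
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)


def link_at(annotations, x, y):
    """Return the first Link annotation whose Rect contains (x, y), else None."""
    px, py = Decimal(str(x)), Decimal(str(y))
    for annot in annotations:
        if annot.get("Subtype") != "Link":
            continue
        left, bottom, right, top = normalize_rect(annot["Rect"])
        if left <= px <= right and bottom <= py <= top:
            return annot
    return None


# The annotation from the output above, as a plain dict
annots = [{
    "Type": "Annot", "Subtype": "Link", "Border": [0, 0, 0],
    "Rect": [56, Decimal("721.4"), Decimal("123.7"), Decimal("735.2")],
    "A": {"Type": "Action", "S": "URI", "URI": b"http://mylink.com/"},
}]

hit = link_at(annots, 100, 730)   # point inside the rectangle
miss = link_at(annots, 300, 730)  # point to the right of it
print(hit["A"]["URI"] if hit else None)
print(miss)
```

Matching the rectangle against the text drawn at the same coordinates is a separate problem, because the annotation itself carries no reference to the text it covers.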

I'm going to add a section about links to the project docs.

maxpmaxp / pdfreader

Get hyperlinks from a PDF #43