maxpmaxp / pdfreader

Python API for PDF documents
MIT License

Get hyperlinks from a PDF #43

larytet closed 4 years ago

larytet commented 4 years ago

Hi,

I have a simple PDF, https://github.com/larytet/YALAS/releases/download/12/my-link.pdf, containing a hyperlink. My goal is to get the reference and the link text.

import pdfreader

fd = open("my-link.pdf", "rb")
viewer = pdfreader.SimplePDFViewer(fd)
viewer.navigate(1)
viewer.render()
viewer.canvas.strings  # I get correct text "MyLink Text"
viewer.canvas.text_content # Encoded (raw) representation of "MyLink Text" ?
viewer.stream # I get b'0.1 w\nq 0 0 612 792 re\nW* n\n .... 
viewer.resources.XObject # Empty dictionary {}

How do I get the reference itself (http://mylink.com)? Thanks

maxpmaxp commented 4 years ago

@larytet According to the PDF specification, hypertext links are not part of the page content. They are annotations (see section 12.5, Annotations), which help the user interact with the document. That's why you should access them through the document structure rather than the document view.

Annotations are defined in Page objects and are accessible through the document Catalog, which is SimplePDFViewer.doc.root. URL annotations have the subtype Link.

Here is the code:

>>> import pdfreader
>>> fd = open("my-link.pdf", "rb")
>>> viewer = pdfreader.SimplePDFViewer(fd)
>>> first_page = viewer.doc.root.Pages.Kids[0]
>>> links = [annot.A.URI for annot in first_page.Annots if annot.Subtype == 'Link']
>>> links
[b'http://mylink.com/']
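
The snippet above reads Kids[0] directly, which is enough for this one-page file. In general, a PDF page tree can nest intermediate Pages nodes, and a page may have no Annots entry at all. The following is only a sketch of that traversal using plain Python dicts as stand-ins for the document objects; it does not call pdfreader, and the names iter_pages, link_uris, and the sample tree are hypothetical.

```python
def iter_pages(node):
    """Yield leaf Page dictionaries from a page-tree node, depth first."""
    if node.get("Type") == "Pages":
        # Intermediate node: descend into each kid.
        for kid in node.get("Kids", []):
            yield from iter_pages(kid)
    else:
        # Leaf node: an actual page.
        yield node


def link_uris(page):
    """Return the URIs of all Link annotations on a page (Annots is optional)."""
    uris = []
    for annot in page.get("Annots", []):
        action = annot.get("A", {})
        # Only links with a URI action carry an external address.
        if annot.get("Subtype") == "Link" and "URI" in action:
            uris.append(action["URI"])
    return uris


# Hypothetical two-level page tree built from plain dicts (assumption:
# real document objects expose the same keys).
tree = {
    "Type": "Pages",
    "Kids": [
        {"Type": "Pages", "Kids": [
            {"Type": "Page", "Annots": [
                {"Subtype": "Link",
                 "A": {"S": "URI", "URI": b"http://mylink.com/"}},
            ]},
        ]},
        {"Type": "Page"},  # a page without annotations
    ],
}

all_links = [uri for page in iter_pages(tree) for uri in link_uris(page)]
print(all_links)
```

With real pdfreader objects the same idea should apply, but the attribute-style access shown in the session above would replace the dict lookups.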

Unlike HTML, PDF defines a link as a region of the viewing area, not as a property or attribute of the text. In this specific case it is just a rectangular area:

>>> first_page.Annots
[{'Type': 'Annot', 'Subtype': 'Link', 'Border': [0, 0, 0], 'Rect': [56, Decimal('721.4'), Decimal('123.7'), Decimal('735.2')], 'A': {'Type': 'Action', 'S': 'URI', 'URI': b'http://mylink.com/'}}]
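
Because the link is only a rectangle, relating it to anything on the page means comparing coordinates. As a rough sketch, assuming the annotation is available as a plain dict shaped like the output above, the URI can be decoded and the Rect normalized so that a point can be tested against it. The helpers link_regions and hit_test, and the sample coordinates, are hypothetical and not part of pdfreader.

```python
from decimal import Decimal


def link_regions(annots):
    """Return (url, (x0, y0, x1, y1)) for every Link annotation with a URI."""
    regions = []
    for annot in annots:
        if annot.get("Subtype") != "Link":
            continue
        uri = annot.get("A", {}).get("URI")
        if uri is None:
            continue  # e.g. a link whose action is not a URI action
        xa, ya, xb, yb = (float(v) for v in annot["Rect"])
        # A PDF rectangle is any two diagonally opposite corners,
        # so normalize to (left, bottom, right, top).
        rect = (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))
        if isinstance(uri, bytes):
            url = uri.decode("ascii", errors="replace")
        else:
            url = str(uri)
        regions.append((url, rect))
    return regions


def hit_test(regions, x, y):
    """Return the URL whose rectangle contains the point (x, y), else None."""
    for url, (x0, y0, x1, y1) in regions:
        if x0 <= x <= x1 and y0 <= y <= y1:
            return url
    return None


# The annotation from the output above, as a plain dict.
annots = [{
    "Type": "Annot", "Subtype": "Link", "Border": [0, 0, 0],
    "Rect": [56, Decimal("721.4"), Decimal("123.7"), Decimal("735.2")],
    "A": {"Type": "Action", "S": "URI", "URI": b"http://mylink.com/"},
}]

regions = link_regions(annots)
print(regions)
print(hit_test(regions, 90, 728))   # a point inside the rectangle
print(hit_test(regions, 300, 400))  # a point elsewhere on the page
```

Matching the rectangle to the link text itself would additionally require the position of each string on the page, which this sketch does not attempt.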

I'm going to add a section about links to the project docs.