sambitdash closed this 5 years ago
@gwierzchowski Please review this in relation to your comment on #34.
In my implementation I only use the Outlines entry from the PDF catalog.
catalog = pdDocGetCatalog(doc)
toc_ref = get(catalog, cn"Outlines")
# ...
It is optional, so the method can return nothing if there is no such entry. This matches the case in which most GUI viewers display a TOC panel. It is also what the Python library PyPDF2 returns from getOutlines(). I think bookmark annotations are a different functionality (for a different function). I agree that 2 and 3 are out of scope: that is a matter for client applications specific to certain PDF files, or perhaps for some code written as an example.
@gwierzchowski your understanding covers case 1, which is perfect. Outlines and bookmarks are used synonymously in places in PDF, hence the confusion. Case 2 is a good use case, though: you can create virtually complete HTML-like tagged interpretations when documents have such nice representations, but creators very seldom map tags properly. Outlines are used in almost all PDFs, so extracting them can be really helpful.
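To make the "optional entry" behaviour concrete, here is a minimal sketch of the traversal. It does not use the PDFIO API: plain Julia `Dict`s stand in for PDF dictionary objects, and `Obj`, `collect_outline!` and `outline_titles` are hypothetical names chosen for illustration. The only thing taken from the PDF specification is the key layout, where the outline root and each item link to their first child via `First`, to the next sibling via `Next`, and items carry a `Title`.

```julia
# Plain Dicts stand in for PDF dictionary objects; this is NOT the PDFIO API.
const Obj = Dict{String,Any}

# Walk the sibling chain starting at `item`, recording (depth, title) pairs.
function collect_outline!(out::Vector{Tuple{Int,String}}, item, depth::Int)
    while item !== nothing
        push!(out, (depth, item["Title"]))
        # Descend into the children before moving on to the next sibling.
        collect_outline!(out, get(item, "First", nothing), depth + 1)
        item = get(item, "Next", nothing)
    end
    return out
end

# The Outlines entry is optional, so return `nothing` when it is absent.
function outline_titles(catalog::Obj)
    root = get(catalog, "Outlines", nothing)
    root === nothing && return nothing
    return collect_outline!(Tuple{Int,String}[], get(root, "First", nothing), 0)
end

# Hypothetical document: two chapters, the first with one section.
sec11 = Obj("Title" => "1.1 Scope")
ch2 = Obj("Title" => "2 Usage")
ch1 = Obj("Title" => "1 Introduction", "First" => sec11, "Next" => ch2)
catalog = Obj("Outlines" => Obj("First" => ch1))

toc = outline_titles(catalog)
```

A catalog without the entry, such as `outline_titles(Obj())`, yields `nothing`. A real implementation would also need to resolve indirect references and guard against cyclic `Next` links in malformed files, which this sketch leaves out.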
https://acrobatusers.com/tutorials/how-do-i-add-bookmarks-to-a-pdf-document
The video shows how users can add bookmarks to a PDF document. Hence, there is a misconception that bookmarks are annotations. However, from the PDF specification point of view bookmarks are not annotations.
Submitted PR with implementation proposal.
PDF document outlines can be extracted from 3 distinct sources:
The scope of PDFIO is only 1 and 2. 3 can be created as a separate module over PDFIO to address knowledge-oriented problems. Eventually, text extraction APIs should move into the new module.