sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Outlines from PDF documents should be extracted #35

Closed sambitdash closed 5 years ago

sambitdash commented 5 years ago

PDF document outlines can be extracted from 3 distinct sources:

  1. PDF bookmarks, which show up in Adobe Reader as the TOC.
  2. PDF structure derived from marked content in tagged PDFs.
  3. Document structure analysis by learning or heuristics.

Only 1 and 2 are in scope for PDFIO. 3 can be built as a separate module on top of PDFIO to address knowledge-oriented problems. Eventually, the text extraction APIs should move into that new module.

sambitdash commented 5 years ago

@gwierzchowski Please review this in relation to your comment on #34.

gwierzchowski commented 5 years ago

In my implementation I only use the Outlines entry from the PDF catalog.

catalog = pdDocGetCatalog(doc)
toc_ref = get(catalog, cn"Outlines")
# ...

This entry is optional, so the method can return nothing if it is absent. This matches the case where most GUI viewers display a TOC panel, and it is also what the Python library PyPDF2 returns from getOutlines(). I think bookmark annotations are a different piece of functionality (for a different function). I agree that 2 and 3 are out of scope: they are a matter for client applications specific to certain PDF files, or perhaps for some code written as an example.
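The approach described above (read the optional Outlines entry, return nothing when it is missing, otherwise walk the items) might be sketched roughly as follows. This is only an illustration, not the PR's implementation and not the PDFIO object model: plain Julia Dicts stand in for parsed PDF dictionaries, and PDFDict, collect_outline! and get_outline are hypothetical names. The only thing taken from the PDF specification is that outline items are chained through the First, Next and Title keys.

```julia
# Hypothetical stand-in for parsed PDF dictionaries, keyed by PDF name strings.
# This is NOT how PDFIO represents objects; it only mimics the shape of the data.
const PDFDict = Dict{String,Any}

# Walk a chain of sibling outline items via "Next", descending into children
# via "First". Appends (depth, title) pairs to `out` in document order.
# Note: no guard against cyclic links, which malformed PDFs can contain.
function collect_outline!(out::Vector{Tuple{Int,String}}, item, depth::Int=0)
    while item !== nothing
        push!(out, (depth, get(item, "Title", "")))
        collect_outline!(out, get(item, "First", nothing), depth + 1)
        item = get(item, "Next", nothing)
    end
    return out
end

# The "Outlines" entry is optional, so return `nothing` when it is absent.
function get_outline(catalog::PDFDict)
    root = get(catalog, "Outlines", nothing)
    root === nothing && return nothing
    return collect_outline!(Tuple{Int,String}[], get(root, "First", nothing))
end

# Small made-up outline: two top-level items, the first with one child.
sec11   = PDFDict("Title" => "1.1 Scope")
ch2     = PDFDict("Title" => "2 Usage")
ch1     = PDFDict("Title" => "1 Introduction", "First" => sec11, "Next" => ch2)
catalog = PDFDict("Outlines" => PDFDict("First" => ch1))

for (depth, title) in get_outline(catalog)
    println("  "^depth, title)
end
```

Returning a flat list of (depth, title) pairs keeps the sketch short; a nested structure would work just as well if callers need the hierarchy directly.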

sambitdash commented 5 years ago

@gwierzchowski your understanding covers case 1. That's perfect. Outlines and bookmarks are used synonymously in places in PDF, hence the confusion. Case 2 is a good use case, though: when documents are tagged well, you can build a complete HTML-like tagged interpretation. In practice, however, creators very seldom map tags properly. Outlines are used in almost all PDFs, so extracting them can be really helpful.

sambitdash commented 5 years ago

https://acrobatusers.com/tutorials/how-do-i-add-bookmarks-to-a-pdf-document

The video shows how users can add bookmarks to a PDF document, which leads to the misconception that bookmarks are annotations. From the point of view of the PDF specification, however, bookmarks are not annotations.

gwierzchowski commented 5 years ago

Submitted PR with implementation proposal.

sambitdash commented 5 years ago

#45 and #49 addressed this to a great extent. Closing.