Closed tsoernes closed 3 weeks ago
- Some PDFs consist of multiple subdocuments with multiple ToCs. I would like to retrieve all of them, not just the first.
Not sure what you mean by "subdocument". If a file is embedded in a PDF, then it can be extracted and opened.
- Some PDFs have a List of Figures / List of Tables that has the same structure as the Table of Contents (with figure number, figure name, and page number). I would like to get those as well.
This information is stored as standard text. There is no way to identify it using any meta-information in the PDF, so you are on your own here.
No. In the PDFs that I have, it is just like the ToC, with clickable links and page numbers. In some cases it is exactly the same as the ToC.
Any reference / index / list of <whatever> outside the PDF's own internal TOC (the stuff accessible via doc.get_toc()) is standard text - even when also covered by links.
What I am saying is that there is no way to find or interpret this other than by text extraction combined with additional, semantically sensitive code. We do not intend to provide such support in the foreseeable future.
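For anyone who wants to try the "text extraction plus extra code" route, here is a minimal sketch of what that could look like. It is not part of PyMuPDF: the pattern ENTRY_RE and the function find_list_entries are hypothetical names, and the regular expression assumes entries shaped like "Figure 1.2 Caption ..... 17". The input is meant to be plain page text, for example what page.get_text() returns, and real documents will likely need a different pattern.

```python
import re

# Hypothetical pattern for one entry of a List of Figures / List of Tables:
#   "Figure 1.2 System overview ........ 17"
#   "Table 2: Results summary 9"
# Assumption: each entry sits on one line and ends with its page number.
ENTRY_RE = re.compile(
    r"^\s*(?P<kind>Figure|Table)\s+"      # entry type
    r"(?P<number>\d+(?:\.\d+)*)[.:]?\s+"  # number such as 3 or 1.2
    r"(?P<title>.+?)"                     # caption text (as short as possible)
    r"[\s.]*\s(?P<page>\d+)\s*$"          # optional dot leaders, then the page number
)


def find_list_entries(text):
    """Return (kind, number, title, page) tuples for lines that look like entries."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            entries.append((m["kind"], m["number"], m["title"], int(m["page"])))
    return entries


if __name__ == "__main__":
    sample = (
        "List of Figures\n"
        "Figure 1.2 System overview ........ 17\n"
        "Table 2: Results summary 9\n"
        "Figure 1 shows the layout.\n"
    )
    for entry in find_list_entries(sample):
        print(entry)
```

This is purely a text heuristic, so ordinary sentences that start with "Figure N" and end in a number will be picked up as false positives. Restricting the search to pages that actually contain links (page.get_links() returns them) may help narrow things down, but that is left out of the sketch.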
Describe the solution you'd like