Get List of Figures / Get List of Tables / Get multiple Table of Contents

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

https://pymupdf.readthedocs.io

GNU Affero General Public License v3.0


Get List of Figures / Get List of Tables / Get multiple Table of Contents #4027

Closed (tsoernes closed this 3 weeks ago)

tsoernes commented 3 weeks ago

Describe the solution you'd like

Some PDFs consist of multiple subdocuments with multiple ToCs. I would like to retrieve all of them, not just the first.
Some PDFs have a List of Figures / List of Tables with the same structure as the Table of Contents (figure number, figure name, and page number). I would like to get those as well.

JorjMcKie commented 3 weeks ago

Some PDFs consist of multiple subdocuments with multiple ToCs. I would like to retrieve all of them, not just the first.

Not sure what you mean by "subdocument". If a file is embedded in a PDF, then it can be extracted and opened.
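As a rough illustration of the embedded-file route, the sketch below collects the TOC of a document plus the TOC of every attachment that can itself be opened as a PDF. It assumes the "subdocuments" are stored as embedded files, which may not be true for a given PDF. The helper names `collect_tocs` and `_open_pdf_bytes` are made up for this example. The PyMuPDF calls it relies on are `Document.get_toc()`, `Document.embfile_names()`, `Document.embfile_get()` and `pymupdf.open(stream=..., filetype="pdf")`.

```python
def _open_pdf_bytes(data):
    """Open raw bytes as a PDF document (hypothetical helper)."""
    import pymupdf  # imported lazily; older releases are imported as `fitz`
    return pymupdf.open(stream=data, filetype="pdf")


def collect_tocs(doc, open_bytes=_open_pdf_bytes):
    """Return {label: toc} for `doc` and for each embedded file that opens as a PDF.

    `doc` is expected to be an already opened pymupdf.Document.
    """
    tocs = {"<main>": doc.get_toc()}
    for name in doc.embfile_names():
        data = doc.embfile_get(name)  # raw bytes of the attachment
        try:
            sub = open_bytes(data)
        except Exception:
            continue  # attachment is not a readable PDF, skip it
        tocs[name] = sub.get_toc()
        sub.close()
    return tocs
```

With a real file this would be called as `collect_tocs(pymupdf.open("report.pdf"))`, where the file name is only a placeholder. Documents that were merged page by page into one PDF have no embedded files, so only the `<main>` entry would come back.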

Some PDFs have a List of Figures / List of Tables with the same structure as the Table of Contents (figure number, figure name, and page number). I would like to get those as well.

This information is stored as standard text. There is no way to identify it using any meta-information in the PDF, so you are on your own here.

tsoernes commented 3 weeks ago

Some PDFs have a List of Figures / List of Tables with the same structure as the Table of Contents (figure number, figure name, and page number). I would like to get those as well.

This information is stored as standard text. There is no way to identify it using any meta-information in the PDF, so you are on your own here.

No. In the PDFs that I have, it is just like the ToC, with clickable links and page numbers. In some cases it is exactly the same as the ToC.

JorjMcKie commented 3 weeks ago

Any reference / index / list of <whatever> outside the PDF's own internal TOC (the stuff accessible via doc.get_toc()) is standard text, even when it is also covered by links.
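To make the "text covered by links" point concrete, here is a minimal sketch that pairs each internal link on a page with the text lying under its rectangle. It is not a PyMuPDF feature, only one way to combine `Page.get_links()` and `Page.get_textbox()`. The function name `linked_entries` is invented for this example, and the caller is assumed to know which pages hold the List of Figures or List of Tables.

```python
LINK_GOTO = 1  # same value as pymupdf.LINK_GOTO (jump inside the same document)


def linked_entries(page):
    """Return (visible text, 1-based target page) for each internal link on `page`.

    `page` is expected to be a pymupdf.Page.
    """
    entries = []
    for link in page.get_links():
        if link.get("kind") != LINK_GOTO:
            continue  # ignore URIs, named destinations, external files
        # "from" is the clickable rectangle; read the text inside it
        label = " ".join(page.get_textbox(link["from"]).split())
        if label:
            entries.append((label, link["page"] + 1))  # "page" is 0-based
    return entries
```

The result is only as good as the link rectangles: a link that covers just the page number yields the number as its label, so some documents will need the whole text line around the link instead.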

What I am saying is that there is no way to find or interpret this other than by text extraction combined with additional, semantically sensitive code. We do not intend to provide such support in the foreseeable future.
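For readers who want to try the do-it-yourself route, the following is a minimal sketch of such "semantically sensitive" code. It takes plain text, for example what `page.get_text()` returns for the pages holding the list, and keeps lines shaped like `Figure 1.2 Some title ..... 12`. The line layout, the English keywords "Figure" and "Table", and the names `ENTRY` and `parse_list_entries` are all assumptions made for illustration. Real documents will need their own patterns.

```python
import re

# Assumed layout: "<Figure|Table> <number> <title> <dot leaders or spaces> <page>"
ENTRY = re.compile(
    r"^(?P<kind>Figure|Table)\s+(?P<number>\d+(?:[.\-]\d+)*)[.:]?\s+"
    r"(?P<title>.+?)[\s.]*(?P<page>\d+)$"
)


def parse_list_entries(text):
    """Return (kind, number, title, page) tuples for lines that look like list entries."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            entries.append(
                (match["kind"], match["number"], match["title"], int(match["page"]))
            )
    return entries
```

Entries that wrap over two lines, or whose page number is extracted as a separate text line, are missed by this pattern and would have to be joined first.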