py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Is there a way to extract content by outlines using PyPDF2? #1857

Closed bottergpt closed 1 year ago

bottergpt commented 1 year ago

Hi, is there a way to extract content by outlines using PyPDF2?

pubpub-zz commented 1 year ago

Yes: https://pypdf.readthedocs.io/en/stable/modules/PdfReader.html?highlight=bookmarks#pypdf.PdfReader.outline

You should find some examples and experience reports in closed issues.

pubpub-zz commented 1 year ago

@zhangqibot, OK to close it?

pubpub-zz commented 1 year ago

Without a reply, I assume the answer is sufficient.

bottergpt commented 1 year ago

@pubpub-zz Sorry for the late response. I have checked the info you provided, but I still haven't solved it. I can extract the text of a page with pdf.pages[0].extract_text(), but I can't do the same with pdf.outline[0]. It seems I should first get the exact coordinates of the text that an outline item points to, and then extract it with the page's extract_text() method?

pubpub-zz commented 1 year ago

Outlines are not "areas"; they just store destinations. Each one is structured as a dictionary like this one: {'/Title': 'Contents', '/Page': IndirectObject(85548, 0, 1893950978944), '/Type': '/XYZ', '/Left': 132, '/Top': 578, '/Zoom': NullObject}. From it you will be able to find the page. Here "/XYZ" indicates explicit coordinates, but other destination types exist (fit to page width, zoom, ...; cf. the PDF reference), and the remaining entries hold the coordinates or other information according to the type of destination. Once you have computed the coordinates you are looking for, you will have to use extract_text() with a visitor function (see the pypdf documentation).
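
The steps described above could be sketched roughly as follows. This is only an illustration, not a tested recipe from the pypdf maintainers: flatten_outline, user_space_y, and text_below are hypothetical helper names, "example.pdf" is a placeholder, and the sketch assumes an /XYZ destination whose /Top value is a y coordinate in the same space as the text positions reported to the visitor. It also assumes the position of a text fragment can be obtained by applying the current transformation matrix (cm) to the origin of the text matrix (tm); real PDFs may need more care (rotation, other destination types, sections that span several pages).

```python
from typing import Any, Iterator, List, Sequence, Tuple


def flatten_outline(outline: List[Any], depth: int = 0) -> Iterator[Tuple[int, Any]]:
    """Yield (depth, item) pairs; the outline nests child entries as sub-lists."""
    for item in outline:
        if isinstance(item, list):
            yield from flatten_outline(item, depth + 1)
        else:
            yield depth, item


def user_space_y(cm: Sequence[float], tm: Sequence[float]) -> float:
    """Assumption: map the text-matrix origin (tm[4], tm[5]) through cm."""
    return tm[4] * cm[1] + tm[5] * cm[3] + cm[5]


def text_below(page: Any, top: float) -> str:
    """Collect the text fragments of `page` located at or below y == top."""
    parts: List[str] = []

    def visitor(text, cm, tm, font_dict, font_size):
        # The visitor is called once per text fragment during extraction.
        if user_space_y(cm, tm) <= top:
            parts.append(text)

    page.extract_text(visitor_text=visitor)
    return "".join(parts)


# Possible usage with pypdf (not executed here; the file name is a placeholder):
#
#   from pypdf import PdfReader
#   reader = PdfReader("example.pdf")
#   for depth, dest in flatten_outline(reader.outline):
#       page_number = reader.get_destination_page_number(dest)
#       top = dest.get("/Top")  # only present for some destination types
#       if page_number is not None and page_number >= 0 and top is not None:
#           print(dest.get("/Title"), text_below(reader.pages[page_number], float(top))[:80])
```

To stop at the next outline entry instead of the end of the page, the same visitor could also compare against the /Top of the following destination when both point to the same page.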