Closed bottergpt closed 1 year ago
You should find some examples/experience in the closed issues.
@zhangqibot, OK to close it?
With no reply, I assume the answer was sufficient.
@pubpub-zz
Sorry for the late response. I have checked the info you provided, but I still haven't solved it.
I can extract the text from a page with pdf.pages[0].extract_text(), but I can't do it with pdf.outline[0]... It seems like I should first get the exact coordinates of the text in an outline and then extract it with the extract_text() method of a page?
Outlines are not "areas"; they just store destinations. Each one is structured as a dictionary like this one:
{'/Title': 'Contents', '/Page': IndirectObject(85548, 0, 1893950978944), '/Type': '/XYZ', '/Left': 132, '/Top': 578, '/Zoom': NullObject}
You will be able to find the page from it. Here "/XYZ" indicates that coordinates follow, but other destination types exist (fit to page width, zoom, ...; cf. the PDF reference), and the remaining entries hold the coordinates or other information, in accordance with the type of destination.
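To make the dictionary above concrete, here is a minimal sketch of how the outline entries could be listed together with their page index and coordinates. It assumes pypdf is installed; `walk_outline`, `describe_entry` and `list_outline` are hypothetical helper names, not part of the library. Only `PdfReader`, `reader.outline` and `reader.get_destination_page_number()` come from pypdf.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


def walk_outline(outline: list, depth: int = 0) -> Iterator[Tuple[int, Any]]:
    """Yield (depth, entry) pairs; child entries are stored as nested lists."""
    for item in outline:
        if isinstance(item, list):
            yield from walk_outline(item, depth + 1)
        else:
            yield depth, item


def describe_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the title, destination type and numeric coordinates.

    Non-numeric values (for example a NullObject) are reported as None.
    """
    def number(key: str) -> Optional[float]:
        value = entry.get(key)
        return float(value) if isinstance(value, (int, float)) else None

    return {
        "title": entry.get("/Title"),
        "type": entry.get("/Type"),
        "left": number("/Left"),
        "top": number("/Top"),
    }


def list_outline(pdf_path: str) -> None:
    """Print every outline entry with its zero-based page index."""
    from pypdf import PdfReader  # third-party; imported here on purpose

    reader = PdfReader(pdf_path)
    for depth, entry in walk_outline(reader.outline):
        info = describe_entry(entry)
        info["page_index"] = reader.get_destination_page_number(entry)
        print("  " * depth, info)
```

Only "/XYZ" destinations are expected to carry both "/Left" and "/Top"; for other destination types some of these keys may be missing, which is why the sketch falls back to None.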
Once you have computed the coordinates you are looking for, you will have to use extract_text() with a visitor function (see the pypdf documentation).
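A rough sketch of the visitor approach, under a few assumptions: the destination is of type "/XYZ" with a usable "/Top", the section of interest sits on a single page, and the vertical position of each text fragment can be read from the text matrix as `tm[5]` (for some PDFs the current transformation matrix `cm` also has to be taken into account). `make_band_visitor` and `extract_band` are hypothetical names; only the `visitor_text` parameter of `extract_text()` is pypdf's.

```python
from typing import Callable, List, Optional


def make_band_visitor(
    top: Optional[float], bottom: Optional[float], parts: List[str]
) -> Callable:
    """Build a visitor that keeps text whose y position satisfies bottom < y <= top."""

    def visitor(text, cm, tm, font_dict, font_size):
        y = tm[5]  # assumption: vertical position taken from the text matrix
        if (top is None or y <= top) and (bottom is None or y > bottom):
            parts.append(text)

    return visitor


def extract_band(
    page, top: Optional[float] = None, bottom: Optional[float] = None
) -> str:
    """Return the text of `page` lying between `top` and `bottom`.

    `page` is expected to be a pypdf page object, or anything else whose
    extract_text() accepts a `visitor_text` callback.
    """
    parts: List[str] = []
    page.extract_text(visitor_text=make_band_visitor(top, bottom, parts))
    return "".join(parts)
```

With a `PdfReader` named `reader` and an outline entry `entry`, the page would be `reader.pages[reader.get_destination_page_number(entry)]` and `top` would be `entry["/Top"]`. If the next outline entry points to the same page, its "/Top" can serve as `bottom`; otherwise leave `bottom` as None to read down to the end of the page.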
Hi, is there a way to extract content by outlines using PyPDF2?