pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.17k stars 495 forks

Garbled text when extracting a Chinese PDF #3538

Closed: java668 closed this issue 4 months ago

java668 commented 4 months ago

Description of the bug

Pythonܔزၥᦶ໛ຝҁ෫႕҂

ܔزၥᦶ༷ᬿ

Pythonၥᦶ໛ຝ

՗᫫կຝ຅ጱ᥯ଶ๶᧔҅ၥᦶ๋᯿ᥝጱྍṈฎࣁ᫫կ୏ݎጱ෸ײኴفྲ᫾অ҅ಅզࣁ෱๗ၥᦶጱኴف҅

՗᫫կᕪၧ਍ጱ᥯ଶӤ๶᧔҅ݎሿጱᳯ᷌ᥴ٬౮๜֗҅ಭفጱᩒრྲ᫾੝̶ࢩྌ҅੒Ӟӻၥᦶጱᔮᕹ҅

୏ত๋֯ጱၥᦶ੪ฎრդᎱᕆڦጱၥᦶ҅Ԟ੪ฎܔزၥᦶᴤྦྷ҅ᬯӻᬦᑕԞᤩ౮ԅጮፋၥᦶ̶ܔزၥᦶ

ฎ๋च๜Ԟฎ๋ବ੶ጱၥᦶᔄࣳ҅ܔزၥᦶଫአԭ๋च๜ጱ᫫կդᎱ҅ইᔄ҅ڍහ̶ොဩᒵ҅ܔزၥᦶ

᭗ᬦݢಗᤈጱෙ᥺༄ັᤩၥܔزጱᬌڊฎވჿ᪃ᶼ๗ᕮຎ̶ࣁၥᦶᰂਁरጱቘᦞӤ๶᧔҅᩼ஃӥጱၥᦶ

ಭفᩒრ᩼ṛ҅஑کጱࢧಸሲ᩼य़҅ᥠၥᦶᰂਁरཛྷࣳғ

ಲ୏᫫կຝ຅ጱ੶ᶎ҅ࣁᛔۖ۸ၥᦶጱ֛ᔮӾ҅ܔزၥᦶ໛ຝզ݊ܔزၥᦶጱᎣᦩ֛ᔮฎ஠ᶳᥝഩൎጱ

ದᚆԏӞ҅ܔزၥᦶጱᎣᦩ֛ᔮฎᛔۖ۸ၥᦶૡᑕ૵զ݊ၥᦶ୏ݎૡᑕ૵ጱᎣᦩ֛ᔮԏӞ҅ᘒӬฎ஠ᶳ

ٍ॓ጱᎣᦩԏӞ̶ࣁPython᧍᥺Ӿଫአ๋ଠာጱܔزၥᦶ໛ຝฎunittest޾pytest,unittestંԭຽٵପ҅

ݝᥝਞᤰԧPythonᥴ᯽࢏ݸ੪ݢզፗള੕فֵአԧ,pytestฎᒫӣොጱପ҅ᵱᥝܔᇿጱਞᤰ̶ܔزၥᦶ໛

ຝጱᎣᦩ֛ᔮ੪ࢱᕰunittest޾pytest๶ᦖᥴ̶

ጮፋၥᦶܻቘ PDF file: Python单元测试框架.pdf

How to reproduce the bug

Parsing the PDF file produces garbled text.

PyMuPDF version

1.23.x or earlier

Operating system

Linux

Python version

3.11

JorjMcKie commented 4 months ago

Please describe in English!

java668 commented 4 months ago

Using this tool to parse Chinese PDF documents resulted in garbled characters. Could you please help me take a look? Thank you very much. PDF document: Python单元测试框架.pdf

JorjMcKie commented 4 months ago

This PDF is full of errors. See the following log, produced when opening it:

import pymupdf
doc = pymupdf.open("Python (1).pdf")
print(pymupdf.TOOLS.mupdf_warnings())
format error: cannot recognize xref format
trying to repair broken xref
repairing PDF document
Bad or missing parent pointer in outline tree, repairing
... repeated 4 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 3 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 3 times...
Bad or missing prev pointer in outline tree, repairing
Bad or missing parent pointer in outline tree, repairing
... repeated 2 times...
Bad or missing prev pointer in outline tree, repairing
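
The warnings log above is long mainly because two messages alternate. The following is a minimal sketch, not part of PyMuPDF, that tallies each distinct message in such a log. The function name tally_warnings is hypothetical, and the code assumes the "... repeated N times..." marker gives the total count of the message just before it, which is an assumption about MuPDF's output and is not confirmed in this thread.

```python
import re
from collections import Counter

# Matches marker lines such as "... repeated 4 times..."
REPEAT_MARKER = re.compile(r"^\.\.\. repeated (\d+) times\.\.\.$")


def tally_warnings(log: str) -> Counter:
    """Count how often each distinct message occurs in a warnings log.

    Assumption: a marker line reports the TOTAL number of occurrences of
    the message printed immediately before it (that message was already
    counted once, so we add N - 1).
    """
    counts = Counter()
    previous = None
    for raw_line in log.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        marker = REPEAT_MARKER.match(line)
        if marker:
            if previous is not None:
                counts[previous] += int(marker.group(1)) - 1
            continue
        counts[line] += 1
        previous = line
    return counts
```

The string returned by pymupdf.TOOLS.mupdf_warnings() could be passed in directly, for example tally_warnings(pymupdf.TOOLS.mupdf_warnings()).most_common(), to see which messages dominate.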

When the document is then saved with only its first page, no PDF viewer or extraction tool can extract meaningful text.

doc.select([0])
doc.ez_save("page1.pdf")
java668 commented 4 months ago

pypdfium2 (https://github.com/pypdfium2-team/pypdfium2) can extract the text from this file. Can you help me take a look? Thank you very much.

JorjMcKie commented 4 months ago

Sorry - as I wrote: this file has severe defects. Whether or not some tools may still be able to extract things despite this is outside the scope of what we can deal with.

java668 commented 4 months ago

OK, thank you very much.

java668 commented 4 months ago

How can I determine whether this PDF has errors? Is there an API for that? Thank you very much.

JorjMcKie commented 4 months ago

Some errors are already detected when the PDF is opened, as in this case, where the central cross-reference (xref) table is broken. MuPDF will then try to repair things by generating a new xref table from walking through the full file. This is usually accompanied by error and warning messages. Some of those are written to the console; the full set of messages is also available via pymupdf.TOOLS.mupdf_warnings(), as shown.

Whether a repair was attempted can be determined by checking doc.is_repaired.

Not all errors can be detected at open time, though. Some only show up when certain information is extracted, such as text, or when the pages' visual appearance is rendered.
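
As an illustration of the points above, here is a minimal sketch of how these checks might be combined. The helper names inspect_pdf and needs_attention and the layout of the report dictionary are hypothetical; only pymupdf.open, doc.is_repaired, page.get_text() and the pymupdf.TOOLS warning functions come from PyMuPDF. It assumes a PyMuPDF version that supports "import pymupdf", as used in this thread.

```python
def inspect_pdf(path: str) -> dict:
    """Open a PDF and collect the repair flag plus MuPDF warnings."""
    # Third-party import kept local so this sketch loads without PyMuPDF.
    import pymupdf

    # Start from an empty warnings store so older messages do not leak in.
    pymupdf.TOOLS.reset_mupdf_warnings()
    doc = pymupdf.open(path)
    report = {
        "repaired": doc.is_repaired,  # True if MuPDF had to repair the file
        "open_warnings": pymupdf.TOOLS.mupdf_warnings(),  # also clears the store
        "page_warnings": {},
    }
    # Some problems only surface during extraction, so touch every page.
    for page in doc:
        page.get_text()
        messages = pymupdf.TOOLS.mupdf_warnings()
        if messages:
            report["page_warnings"][page.number] = messages
    doc.close()
    return report


def needs_attention(report: dict) -> bool:
    """Return True if the file was repaired or produced any warnings."""
    return bool(
        report["repaired"]
        or report["open_warnings"]
        or report["page_warnings"]
    )
```

A clean report does not prove that the file is sound: rendering errors, or text that extracts without warnings but is still garbled, would not show up here.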

java668 commented 4 months ago

OK, thank you very much!