pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

save and ez_save hang indefinitely on some PDFs #3891

Closed cmyers009 closed 1 month ago

cmyers009 commented 1 month ago

Description of the bug

document.is_dirty hangs indefinitely for some pdfs.

How to reproduce the bug

PyMuPDF 1.24.10, Python 3.11

I had a relatively normal-looking, image-based PDF with some OCR on it.

The PDF does not appear to be corrupted and opens in Acrobat.

pdfinfo returns this response:

Creator:        pdftk-java 3.3.2
Producer:       itext-paulo-155 (itextpdf.sf.net - lowagie.com)
CreationDate:   10/12/22 10:51:07 Central Daylight Time
ModDate:        10/12/22 10:51:07 Central Daylight Time
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          2185
Encrypted:      no
Page size:      792 x 612 pts (letter)
Page rot:       0
File size:      637399752 bytes
Optimized:      no
PDF version:    1.3

We ran document.is_dirty on this PDF. The call normally completes in less than a second, but here it froze for 2 days until we force-quit it. The behavior is reproducible with the same PDF.

Unfortunately, I am unable to provide the source PDF to reproduce this issue. You will just have to take my word on it.

If you need me to run any code on the PDF to help debug, I can do that, but I cannot provide the original PDF.

PyMuPDF version

1.24.10

Operating system

Windows

Python version

3.11

JorjMcKie commented 1 month ago

No, we cannot take your word for it: how could that ever be acted upon? Please provide a reproducer one way or another, e.g. via e-mail.

JorjMcKie commented 1 month ago

On second thought: this property checks whether the PDF has unsaved changes. Is that really what you want to know? Or would is_repaired actually be applicable, or can_save_incrementally?
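For reference, the three attributes named in this comment can be read side by side. This is a minimal sketch, assuming `doc` is an already opened PyMuPDF `Document`; the helper name `save_state` is hypothetical and not part of the library. In current PyMuPDF, `is_dirty` and `is_repaired` are properties, while `can_save_incrementally()` is a method.

```python
def save_state(doc):
    """Collect the save-related flags of an open PyMuPDF Document.

    `save_state` is a hypothetical helper used only for illustration.
    """
    return {
        # True if the PDF has unsaved changes
        "is_dirty": doc.is_dirty,
        # True if the file had to be repaired while opening
        "is_repaired": doc.is_repaired,
        # True if an incremental save is allowed for this document
        "can_save_incrementally": doc.can_save_incrementally(),
    }
```

Usage would look like `save_state(pymupdf.open("input.pdf"))`, where `input.pdf` is a placeholder path.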

cmyers009 commented 1 month ago

The goal is to see whether it can be saved with garbage zero, so I believe it would be 'can_save_incrementally'.

With this PDF: is_repaired is False, can_save_incrementally is True.

My apologies. It turns out that is_dirty does not freeze on this PDF; the freeze happens when you call .ez_save(), or regular .save() for that matter.

cmyers009 commented 1 month ago

We are going to get around this by adding a timeout to doc.save() and throwing an error if it exceeds 2 minutes.
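One possible way to implement such a timeout is sketched below. It is not the poster's actual code. A save that hangs inside native code generally cannot be interrupted from another Python thread, so this sketch runs the save in a separate interpreter via `subprocess.run` and lets the `timeout` argument kill the child. The names `SAVE_SCRIPT`, `run_python_with_timeout` and `save_with_timeout` are hypothetical; the 120-second default mirrors the 2 minutes mentioned above.

```python
import subprocess
import sys

# Script executed in the child interpreter. It opens the source PDF and
# saves it to a new path. Assumes PyMuPDF is installed in that interpreter.
SAVE_SCRIPT = """
import sys
import pymupdf
src, dst = sys.argv[1], sys.argv[2]
doc = pymupdf.open(src)
doc.save(dst)
doc.close()
"""


def run_python_with_timeout(script, args=(), timeout=120.0):
    """Run `script` in a fresh interpreter; raise TimeoutError on overrun."""
    try:
        return subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
            check=True,       # a non-zero exit raises CalledProcessError
        )
    except subprocess.TimeoutExpired as exc:
        raise TimeoutError(f"child did not finish within {timeout} s") from exc


def save_with_timeout(src, dst, timeout=120.0):
    """Save `src` to `dst` in a child process, giving up after `timeout` s."""
    return run_python_with_timeout(SAVE_SCRIPT, [src, dst], timeout)
```

The trade-off is that the document has to be opened again inside the child process, so unsaved in-memory changes in the parent are not carried over.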

JorjMcKie commented 1 month ago

Ah, thanks for the clarification. We are certainly interested in following up on the problem, and I am offering again that you send the file to my e-mail address. If an incremental save is possible, there is little reason not to use it. It is the fastest option in most cases.
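The suggestion to prefer incremental saves could look roughly like this. It is a sketch only: `save_fast` and `fallback_path` are hypothetical names, `doc` is assumed to be a PyMuPDF `Document` opened from a file, and nothing in this thread confirms that an incremental save avoids the hang on the affected PDF.

```python
def save_fast(doc, fallback_path):
    """Save incrementally when allowed, otherwise do a plain full save.

    Returns which path was taken, so the caller can log it.
    """
    if doc.can_save_incrementally():
        # Appends the changes to the original file. Garbage collection
        # is not available in this mode.
        doc.saveIncr()
        return "incremental"
    # Full rewrite to a new file, without garbage collection.
    doc.save(fallback_path, garbage=0)
    return "full"
```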

JorjMcKie commented 1 month ago

Well, you apparently cannot share an example, even under safe conditions. To keep our list of open issues from bloating, I am going to close this for now. If you can share an example file in the future, please feel free to re-open this or submit a new issue.