pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

doc.saveIncr() error and doc.close error #3516

Closed ragebear00 closed 1 month ago

ragebear00 commented 1 month ago

Description of the bug

The PDF cannot be saved incrementally after calling get_text(), and the file stays locked until the Python process exits, even after doc.close().

The error occurs in 1.23.26 and 1.24.4.

The errors are:

Traceback (most recent call last):
  File "C:/_a/test.py", line 11, in <module>
    doc.saveIncr()
  File "C:\Users\x\AppData\Local\Programs\Python\Python310\lib\site-packages\fitz\__init__.py", line 5380, in saveIncr
    return self.save(self.name, incremental=True, encryption=mupdf.PDF_ENCRYPT_KEEP)
  File "C:\Users\x\AppData\Local\Programs\Python\Python310\lib\site-packages\fitz\__init__.py", line 5301, in save
    return extra.Document_save(
  File "C:\Users\x\AppData\Local\Programs\Python\Python310\lib\site-packages\fitz\extra.py", line 120, in Document_save
    return _extra.Document_save(*args)
RuntimeError: code=2: Can't do incremental writes on a repaired file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/_a/test.py", line 15, in <module>
    os.remove(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by anoth

How to reproduce the bug

Run the code below with the uploaded PDF.

import os
import fitz

path = r"C:\_a\issue\1 - Copy.pdf"
doc = fitz.open(path)

for page in doc:
    print(page.get_text())

try:
    doc.saveIncr()
except:
    doc.save(path+'temp.pdf',deflate=True, garbage=3)
    doc.close()
    os.remove(path)
    os.rename(path+'temp.pdf',path)

Attached file: 1 - Copy.pdf

PyMuPDF version

1.24.4

Operating system

Windows

Python version

3.10

JorjMcKie commented 1 month ago

Incremental saves are not always possible: a number of situations will prevent this. Instead of wrapping the save in try / except, you can check doc.can_save_incrementally() and only do an incremental save if it returns True. The following script works flawlessly:

>>> import pymupdf as fitz
>>> doc = fitz.open("1.-.Copy.pdf")
>>> doc.can_save_incrementally()
1
>>> doc.is_repaired
False
>>> for page in doc:
...     _ = page.get_text()
...     print(f"Processed {page.number=}")
...
MuPDF error: format error: object (44 0 R) was not found in its object stream
MuPDF error: format error: object (35 0 R) was not found in its object stream
Processed page.number=0
Processed page.number=1
Processed page.number=2
Processed page.number=3
Processed page.number=4
Processed page.number=5
Processed page.number=6
Processed page.number=7
>>> doc.is_repaired
True
>>> doc.can_save_incrementally()
0
>>> doc.ez_save("temp.pdf")
>>> doc.close()
>>> import os
>>> os.remove(doc.name)

After a failing incremental save, an additional reference to the file handle apparently remains and is not released when the document is closed. This happens on Windows only; there is no problem on Linux at least.

We will look into this. In the meantime, please use doc = None or del doc after closing. This will trigger an additional reference count reduction and the removal will succeed.
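
For readers who want to keep the original try / except flow, here is a minimal sketch of the workaround described above. It is an illustration, not the maintainers' fix. The function name `rewrite_after_reading` and the `opener` parameter are hypothetical; `opener` only exists so the flow can be exercised without a real PDF. The document is opened inside the function so that `del doc` really drops the last reference.

```python
import os


def rewrite_after_reading(path, opener=None):
    """Read all page text from `path`, then save the document back in place.

    Hypothetical helper: mirrors the reporter's script, plus `del doc`
    after closing, as suggested in the comment above.
    """
    if opener is None:
        import pymupdf  # lazy import; `opener` is a hypothetical test hook
        opener = pymupdf.open

    doc = opener(path)
    texts = [page.get_text() for page in doc]

    try:
        doc.saveIncr()
    except Exception:
        # The report shows a RuntimeError here; Exception is caught so the
        # sketch does not depend on the exact exception type.
        tmp_path = path + "temp.pdf"
        doc.save(tmp_path, deflate=True, garbage=3)
        doc.close()
        del doc  # drop the last reference so the file is released on Windows
        os.remove(path)
        os.rename(tmp_path, path)
    else:
        doc.close()
    return texts
```

Called as `rewrite_after_reading(r"C:\_a\issue\1 - Copy.pdf")`, this would follow the same steps as the script in the report. Whether the extra `del doc` is still needed depends on the PyMuPDF version in use.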

ragebear00 commented 1 month ago

Many thanks! As your session shows, doc.can_save_incrementally() returns 1 at the beginning, before get_text(). I just did not realize that get_text() can trigger a "repair", after which doc.can_save_incrementally() returns false. I will check it before every saveIncr() in the future.
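
The check-before-save pattern discussed in this thread could look roughly like the sketch below. The helper name `save_in_place` and the `.tmp` suffix are assumptions made for illustration; `doc` is expected to be an open PyMuPDF document that was opened from `path`. The check is made at save time because, as seen above, the answer can change after pages have been read.

```python
import os


def save_in_place(doc, path):
    """Save `doc` back to `path`; return True if the save was incremental.

    Hypothetical helper, assuming `doc` was opened from `path`.
    """
    # Ask right before saving: reading pages may have triggered a repair.
    if doc.can_save_incrementally():
        doc.saveIncr()
        doc.close()
        return True

    # Fallback: write a full copy, close the document, then swap the files.
    tmp_path = path + ".tmp"  # assumed temporary name
    doc.save(tmp_path, garbage=3, deflate=True)
    doc.close()
    os.replace(tmp_path, path)
    return False
```

A caller would open the file with `doc = pymupdf.open(path)`, do its processing, and finish with `save_in_place(doc, path)`. Because no incremental save is attempted on a repaired file, the failing-save situation described by the maintainer should not arise in this flow.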