pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Memory leak when opening an invalid PDF with no %%EOF in tail #3344

Closed cmyers009 closed 6 months ago

cmyers009 commented 8 months ago

Description of the bug

If the file's bytes are cut off prematurely, fitz opens the PDF with 0 pages but also leaks memory.

How to reproduce the bug

You can reproduce this bug by taking a large PDF file and removing the last 50% of its bytes.

If you repeatedly load files like this, memory leaks even if you call doc.close().
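The truncation step can be sketched as below. This is only an illustration of how such a test file might be produced; `truncate_copy` and its `keep_fraction` parameter are hypothetical names, not part of PyMuPDF or of the original report.

```python
import os
import shutil


def truncate_copy(src_path, dst_path, keep_fraction=0.5):
    """Copy src_path to dst_path, keeping only the leading keep_fraction of its bytes.

    Hypothetical helper for building a truncated test file; the source file
    itself is left untouched.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    shutil.copyfile(src_path, dst_path)
    keep_bytes = int(os.path.getsize(dst_path) * keep_fraction)
    # Cut the copy short so the end of the file, including %%EOF, is lost.
    with open(dst_path, "r+b") as file:
        file.truncate(keep_bytes)
    return keep_bytes
```

Opening the resulting file in a loop, calling doc.close() each time and watching the process memory, would then follow the steps described above.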

You can check whether the file has an %%EOF marker with this code. If you call it before the fitz.open() call, you can return 0 pages without triggering the memory leak.

```python
import os

def has_eof_marker(file_path):
    try:
        with open(file_path, 'rb') as file:
            # Seek to the last 1KB of the file (or to the start if the file is smaller)
            file.seek(0, os.SEEK_END)
            file.seek(max(file.tell() - 1024, 0))
            # Read the last 1KB
            tail = file.read()
            # Check if `%%EOF` is in the last 1KB
            return b'%%EOF' in tail
    except Exception as e:
        print(f"Error reading file: {e}")
        return False
```
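A minimal sketch of how the check might be used as a guard before opening a document. `tail_has_eof` and `safe_page_count` are hypothetical names introduced here for illustration; the only PyMuPDF calls assumed are `fitz.open()`, `doc.page_count` and `doc.close()`.

```python
import os


def tail_has_eof(file_path, tail_size=1024):
    """Return True if b'%%EOF' appears in the last tail_size bytes of the file."""
    with open(file_path, "rb") as file:
        file.seek(0, os.SEEK_END)
        # Clamp to the start so files shorter than tail_size are read whole.
        file.seek(max(file.tell() - tail_size, 0))
        return b"%%EOF" in file.read()


def safe_page_count(file_path):
    """Return the page count, or 0 without opening the file if it looks truncated."""
    if not tail_has_eof(file_path):
        # Truncated file: report 0 pages and never hand it to PyMuPDF.
        return 0
    import fitz  # PyMuPDF, imported lazily so the check itself needs no dependency

    doc = fitz.open(file_path)
    try:
        return doc.page_count
    finally:
        doc.close()
```

This only works around the leak by skipping such files; a file whose tail does contain %%EOF can still be damaged in other ways.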

    PyMuPDF 1.23.4

PyMuPDF version

1.23.8 or earlier

Operating system

Windows

Python version

3.11

julian-smith-artifex-com commented 8 months ago

I cannot reproduce this with the current version, PyMuPDF-1.24.1. What version of PyMuPDF are you using?

cmyers009 commented 8 months ago

I am using 1.23.4

julian-smith-artifex-com commented 8 months ago

There have been quite a few improvements to memory handling since 1.23.4 and so it would be worth retrying with the latest version, 1.24.1.
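One way to confirm which version is installed before retrying is sketched below. It uses only the standard library's `importlib.metadata`; the helper names are hypothetical, and the parser assumes a plain dotted numeric version such as 1.23.4 (a pre-release suffix would need extra handling).

```python
from importlib import metadata


def parse_version(text):
    """Turn a dotted version string such as '1.23.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def installed_pymupdf_version():
    """Return the installed PyMuPDF version string, or None if it is not installed."""
    try:
        return metadata.version("PyMuPDF")
    except metadata.PackageNotFoundError:
        return None


def needs_upgrade(current, minimum="1.24.1"):
    """Return True if current is older than minimum (1.24.1 is the version suggested in this thread)."""
    return parse_version(current) < parse_version(minimum)
```

For example, `needs_upgrade("1.23.4")` is true, which matches the suggestion to retry with 1.24.1.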

cmyers009 commented 8 months ago

Ok, I'll check it out.

julian-smith-artifex-com commented 6 months ago

Closing this because we have been waiting for information for over a month.