pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

page.get_text() causes the process to freeze with certain PDFs on v1.24.2 #3430

Closed dotrunghieu96 closed 1 month ago

dotrunghieu96 commented 2 months ago

Description of the bug

I tried extracting text from the following file: Speculative Investor Behavior in a Stock Market with Heterogeneous Expectations.

Edit - attached the file Harrison & Kreps (1978).pdf

On v1.24.2, calling page.get_text() hangs: the process sits at very high CPU usage and the method never returns, so the script cannot continue.

On https://pymupdf.io/ with version 1.23.5, the text is extracted just fine.

How to reproduce the bug

import fitz  # PyMuPDF

# Download the attached file and save it to local storage
file = "C:/Users/DELL/Downloads/sample_scanned/pdf_1978.pdf"
# Open the file with fitz
doc = fitz.open(file)
# Print the text of each page
for page in doc:
    print(page.get_text())  # <-- process gets stuck here

Tried on Windows, on Linux, and in a python3.11 Docker image, with the same result.
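
Because the hang never returns control to the script, one way to keep a batch job alive while on an affected version is to run the extraction in a child interpreter and give up after a deadline. The following is only a minimal sketch under stated assumptions: the helper names (`run_python_with_timeout`, `extract_text_or_none`) and the 60-second default are hypothetical, and PyMuPDF is assumed to be importable as `fitz` in the child process, as in the snippet above.

```python
import subprocess
import sys

# Script executed in the child interpreter. It writes the text of every page
# to stdout. Assumes PyMuPDF is installed and importable as `fitz`.
EXTRACT_SCRIPT = """
import sys
import fitz

sys.stdout.reconfigure(encoding="utf-8")
doc = fitz.open(sys.argv[1])
for page in doc:
    sys.stdout.write(page.get_text())
"""


def run_python_with_timeout(script, args, timeout_s):
    """Run `script` in a fresh interpreter (hypothetical helper).

    Returns the child's stdout as text, or None if the deadline passes.
    subprocess.run() kills the child when the timeout expires.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            timeout=timeout_s,
            check=True,  # other failures surface as CalledProcessError
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.decode("utf-8")


def extract_text_or_none(pdf_path, timeout_s=60.0):
    """Extract all text from `pdf_path`, or return None if extraction hangs."""
    return run_python_with_timeout(EXTRACT_SCRIPT, [pdf_path], timeout_s)
```

A caller could then treat a `None` result from `extract_text_or_none(file)` as "this file triggers the hang" and skip it or log it. A separate process is used here rather than a thread because a thread stuck inside native code cannot be interrupted from Python.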

PyMuPDF version

1.24.2

Operating system

Linux

Python version

3.11

JorjMcKie commented 2 months ago

Duplicate of #3357.

julian-smith-artifex-com commented 1 month ago

Fixed in 1.24.3.
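
For anyone who wants to confirm that an environment already has the release mentioned above, a small check along these lines may help. It is a sketch, not an official PyMuPDF facility: the helper names are hypothetical, and it reads the installed version from package metadata, assuming the distribution is installed under the name `PyMuPDF`. It only reports whether the installed release is at or after 1.24.3. Note that an older release such as 1.23.5 will report False even though the original report says that version extracted the text fine.

```python
from importlib.metadata import PackageNotFoundError, version

# Release named in this thread as containing the fix.
FIXED_IN = (1, 24, 3)


def parse_version(text):
    """Turn a string such as '1.24.3' into a tuple of three integers.

    Non-numeric suffixes (for example 'rc1') are ignored, and missing
    components are treated as 0.
    """
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def at_or_after_fix(version_text):
    """Return True if `version_text` is 1.24.3 or newer."""
    return parse_version(version_text) >= FIXED_IN


def installed_pymupdf_at_or_after_fix():
    """Check the installed PyMuPDF; return None if it is not installed."""
    try:
        return at_or_after_fix(version("PyMuPDF"))
    except PackageNotFoundError:
        return None
```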