pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

The text information obtained by get_text() is partially missing #3620

Open 1339503169 opened 5 days ago

1339503169 commented 5 days ago

Description of the bug

Attachments: mscbookin.pdf, mscbookin.pdf_0 (image)

While processing the attached file, I found that the text returned by get_text() is missing some data that is present in the original PDF.

The coordinates in the script below are multiplied by 2 because I rendered the image at double scale.
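The scaling step described above can be sketched in isolation. This is only an illustration: `scale_bbox` is a hypothetical helper (not part of PyMuPDF), and it assumes the image was rendered at a uniform zoom of 2, for example via something like `page.get_pixmap(matrix=fitz.Matrix(2, 2))`. It multiplies before rounding, so fractional point coordinates are not truncated ahead of the scaling.

```python
from typing import Sequence, Tuple


def scale_bbox(bbox: Sequence[float], zoom: float = 2.0) -> Tuple[int, int, int, int]:
    """Map a bbox given in PDF points to pixel coordinates of an image rendered at `zoom`.

    Hypothetical helper for illustration; assumes the same zoom on both axes.
    """
    x0, y0, x1, y1 = bbox
    # Scale first, then round, so the fractional part still contributes.
    return (round(x0 * zoom), round(y0 * zoom), round(x1 * zoom), round(y1 * zoom))


if __name__ == "__main__":
    # A made-up bbox, not taken from the attached PDF.
    print(scale_bbox((10.6, 20.2, 100.5, 50.0)))
```

With a block dictionary from `page.get_text('dict')`, this would be called as `scale_bbox(item['bbox'])` before drawing the rectangle.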

How to reproduce the bug

import cv2
import fitz  # PyMuPDF

file_path = 'data/mscbookin.pdf'
png_path = 'data/mscbookin.pdf_0.png'  # page 0, rendered at 2x scale

pdf = fitz.open(file_path)
page = pdf.load_page(0)
image = cv2.imread(png_path)

# Extract the page's blocks in dict form, without clipping
blocks = page.get_text(option='dict', clip=fitz.INFINITE_IRECT())['blocks']

# Draw each block's bounding box on the rendered image
for item in blocks:
    # Scale from PDF points to image pixels (2x) before converting to int
    x1, y1, x2, y2 = [int(v * 2) for v in item['bbox']]
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

cv2.imshow('Image with Rectangle', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.8

JorjMcKie commented 5 days ago

There is a difference in the behavior of the base library. I am going to transfer this report to MuPDF's issue tracker and report the tracking number here.

JorjMcKie commented 5 days ago

Test outputs: mutool-12311.txt mutool-12404.txt

MuPDF issue number: https://bugs.ghostscript.com/show_bug.cgi?id=707843