pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

The position box obtained through the get_text() method is inaccurate #3600

Open 1339503169 opened 1 week ago

1339503169 commented 1 week ago

Description of the bug

While processing a PDF with a readable (extractable) text layer, I found that the word positions returned by PyMuPDF's get_text("words") method deviate significantly from the actual positions on the page.

How to reproduce the bug

8989fa66-9bff-4f0c-9f05-37c8a393207e pdf_0 image

The coordinates are multiplied by two because I rendered the image at twice the page size.

import cv2
import fitz  # PyMuPDF

doc = fitz.open("data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf")
page = doc.load_page(0)
words = page.get_text("words")

# Page image rendered beforehand at twice the page size.
image = cv2.imread("data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf_0.png")

for word in words:
    x0, y0, x1, y1, text, block_no, line_no, word_no = word
    # Scale first, then convert to int, so the fractional part is not lost before doubling.
    x0, y0, x1, y1 = [int(v * 2) for v in (x0, y0, x1, y1)]
    cv2.rectangle(image, (x0, y0), (x1, y1), (255, 0, 0), 2)

cv2.imshow("demo", image)
cv2.waitKey(0)

8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf
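
A side note on the scaling step: when word boxes are mapped onto a raster image, the order of scaling and rounding matters, although it cannot explain a large deviation like the one reported here. The sketch below uses a hypothetical helper (`scale_bbox` is not part of PyMuPDF) that scales first and then rounds outward. It assumes the image was rendered at exactly `zoom` times the page size, and the sample word tuple is made up for illustration.

```python
import math


def scale_bbox(bbox, zoom=2.0):
    """Map a PDF-space box (x0, y0, x1, y1) to integer pixel coordinates.

    Scales first, then rounds outward (floor the top-left corner, ceil the
    bottom-right corner) so the pixel box never cuts into the word.
    """
    x0, y0, x1, y1 = bbox
    return (
        math.floor(x0 * zoom),
        math.floor(y0 * zoom),
        math.ceil(x1 * zoom),
        math.ceil(y1 * zoom),
    )


# Made-up tuple in the layout returned by page.get_text("words"):
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
word = (72.4, 100.6, 110.2, 112.9, "Hello", 0, 0, 0)

print(scale_bbox(word[:4]))                 # (144, 201, 221, 226)
print(tuple(int(v) * 2 for v in word[:4]))  # (144, 200, 220, 224), truncated before scaling
```

Truncating before scaling can shift an edge by up to `zoom` pixels, which is visible on small text but far smaller than a font-related bbox error.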

Why does this happen, and how can I obtain the correct position information?

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.8

JorjMcKie commented 1 week ago

This seems to be a problem of the fonts embedded in this file. Currently investigating. The first finding is that MuPDF itself computes the coordinates in the same way.

JorjMcKie commented 1 week ago

Solution: Use pymupdf.TOOLS.set_small_glyph_heights(True) right after the import / before any search or extraction. This will force PyMuPDF to recompute the character bboxes. When marking the words based on this, you will get correct results:

pymupdf.TOOLS.set_small_glyph_heights(True)
words = page.get_text("words")

Result: image

JorjMcKie commented 1 week ago

MuPDF issue reference: https://bugs.ghostscript.com/show_bug.cgi?id=707833

1339503169 commented 1 week ago

Thank you, this does work, but I still have questions. Is this a universal solution? Will it affect other files that are already extracted correctly, and can I use this setting as the default configuration for extraction?

JorjMcKie commented 1 week ago

Universal? No, but it is always available. Whether the fonts in your PDF are actually broken is still under investigation by the MuPDF team; see the MuPDF issue linked above.
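
Because the switch is global to the process, one way to keep it from influencing other files is to enable it only around the extraction that needs it and restore the previous value afterwards. This is a minimal sketch, not an official PyMuPDF feature. It assumes that calling `set_small_glyph_heights()` with no argument returns the current setting, as the PyMuPDF `Tools` documentation describes. The helper name `small_glyph_heights` is hypothetical, and its `tools` argument is expected to be `pymupdf.TOOLS`.

```python
from contextlib import contextmanager


@contextmanager
def small_glyph_heights(tools, on=True):
    """Temporarily set the small-glyph-heights switch, then restore it.

    `tools` is expected to be pymupdf.TOOLS (assumption: a call without
    an argument only reports the current setting).
    """
    previous = tools.set_small_glyph_heights()  # no argument: query only
    tools.set_small_glyph_heights(on)
    try:
        yield
    finally:
        # Restore even if the extraction inside the block raises.
        tools.set_small_glyph_heights(previous)


# Intended use (requires PyMuPDF):
#     import pymupdf
#     with small_glyph_heights(pymupdf.TOOLS):
#         words = page.get_text("words")
```

Passing the tools object in, rather than importing PyMuPDF inside the helper, keeps the sketch independent of whether the package is imported as `pymupdf` or `fitz`.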