Open 1339503169 opened 1 week ago
This seems to be a problem with the fonts embedded in this file. I am currently investigating. A first finding is that MuPDF itself computes the coordinates in the same way.
Solution:
Use pymupdf.TOOLS.set_small_glyph_heights(True)
right after the import / before any search or extraction.
This will force PyMuPDF to recompute the character bboxes. When marking the words based on this, you will get correct results:
pymupdf.TOOLS.set_small_glyph_heights(True)
words = page.get_text("words")
Result:
MuPDF issue reference: https://bugs.ghostscript.com/show_bug.cgi?id=707833
Thank you, this does work, but I still have questions. Is this a universal solution? Will it affect other files that are already extracted correctly, and can I use this setting as the default configuration for extraction?
Universal? No - but it is always available. Whether or not the fonts in your PDF are broken is currently under investigation by the MuPDF team - see above link.
Description of the bug
While processing a file, which is a readable (text-based) PDF, I found a significant deviation between the word positions returned by PyMuPDF's get_text("words") method and the actual positions of the words on the page.
How to reproduce the bug
The coordinates are multiplied by two because I rendered the image at twice the page size.
import fitz  # PyMuPDF
import cv2
doc = fitz.open("data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf")
page = doc.load_page(0)
words = page.get_text("words")
image = cv2.imread('data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf_0.png')
for word in words:
    x0, y0, x1, y1, text, block_no, line_no, word_no = word
    x0, y0, x1, y1 = [int(i) * 2 for i in [x0, y0, x1, y1]]
    cv2.rectangle(image, (x0, y0), (x1, y1), (255, 0, 0), 2)
cv2.imshow('demo', image)
cv2.waitKey(0)
8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf
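A side note on the reproduction script above: `int(i) * 2` truncates each coordinate before scaling, which can move a box edge by a pixel or two. That is too small to explain the deviation reported here, but it is easy to rule out. The sketch below scales first and rounds afterwards. The helper name `scale_bbox` and the sample word tuple are made up for illustration.

```python
def scale_bbox(word, zoom=2.0):
    """Map a get_text("words") entry to integer pixel corners.

    Scaling happens before rounding, so the fractional part of the
    PDF coordinates is not thrown away first.
    """
    x0, y0, x1, y1 = word[:4]
    top_left = (round(x0 * zoom), round(y0 * zoom))
    bottom_right = (round(x1 * zoom), round(y1 * zoom))
    return top_left, bottom_right


# Hypothetical word tuple in the layout returned by page.get_text("words"):
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
word = (10.6, 20.7, 50.9, 32.4, "demo", 0, 0, 0)
top_left, bottom_right = scale_bbox(word)
print(top_left, bottom_right)
```

The two returned corners can be passed to `cv2.rectangle` in place of `(x0, y0)` and `(x1, y1)`.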
Why does this happen, and how can I obtain the correct location information?
PyMuPDF version
1.24.5
Operating system
Windows
Python version
3.8