pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

OCR Coordinates do not match #4068

Closed xiaolibuzai-ovo closed 1 week ago

xiaolibuzai-ovo commented 1 week ago

Description of the bug

I used another OCR tool to recognize the coordinates of the content in the PDF, and then I used the PyMuPDF library to extract the coordinates of a specified area. There is a significant difference between the two sets of coordinates.

These are the coordinates recognized by the other OCR:

{ "text": "Vue Mastery", "bbox": [ 586.0, 178.0, 1250.0, 296.0 ], "type": "ocr", "score": 1 }

These are the coordinates for the corresponding position in PyMuPDF:

(88.85449981689453, 23.943227767944336, 117.37201690673828, 44.796356201171875, 'Vue', 0, 0, 0),
(121.81803131103516, 23.943227767944336, 183.36544799804688, 44.796356201171875, 'Mastery', 0, 0, 1)
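The two results are not directly comparable as quoted: PyMuPDF's `page.get_text("words")` returns one tuple per word in the layout (x0, y0, x1, y1, word, block_no, line_no, word_no), while the other tool reports a single box for the whole phrase. A minimal sketch of merging the word boxes into one phrase box, using the values quoted above. The helper name `union_bbox` is hypothetical and not part of PyMuPDF.

```python
# Word tuples as quoted in the post, in the layout returned by
# page.get_text("words"): (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = [
    (88.85449981689453, 23.943227767944336,
     117.37201690673828, 44.796356201171875, "Vue", 0, 0, 0),
    (121.81803131103516, 23.943227767944336,
     183.36544799804688, 44.796356201171875, "Mastery", 0, 0, 1),
]


def union_bbox(word_tuples):
    """Return the (x0, y0, x1, y1) box enclosing all word tuples.

    Hypothetical helper for illustration only.
    """
    x0 = min(w[0] for w in word_tuples)
    y0 = min(w[1] for w in word_tuples)
    x1 = max(w[2] for w in word_tuples)
    y1 = max(w[3] for w in word_tuples)
    return (x0, y0, x1, y1)


phrase = " ".join(w[4] for w in words)
phrase_bbox = union_bbox(words)
print(phrase, [round(v, 1) for v in phrase_bbox])
```

The merged box is in PDF points, so it can be compared with the other tool's box only once both use the same units.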

This is the PDF file: Nuxtjs-Cheat-Sheet.pdf

How to reproduce the bug

see above

Hope to be answered

PyMuPDF version

1.24.14

Operating system

MacOS

Python version

3.10

JorjMcKie commented 1 week ago

Look at this code:

image

Looks like the PyMuPDF coordinates are correct. That mysterious "other OCR" tool provides coordinates outside the dimensions of an A4 page: x-values should not exceed 596, but we see a value of 1250.
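The plausibility check described here can be sketched in a few lines. This assumes an A4 page of roughly 595 x 842 points; with PyMuPDF the actual size would come from `page.rect`. The function name `fits_on_page` is hypothetical.

```python
# Assumed page size in points (A4). With PyMuPDF, read the real values
# from page.rect.width and page.rect.height instead.
A4_WIDTH_PT = 595.0
A4_HEIGHT_PT = 842.0


def fits_on_page(bbox, width=A4_WIDTH_PT, height=A4_HEIGHT_PT):
    """Return True if bbox = (x0, y0, x1, y1) lies entirely on the page."""
    x0, y0, x1, y1 = bbox
    return 0 <= x0 <= x1 <= width and 0 <= y0 <= y1 <= height


# Box reported by the other OCR tool (from the post).
ocr_bbox = (586.0, 178.0, 1250.0, 296.0)
# Box spanning "Vue" and "Mastery" as reported by PyMuPDF (rounded).
pymupdf_bbox = (88.85, 23.94, 183.37, 44.80)

print(fits_on_page(ocr_bbox))      # False: x1 = 1250 exceeds the page width
print(fits_on_page(pymupdf_bbox))  # True
```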

I also do not understand why we talk about OCR at all: the text can be extracted with no problem, and none of the 42 images covers the page.

xiaolibuzai-ovo commented 1 week ago

Thank you for your reply. The issue I am currently facing is that processing the PDF with PyMuPDF gives inaccurate content recognition. For example, I used the translation script from https://github.com/pymupdf/PyMuPDF-Utilities/blob/tutorials/tutorials/language-translation/translator.py, but the restored content differs significantly from the original. Here is my PDF: Nuxtjs-Cheat-Sheet.pdf. Result:

WeCom screenshot_2c1fd6cc-dc5e-4671-b16c-7f4580a39388

I can accurately select the positions using another OCR, so my idea is to have the OCR find the positions, and then translate and write back by extracting the matrix content.
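If the other tool ran on a rendered image of the page, its boxes are presumably in pixels of that image rather than in PDF points, and they would need rescaling before being used with PyMuPDF. A minimal sketch under that assumption: the image size below is a made-up placeholder (the thread does not state it), and `pixels_to_points` is a hypothetical helper.

```python
def pixels_to_points(bbox, image_size, page_size):
    """Rescale bbox = (x0, y0, x1, y1) from image pixels to page points.

    image_size is (width_px, height_px) of the image the OCR tool saw;
    page_size is (width_pt, height_pt) of the PDF page.
    Assumes the image shows the whole page, unrotated and uncropped.
    """
    img_w, img_h = image_size
    page_w, page_h = page_size
    sx = page_w / img_w  # points per pixel, horizontally
    sy = page_h / img_h  # points per pixel, vertically
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


# Placeholder numbers: an A4 page (595 x 842 pt) rendered at 4x zoom,
# which gives a 2380 x 3368 px image. Not taken from the thread.
rect = pixels_to_points(
    (586.0, 178.0, 1250.0, 296.0),
    image_size=(2380, 3368),
    page_size=(595.0, 842.0),
)
print(rect)  # (146.5, 44.5, 312.5, 74.0)
```

The numbers only illustrate the arithmetic; the real image size has to come from the OCR tool. A rectangle converted this way could then be passed to `page.get_text("words", clip=rect)` to read the words PyMuPDF finds in that area.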

other OCR result: vue pdf_0

JorjMcKie commented 1 week ago

I still don't understand why we even talk about OCR. PyMuPDF can detect all text natively, without any problem and with top precision: image

I think we are talking past each other:

Language translation of a given document maybe?

JorjMcKie commented 1 week ago

I am transferring this post to "Discussions" as we are clearly not dealing with a bug.