出现pdf完全识别错误的情况

opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool, supports PDF/webpage/e-book extraction.一站式开源高质量数据提取工具，支持PDF/网页/多格式电子书提取。

https://opendatalab.com/OpenSourceTools

GNU Affero General Public License v3.0

11.11k stars 825 forks source link

出现pdf完全识别错误的情况 #251

Open JustDoIt166 opened 1 month ago

JustDoIt166 commented 1 month ago

Description of the bug | 错误描述

屏幕截图 2024-07-31 005335 试了两个pdf，第一个有大量数学公式的pdf识别效果很不错，而且比marker快很多。但第二个完全识别错误屏幕截图 2024-07-31 005326

How to reproduce the bug | 如何复现

这是使用的pdf potracelib_origin.pdf layout.pdf

Operating system | 操作系统

Windows

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cpu

myhloli commented 1 month ago

There's an issue with the conversion between landscape and portrait versions of this PDF. When previewed, it appears in landscape orientation, but when fed into the model for recognition within the program, it turns into portrait orientation. This causes a misalignment in layout direction, affecting the final output of results. We appreciate you providing the sample, and we will promptly identify the problem area and fix it going forward.

myhloli commented 1 month ago

@JustDoIt166 A temporary workaround involves adding parameters to the command line

--method ocr

Forcing the use of OCR mode for parsing can temporarily resolve the issue of disordered vertical and horizontal layout. A comprehensive fix, however, will require some additional time.

JustDoIt166 commented 1 month ago

@JustDoIt166 A temporary workaround involves adding parameters to the command line
--method ocr
Forcing the use of OCR mode for parsing can temporarily resolve the issue of disordered vertical and horizontal layout. A comprehensive fix, however, will require some additional time.

是的，这个pdf 的rotation 为90,导致pymupdf识别文本框的x,y坐标颠倒了。可以通过页面大小计算出正确的xy坐标，但似乎有点麻烦，不知道这个问题是否常见

myhloli commented 1 month ago

@JustDoIt166 A temporary workaround involves adding parameters to the command line
--method ocr
Forcing the use of OCR mode for parsing can temporarily resolve the issue of disordered vertical and horizontal layout. A comprehensive fix, however, will require some additional time.
是的，这个pdf 的rotation 为90,导致pymupdf识别文本框的x,y坐标颠倒了。可以通过页面大小计算出正确的xy坐标，但似乎有点麻烦，不知道这个问题是否常见

确实不太常见，但是如果在拿文本的坐标之前拿到rotation 参数，这个坐标应该比较好纠正，后面我可以尝试修正一下。