opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://mineru.readthedocs.io/
GNU Affero General Public License v3.0
13.8k stars 1.03k forks

Please preserve placeholder characters such as underscores #638

Closed. jeremyWangJun03 closed this issue 1 month ago

jeremyWangJun03 commented 1 month ago

Description of the bug

Original:

image

Recognized output:

image

I would like these placeholder characters to be preserved in the output.

How to reproduce the bug

Any PDF that contains __ (underscores) reproduces the problem.

Operating system

macOS

Python version

3.12

Software version (magic-pdf --version)

0.7.x

Device mode

cpu

myhloli commented 1 month ago

Underscores in scanned documents are very hard to recognize because of limitations in the OCR functionality.
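
For anyone who needs the blanks back in the meantime, one possible workaround is to look for the underlines in the page image itself rather than in the OCR text. The sketch below is not part of MinerU or magic-pdf. It assumes you already have a binarized bitmap of the scanned page (1 = dark pixel, 0 = background), and the function name `find_underline_runs` and the `min_len` threshold are hypothetical, chosen only for illustration. It reports long horizontal runs of dark pixels, which is roughly what a fill-in-the-blank underline looks like in a scan.

```python
from typing import List, Tuple


def find_underline_runs(
    bitmap: List[List[int]], min_len: int = 4
) -> List[Tuple[int, int, int]]:
    """Return (row, start_col, end_col) for each horizontal run of dark
    pixels (value 1) that is at least min_len pixels long.

    end_col is inclusive. This is a toy heuristic, not an OCR component.
    """
    runs = []
    for r, row in enumerate(bitmap):
        start = None
        # Append a background pixel as a sentinel so that a run touching
        # the right edge of the row is still closed and measured.
        for c, px in enumerate(row + [0]):
            if px and start is None:
                start = c  # a new dark run begins here
            elif not px and start is not None:
                if c - start >= min_len:
                    runs.append((r, start, c - 1))
                start = None  # run ended, long enough or not
    return runs


# Toy 3x10 "page": the first two rows hold short strokes (like parts of
# letters), the last row holds a long horizontal stroke (like an underline).
page = [
    [0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
]
underlines = find_underline_runs(page)
print(underlines)
```

On a real scan the threshold would have to be tuned to the resolution, and the detected runs would still need to be mapped back to positions in the recognized text, so treat this as a starting point only.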