opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.19k stars 835 forks

Point form always failed #481

Open michaelthwan opened 2 weeks ago

michaelthwan commented 2 weeks ago

Description of the bug

Point-form (bulleted) lists in the PDF are not recognized as proper bullet lists in the Markdown output.

image

How to reproduce the bug

Sample PDF (one page): transformer_p006.pdf

I am using auto mode: magic-pdf -p XXX.pdf

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

michaelthwan commented 2 weeks ago

Span: image

I think this happens because the red boxes are separate, so the bullet marker and its text are each registered as a different span.
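
To illustrate the idea, here is a minimal sketch of one possible post-processing step that reattaches a lone bullet-marker span to the text span next to it. This is not MinerU's code or data model: the `Span` class, the `BULLET_GLYPHS` set, and the `same_line` and `merge_bullets` helpers are all hypothetical names, and the sketch assumes spans arrive in reading order with `(x0, y0, x1, y1)` bounding boxes.

```python
from dataclasses import dataclass

# Hypothetical span record for illustration; NOT MinerU's actual data model.
@dataclass
class Span:
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates

# Assumed set of glyphs that count as bullet markers.
BULLET_GLYPHS = {"•", "·", "-", "*", "◦", "▪"}

def same_line(a, b, min_overlap=0.5):
    """True if the spans overlap vertically by at least min_overlap of the shorter one."""
    overlap = min(a.bbox[3], b.bbox[3]) - max(a.bbox[1], b.bbox[1])
    shorter = min(a.bbox[3] - a.bbox[1], b.bbox[3] - b.bbox[1])
    return shorter > 0 and overlap / shorter >= min_overlap

def merge_bullets(spans):
    """Join each lone bullet span with the next span on the same line
    and emit it as a Markdown list item; other spans pass through as text."""
    lines = []
    i = 0
    while i < len(spans):
        cur = spans[i]
        nxt = spans[i + 1] if i + 1 < len(spans) else None
        if cur.text.strip() in BULLET_GLYPHS and nxt is not None and same_line(cur, nxt):
            lines.append("- " + nxt.text.strip())
            i += 2  # consume both the marker and its text
        else:
            lines.append(cur.text.strip())
            i += 1
    return lines

# Made-up coordinates that mimic a marker box followed by a text box.
spans = [
    Span("•", (10, 100, 14, 110)),
    Span("First item", (20, 99, 120, 111)),
    Span("•", (10, 120, 14, 130)),
    Span("Second item", (20, 119, 120, 131)),
    Span("Plain paragraph", (10, 150, 200, 162)),
]
print("\n".join(merge_bullets(spans)))
```

With these made-up boxes, the two marker spans are folded into `- First item` and `- Second item`, while the last span stays a plain line. Whether the real spans in the sample PDF look like this is a guess based on the screenshot.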