opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.19k stars 835 forks

text line missing #492

Closed drunkpig closed 2 weeks ago

drunkpig commented 2 weeks ago

Description of the bug


Some text lines are missing from the extracted output.

bb.pdf

How to reproduce the bug

Run the magic-pdf CLI on the attached PDF.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

myhloli commented 2 weeks ago

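The text blocks detected by the model overlap. During later processing, the overlap-avoidance rule shrank both boxes by some amount, which left the upper block too narrow, so the text spans could not subsequently be filled into the narrowed block.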
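
To make the failure mode in this comment concrete, here is a minimal sketch of how shrinking two overlapping blocks can cause a span to be dropped. It is not MinerU's actual code: the function names (`shrink_vertical_overlap`, `coverage`, `fill_spans`), the even split of the overlap, the 0.8 coverage threshold, and all coordinates are assumptions chosen only for illustration.

```python
from typing import List, Optional, Tuple

# (x0, y0, x1, y1) with y growing downward, as in most PDF layout tooling.
BBox = Tuple[float, float, float, float]


def shrink_vertical_overlap(upper: BBox, lower: BBox) -> Tuple[BBox, BBox]:
    """Hypothetical avoidance rule: split a vertical overlap evenly between both boxes."""
    overlap = upper[3] - lower[1]
    if overlap <= 0:
        return upper, lower  # boxes do not overlap vertically
    half = overlap / 2
    new_upper = (upper[0], upper[1], upper[2], upper[3] - half)
    new_lower = (lower[0], lower[1] + half, lower[2], lower[3])
    return new_upper, new_lower


def coverage(span: BBox, block: BBox) -> float:
    """Fraction of the span's area that lies inside the block."""
    inter_w = max(0.0, min(span[2], block[2]) - max(span[0], block[0]))
    inter_h = max(0.0, min(span[3], block[3]) - max(span[1], block[1]))
    area = (span[2] - span[0]) * (span[3] - span[1])
    return (inter_w * inter_h) / area if area > 0 else 0.0


def fill_spans(
    spans: List[BBox], blocks: List[BBox], threshold: float = 0.8
) -> List[Optional[int]]:
    """Assign each span to the first block that covers it well enough, else None."""
    owners: List[Optional[int]] = []
    for span in spans:
        owner = None
        for index, block in enumerate(blocks):
            if coverage(span, block) >= threshold:  # assumed threshold
                owner = index
                break
        owners.append(owner)
    return owners


# A one-line block on top whose bottom edge overlaps a taller block below it.
upper_block: BBox = (50.0, 100.0, 500.0, 120.0)
lower_block: BBox = (50.0, 104.0, 500.0, 300.0)
spans: List[BBox] = [
    (52.0, 101.0, 480.0, 119.0),  # text line belonging to the upper block
    (52.0, 125.0, 480.0, 143.0),  # text line belonging to the lower block
]

before = fill_spans(spans, [upper_block, lower_block])
after = fill_spans(spans, list(shrink_vertical_overlap(upper_block, lower_block)))
print(before)  # [0, 1]
print(after)   # [None, 1]
```

In this sketch the upper block loses 8 units of height, so its only span is covered by roughly 61% and falls below the assumed threshold, which matches the symptom of a missing text line.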