opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.25k stars 841 forks

Cannot remove line numbers from line-numbered PDFs #382

Closed CrabTY closed 1 month ago

CrabTY commented 1 month ago

Description of the bug

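The PDF has line numbers in its left margin. In the processed output, the line numbers are mixed into the body text. I would like the line numbers to be removed.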

How to reproduce the bug

Use a PDF with line numbers, such as https://media.neurips.cc/Conferences/NeurIPS2023/Styles/neurips_2023.pdf

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda

drunkpig commented 1 month ago

@CrabTY PDF content extraction is objective: it reproduces what is on the page. In this example, you will need to handle the line numbers yourself.
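For anyone handling the line numbers themselves, one possible approach is to post-process the extracted text. The sketch below is not part of MinerU or magic-pdf. It assumes that each margin number ends up at the start of an output line (or alone on its own line) and that the numbers increase in small steps. The helper name `strip_line_numbers` and the `max_gap` parameter are hypothetical. It is a heuristic: a numbered heading such as "1 Introduction" can be stripped by mistake when it happens to continue the run.

```python
import re

# A leading 1-4 digit token, either alone on the line or followed by text.
_LEADING_NUMBER = re.compile(r"^\s*(\d{1,4})(?:\s+(.*))?$")


def strip_line_numbers(text: str, max_gap: int = 5) -> str:
    """Remove leading numbers that form an increasing run (heuristic)."""
    last = 0  # last number accepted as a margin line number
    cleaned = []
    for line in text.splitlines():
        match = _LEADING_NUMBER.match(line)
        if match:
            number = int(match.group(1))
            # Accept only numbers that continue the run, so a sentence
            # starting with a value such as "2023" is left untouched.
            if last < number <= last + max_gap:
                last = number
                rest = match.group(2)
                if rest:  # keep the body text that followed the number
                    cleaned.append(rest)
                continue  # a line holding only the number is dropped
        cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "1 Abstract text begins here\n2 and continues.\n3\n4 Final line"
    print(strip_line_numbers(sample))
```

If the numbers are interleaved differently in your output (for example, several numbers grouped on one line), the pattern would need adjusting. Cropping the left margin of each page before extraction may be a more robust alternative.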