Line numbers cannot be removed from PDFs that have line numbers

opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.

https://opendatalab.com/OpenSourceTools

GNU Affero General Public License v3.0

11.25k stars 841 forks

Closed. CrabTY closed this issue 1 month ago

CrabTY commented 1 month ago

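The PDF has line numbers in its left margin, and in the extraction output these line numbers are mixed in with the body text. I would like the line numbers to be removed.

Use a PDF with line numbers, for example https://media.neurips.cc/Conferences/NeurIPS2023/Styles/neurips_2023.pdf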

Operating system: Linux

Python version: 3.10

Software version: 0.6.x

Device mode: cuda

drunkpig commented 1 month ago

@CrabTY PDF extraction reproduces what is objectively present in the file, so the line numbers are extracted along with the body text; for this example, you will need to handle the line numbers yourself.
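One way to handle the line numbers yourself is to post-process the extracted text. The sketch below is only an illustration and does not use any MinerU API: `strip_line_numbers` and its `min_run` parameter are hypothetical names, and it assumes each margin number ends up at the start of its own text line (or alone on a line) and that the numbers increase by one from line to line. If the numbers are interleaved differently in your output, the heuristic would need to be adapted.

```python
import re

# A leading integer (1 to 4 digits) followed by whitespace or end of line.
_LEADING_NUMBER = re.compile(r"^\s*(\d{1,4})(?:\s+|$)")


def strip_line_numbers(text: str, min_run: int = 3) -> str:
    """Remove leading numbers that form a consecutive run (n, n+1, n+2, ...).

    Requiring a run of at least `min_run` lines avoids deleting ordinary
    numbers that merely happen to start a line, such as a year.
    """
    lines = text.splitlines()

    # Leading integer of each line, or None if the line does not start with one.
    numbers = []
    for line in lines:
        match = _LEADING_NUMBER.match(line)
        numbers.append(int(match.group(1)) if match else None)

    # Mark every line that belongs to a long enough consecutive run.
    strip = [False] * len(lines)
    i = 0
    while i < len(lines):
        j = i
        while (
            j + 1 < len(lines)
            and numbers[j] is not None
            and numbers[j + 1] == numbers[j] + 1
        ):
            j += 1
        if numbers[i] is not None and j - i + 1 >= min_run:
            for k in range(i, j + 1):
                strip[k] = True
        i = j + 1

    cleaned = []
    for line, flagged in zip(lines, strip):
        if flagged:
            line = _LEADING_NUMBER.sub("", line, count=1)
            if not line:
                continue  # the line contained nothing but a line number
        cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "1 Abstract\n2 We propose a method\n3 that works well.\n2023 was a good year"
    print(strip_line_numbers(sample))
```

Raising `min_run` makes the filter more conservative, at the cost of leaving short numbered passages untouched.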