opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
13.33k stars 994 forks

Superscript and subscript information in text is lost #219

Closed yiyibooks closed 3 months ago

yiyibooks commented 3 months ago

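Description of the bug

The author section of a paper usually contains many superscript numbers, as in the screenshot below: image

The Markdown text that MinerU produces is shown below; the superscript information has been lost.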

Aryo Pradipta Gema 1 Joshua Ong Jun Leang 1 Giwon $\mathbf{H o n g^{1}}$ Alessio Devoto 2 Alberto Carlo Maria MancinoRohit Saxena1Xuanli$\mathbf{H}\mathbf{e}^{4}$Yu Zhao1Xiaotang Du1Mohammad Reza Ghasemi Madani 5 Claire Barale 1 Robert McHardy 6 Joshua Harris 7 Jean Kaddour 4 Emile van Krieken 1 Pasquale Minervini 1

1 University of Edinburgh 2 Sapienza University of Rome 3 Polytechnic University of Bari 4 University College London 5 University of Trento 6 AssemblyAI 7 UK Health Security Agency {first.last, jong2, p.minervini}@ed.ac.uk alessio.devoto@uniroma1.it alberto.mancino@poliba.it mr.ghasemimadani@unitn.it joshua.harris@ukhsa.gov.uk {xuanli.he, jean.kaddour.20, robert.mchardy.20}@ucl.ac.uk

How to reproduce the bug

Example paper: https://arxiv.org/pdf/2406.04127. Essentially every paper has this problem.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda

myhloli commented 3 months ago

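Superscripts and subscripts are removed on purpose, because they are treated as something that disrupts the flow of the surrounding text.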

drunkpig commented 3 months ago

@yiyibooks The superscript citations in the paper were removed deliberately, because we considered that superscripts hurt readability.

yiyibooks commented 3 months ago

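Thanks @myhloli @drunkpig!

Could you add an option to preserve this text metadata, along with other information such as bold and italics? It is very useful when the Markdown is rendered for people to read.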

drunkpig commented 3 months ago

@yiyibooks We cannot extract information such as text color and bold formatting from scanned PDFs, but we can obtain this information from text-based PDFs. This work deviates somewhat from our current main focus, so we will not be supporting the development of this feature in the near future.
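For text-based PDFs, the idea can be sketched roughly as follows. This is not MinerU code, only a minimal illustration under stated assumptions: the span dictionaries are assumed to look like the ones PyMuPDF returns from `page.get_text("dict")`, where the `flags` field marks superscript (bit 0), italic (bit 1), and bold (bit 4). The function names and the `keep_formatting` option are hypothetical.

```python
# Flag bits as documented for PyMuPDF text spans (assumption: spans come from
# page.get_text("dict"); MinerU's own data structures may differ).
SUPERSCRIPT = 1 << 0
ITALIC = 1 << 1
BOLD = 1 << 4


def span_to_markdown(span: dict) -> str:
    """Render one span as Markdown, keeping superscript, bold, and italic."""
    text = span["text"]
    core = text.strip()
    if not core:
        return text  # whitespace-only span: nothing to decorate

    # Keep surrounding whitespace outside the markers so emphasis still parses.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]

    flags = span.get("flags", 0)
    if flags & SUPERSCRIPT:
        core = f"<sup>{core}</sup>"
    if flags & BOLD and flags & ITALIC:
        core = f"***{core}***"
    elif flags & BOLD:
        core = f"**{core}**"
    elif flags & ITALIC:
        core = f"*{core}*"
    return lead + core + trail


def spans_to_markdown(spans: list, keep_formatting: bool = True) -> str:
    """Join spans; keep_formatting is a hypothetical opt-in switch."""
    if not keep_formatting:
        return "".join(s["text"] for s in spans)
    return "".join(span_to_markdown(s) for s in spans)


# Hand-written spans for illustration, not extracted from a real PDF.
spans = [
    {"text": "Giwon Hong", "flags": 0},
    {"text": "1", "flags": SUPERSCRIPT},
    {"text": " Alessio Devoto", "flags": 0},
    {"text": "2", "flags": SUPERSCRIPT},
]
print(spans_to_markdown(spans))
# Giwon Hong<sup>1</sup> Alessio Devoto<sup>2</sup>
```

Scanned PDFs carry no such flags, so an approach like this would only apply when the PDF has a real text layer.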

yiyibooks commented 3 months ago

Got it. Looking forward to seeing this implemented soon ~