Closed: yiyibooks closed this 3 months ago
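Superscripts and subscripts are removed deliberately, because they are treated as something that disrupts the flow of the surrounding text.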
@yiyibooks The superscript citations in the paper were removed deliberately, on the assumption that superscripts hurt readability.
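Thanks, @myhloli @drunkpig!
Could you add an option to preserve this text metadata, along with other formatting such as bold and italics? This information is very useful when the markdown is rendered for people to read.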
@yiyibooks We cannot extract information such as text color and bold formatting from scanned PDFs, but we can obtain this information from text-based PDFs. This work deviates somewhat from our current main focus, so we will not be supporting the development of this feature in the near future.
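For text-based PDFs, the kind of option requested above could be sketched roughly as follows. This is not MinerU code: `span_to_markdown` and `spans_to_markdown` are hypothetical helpers, and the flag bits are assumed to follow the per-span `flags` convention that PyMuPDF documents (1 = superscript, 2 = italic, 16 = bold). The sample spans are made up for illustration.

```python
# Hypothetical sketch: turn styled text spans into markdown that keeps
# superscript, bold, and italic information instead of dropping it.
# Assumption: flag bits follow PyMuPDF's documented span "flags" layout.
SUPERSCRIPT = 1
ITALIC = 2
BOLD = 16


def span_to_markdown(text: str, flags: int) -> str:
    """Wrap one span in markup according to its style flags."""
    core = text.strip()
    if not core:
        return text  # whitespace-only spans pass through unchanged
    # Keep surrounding whitespace outside the markers so emphasis renders.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    if flags & BOLD and flags & ITALIC:
        core = f"***{core}***"
    elif flags & BOLD:
        core = f"**{core}**"
    elif flags & ITALIC:
        core = f"*{core}*"
    if flags & SUPERSCRIPT:
        core = f"<sup>{core}</sup>"
    return lead + core + trail


def spans_to_markdown(spans) -> str:
    """Join a sequence of {'text': ..., 'flags': ...} spans into one line."""
    return "".join(span_to_markdown(s["text"], s["flags"]) for s in spans)


# Illustrative input only; real spans would come from a PDF text extractor.
sample = [
    {"text": "Giwon Hong", "flags": BOLD},
    {"text": "1", "flags": SUPERSCRIPT},
    {"text": " Alessio Devoto", "flags": 0},
    {"text": "2", "flags": SUPERSCRIPT},
]
print(spans_to_markdown(sample))
```

As noted in the reply above, scanned PDFs carry no such flags, so an approach like this would only apply to text-based PDFs.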
Got it. Looking forward to seeing this implemented soon ~
Description of the bug
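The author information section of a paper usually contains many superscript numbers, as in the attached screenshot.
The markdown text that MinerU produces is shown below; the superscript information is lost.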
Aryo Pradipta Gema 1 Joshua Ong Jun Leang 1 Giwon $\mathbf{H o n g^{1}}$ Alessio Devoto 2 Alberto Carlo Maria MancinoRohit Saxena1Xuanli$\mathbf{H}\mathbf{e}^{4}$Yu Zhao1Xiaotang Du1Mohammad Reza Ghasemi Madani 5 Claire Barale 1 Robert McHardy 6 Joshua Harris 7 Jean Kaddour 4 Emile van Krieken 1 Pasquale Minervini 1
1 University of Edinburgh 2 Sapienza University of Rome 3 Polytechnic University of Bari 4 University College London 5 University of Trento 6 AssemblyAI 7 UK Health Security Agency {first.last, jong2, p.minervini}@ed.ac.uk alessio.devoto@uniroma1.it alberto.mancino@poliba.it mr.ghasemimadani@unitn.it joshua.harris@ukhsa.gov.uk {xuanli.he, jean.kaddour.20, robert.mchardy.20}@ucl.ac.uk
How to reproduce the bug
Example paper: https://arxiv.org/pdf/2406.04127. Essentially every paper has this problem.
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.6.x
Device mode
cuda