opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://mineru.readthedocs.io/
GNU Affero General Public License v3.0
16.59k stars, 1.2k forks

Inline formulas contain content that KaTeX cannot parse, and the same content is sometimes parsed correctly and sometimes incorrectly #385

Open hahhforest opened 3 months ago

hahhforest commented 3 months ago

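Description of the bug

When parsing a PDF document that contains inline formulas, rendering the resulting .md file frequently fails with "ParseError: KaTeX parse error".

Take the document in the screenshot below as an example (the complete PDF file is uploaded under "How to reproduce the bug").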

Clipboard_Screenshot_1723196862

Parsing result:

WiedurchBirger[4]gefundenwurdeunddurcheigeneArbeiten aufdemGebietderAlkyl  $\cdot\pmb{\sigma}$  Vanadium-Systemebestatigtwurde [3],befindet sich derBereich der  $\pmb{\nu}(\mathbf{M}^{\ddot{\mathbf{u}}}\!-\!\dot{\mathbf{C}})$  vonUbergangsmetall- Alkyl-Verbindungenzwischenetwa  $400\;\mathrm{cm}^{-1}$  und  $\dot{600}\;\mathrm{cm}^{-1}$  ObigeErgebnissesind somitoffensichtlichsozuinterpretieren,daB imS pek tr umA died rei starkenbissehr starke nBa nden bei  $\mathrm{432\;cm^{-1}}$   $\mathbf{484\;cm^{-1}}$  und  ${\bf568\;cm^{-1}}$  den  ${\pmb v}(\mathrm{Fe}\!-\!\mathrm{C})$  des Bis(butandiyl)- ferrat(II)-Systems,im S pek tr umB died reise hr starkenBanden bei  $480~\mathrm{cm^{-1}}$   ${\bf540~cm^{-1}}$  und  ${\bf592\cm^{-1}}$  den  ${\pmb v}(\mathrm{Fe}\!-\!0)$  desdurch LufteinwirkungausIIIentstandenenOxidationsprodukteszu- zuordnen sind.

For the cm^-1 unit, the parser produced the following different results:

  1. ‘$400\;\mathrm{cm}^{-1}$’
  2. ‘$\mathrm{432\;cm^{-1}}$’
  3. ‘$\mathbf{484\;cm^{-1}}$’
  4. ‘${\bf568\;cm^{-1}}$’
  5. ‘$480~\mathrm{cm^{-1}}$’
  6. ‘${\bf592\cm^{-1}}$’

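Of these, the result ‘${\bf592\cm^{-1}}$’ fails to render with the error: "ParseError: KaTeX parse error: Undefined control sequence: \cm at position 8: {\bf592\̲c̲m̲^{-1}}". When the same document is parsed with Mathpix, by contrast, the format is uniformly '$600 \mathrm{~cm}^{-1}$'.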
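As a possible workaround on the user side, the six variants listed above could be rewritten into the single Mathpix-style form after parsing. The following is only a minimal sketch: `normalize_wavenumber` is a hypothetical helper, not part of MinerU, and it assumes the formula is nothing more than an integer followed by cm^{-1}. It also discards bold styling such as `\bf` and `\mathbf`.

```python
import re

# Wrappers and spacing seen in the variants above; removing them leaves "<digits>cm^-1".
_NOISE = re.compile(r"\\(?:mathrm|mathbf|bf|;|,|!)|[{}~\s]")
# The optional backslash also covers the broken "\cm" output.
_WAVENUMBER = re.compile(r"(\d+)\\?cm\^-1")


def normalize_wavenumber(formula: str) -> str:
    """Rewrite an inline formula such as '${\\bf592\\cm^{-1}}$' to a uniform form.

    Formulas that are not a plain '<integer> cm^{-1}' are returned unchanged.
    """
    body = formula.strip("$")
    match = _WAVENUMBER.fullmatch(_NOISE.sub("", body))
    if match is None:
        return formula
    return f"${match.group(1)} \\mathrm{{~cm}}^{{-1}}$"


if __name__ == "__main__":
    for variant in (
        r"$400\;\mathrm{cm}^{-1}$",
        r"$\mathrm{432\;cm^{-1}}$",
        r"$\mathbf{484\;cm^{-1}}$",
        r"${\bf568\;cm^{-1}}$",
        r"$480~\mathrm{cm^{-1}}$",
        r"${\bf592\cm^{-1}}$",
    ):
        print(normalize_wavenumber(variant))
```

Formulas with anything else inside, such as `$\dot{600}\;\mathrm{cm}^{-1}$` from the same paragraph, are deliberately left untouched rather than guessed at.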

In addition, the parsing result for this document shows that some spaces between words were not recognized.

Clipboard_Screenshot_1723197459

How to reproduce the bug

File: origin.pdf. Software version: pip install -e .[full]

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda

hahhforest commented 3 months ago

I have also run into many other kinds of "ParseError: KaTeX parse error:..." and would like to know what might be causing them.
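One way to narrow down errors of the "Undefined control sequence" kind is to scan the generated Markdown for inline formulas and list every command that is not on an allow-list. The sketch below is illustrative only: `find_unknown_commands` and the `KNOWN` set are hypothetical, and `KNOWN` is a tiny hand-picked sample rather than KaTeX's actual list of supported commands. It handles only single-dollar inline math, not `$$...$$` blocks.

```python
import re

INLINE_MATH = re.compile(r"\$(.+?)\$")      # single-dollar inline formulas
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)")  # commands such as \mathrm or \cm

# Hypothetical allow-list; extend it from the renderer's own documentation.
KNOWN = {"mathrm", "mathbf", "bf", "pmb", "cdot", "sigma", "nu", "ddot", "dot"}


def find_unknown_commands(markdown: str, known=KNOWN):
    """Return (formula, [unknown command names]) for each suspicious inline formula."""
    report = []
    for match in INLINE_MATH.finditer(markdown):
        unknown = [
            name
            for name in CONTROL_WORD.findall(match.group(1))
            if name not in known
        ]
        if unknown:
            report.append((match.group(0), unknown))
    return report


if __name__ == "__main__":
    sample = r"bei $480~\mathrm{cm^{-1}}$ und ${\bf592\cm^{-1}}$ den"
    for formula, names in find_unknown_commands(sample):
        print(formula, names)
```

This only flags unknown command names. Other KaTeX parse errors, such as unbalanced braces, would need a separate check.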

drunkpig commented 3 months ago

@hahhforest There are several LaTeX rendering engines, and you are currently using KaTeX. We will make sure the LaTeX for mathematical formulas is correct. However, we cannot yet guarantee that the output is uniformly formatted in the syntax of one particular LaTeX rendering engine.

xiabo0816 commented 3 months ago

I really like the table recognition feature! But I have also run into the same KaTeX problem.

LymanY commented 1 week ago

+1