run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Formulas are not parsed correctly #81

Open Starforged opened 8 months ago

Starforged commented 8 months ago

When formulas are parsed, some characters such as the square root sign √ are deleted. Characters that should be subscripted (ₐ) or superscripted (²) are not positioned correctly.

The input: X4gv1P60KMGXes1n

This is the raw text from the PDF (copy and paste): [H3O +] Ka 2

The output: [H3O+] = – K2a + K4a2 + Ka ca

The expected output: [H₃O⁺] = -Kₐ/2 + √{Kₐ²/4+Kₐcₐ}

I wrote the expected output by hand. It contains only Unicode characters and no markdown or formatting information. For example: √ (U+221A) and ₐ (U+2090).
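For anyone who wants to check which code points a piece of parsed text actually contains, here is a minimal sketch using only Python's standard-library `unicodedata` module. The helper name `describe` is made up for illustration and is not part of LlamaParse.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# Inspect the special characters used in the hand-written expected output.
for ch, code_point, name in describe("√ₐ²"):
    print(ch, code_point, name)
```

Running the same helper on the parser's output makes it easy to see whether a character was dropped entirely or replaced by its plain ASCII counterpart.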

Starforged commented 8 months ago

chemiebuch_165.pdf Here is the relevant PDF page, for reproducing the error.

hexapode commented 8 months ago

Thanks for sharing the document.

It seems that we still have a font issue with math characters. We will take a look at it.

Starforged commented 8 months ago

@hexapode Hello, thanks for taking a look. I did some more research on how this problem should be solved properly, and apparently it is usually handled with OCR.

The expected output that I previously made manually is wrong in the sense that it uses nonstandard "math characters" and is not the generally agreed upon method of representing formulas.

I was able to find a project using pix2tex and PyTorch by @lukas-blecher that does the conversion correctly. https://github.com/lukas-blecher/LaTeX-OCR (Available under MIT license)

Another exciting development is that 5 days ago @chaodreaming published a LaTeX formula recognition model trained on 110 million datasets on their repo! https://github.com/chaodreaming/Simple-LaTeX-OCR (Available under Apache-2.0 license)

I would recommend implementing this OCR-based conversion to LaTeX format for formulas in LlamaParse.
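For reference, the formula from the original report would look roughly like this in LaTeX. This is transcribed by hand from the expected output given earlier in the thread, not produced by either OCR model, and the use of `\mathrm` for the chemical species is just one possible convention:

```latex
[\mathrm{H_3O^+}] = -\frac{K_a}{2} + \sqrt{\frac{K_a^{2}}{4} + K_a c_a}
```

Unlike the plain-text version, this form keeps the subscripts, the exponent, and the extent of the square root unambiguous.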

An example conversion from LaTeX-OCR:

The input: grafik

The output: S=\int_{x}\left\{\frac{1}{2}\sum_{a}\partial^{\mu}\chi_{a}\partial_{\mu}\chi_{a}+V(\rho)\right\},