Open Starforged opened 8 months ago
chemiebuch_165.pdf Here is the relevant PDF page, for reproducing the error.
Thanks for sharing the document.
It seems that we still have a font issue with math characters. We will take a look at it.
@hexapode Hello, thanks for taking a look. I did some more research on how this problem should be solved correctly, and apparently it is usually solved with OCR.
The expected output that I previously made manually is wrong in the sense that it uses nonstandard "math characters", which is not the generally agreed-upon method of representing formulas.
I was able to find a project using pix2tex and PyTorch by @lukas-blecher that does the conversion correctly. https://github.com/lukas-blecher/LaTeX-OCR (Available under MIT license)
Another exciting development is that 5 days ago @chaodreaming published a LaTeX formula recognition model trained on 110 million datasets on their repo! https://github.com/chaodreaming/Simple-LaTeX-OCR (Available under Apache-2.0 license)
I would recommend implementing this OCR-based conversion to LaTeX format for formulas in LlamaParse.
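To make the suggestion concrete, here is a minimal sketch of what such a conversion step could look like with LaTeX-OCR. It assumes the `pix2tex` and `Pillow` packages are installed and that the formula has already been cropped out of the page as an image. The `LatexOCR` class is the Python entry point shown in the LaTeX-OCR README; the function names `image_to_latex` and `as_display_math` and the `$$ ... $$` wrapping are my own assumptions and not part of LlamaParse.

```python
def image_to_latex(image_path):
    """Run LaTeX-OCR on one cropped formula image and return the predicted LaTeX."""
    # Imported lazily so this module still loads where the optional
    # packages (pix2tex, Pillow) are not installed.
    from PIL import Image
    from pix2tex.cli import LatexOCR

    model = LatexOCR()  # loads the pretrained model
    with Image.open(image_path) as img:
        return model(img)


def as_display_math(latex):
    """Wrap a LaTeX string in $$ ... $$ (assumed convention for Markdown output)."""
    return "$$" + latex.strip() + "$$"
```

Usage would be something like `as_display_math(image_to_latex("formula.png"))`, where `formula.png` is a hypothetical crop of the formula region. Detecting which regions of a page are formulas is a separate problem that this sketch does not cover.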
An example conversion from LaTeX-OCR:
The input:
The output: S=\int_{x}\left\{\frac{1}{2}\sum_{a}\partial^{\mu}\chi_{a}\partial_{\mu}\chi_{a}+V(\rho)\right\},
When formulas are parsed, some characters, such as the square root sign √, are deleted. Characters that should be lowered (subscripts such as ₐ) or raised (superscripts such as ²) are not positioned correctly.
The input:
This is the raw text from the PDF (copy and paste): [H3O +] Ka 2
The output: [H3O+] = – K2a + K4a2 + Ka ca
The expected output: [H₃O⁺] = -Kₐ/2 + √{Kₐ²/4+Kₐcₐ}
The expected output was made manually by me and contains only Unicode characters and no markdown or formatting information. For example: √ (U+221A) and ₐ (U+2090)
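For anyone who wants to check which special characters the expected output relies on, here is a small sketch using only Python's standard `unicodedata` module. The helper name `describe_non_ascii` is made up for illustration; the string is the expected output from this issue.

```python
import unicodedata

# The manually written expected output from this issue.
expected = "[H₃O⁺] = -Kₐ/2 + √{Kₐ²/4+Kₐcₐ}"


def describe_non_ascii(text):
    """Return (char, 'U+XXXX', Unicode name) for each distinct non-ASCII
    character in text, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in seen]


for ch, code_point, name in describe_non_ascii(expected):
    print(ch, code_point, name)
```

Unicode has subscript forms for only a handful of letters and no way to put a whole expression under a root sign, which is one more reason to prefer LaTeX here. In LaTeX the same formula could be written as [\mathrm{H_3O^+}] = -\frac{K_a}{2} + \sqrt{\frac{K_a^2}{4} + K_a c_a}.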