run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License
2.41k stars, 245 forks

Text converted to curves ignored completely in the output #296

Open invakid404 opened 1 month ago

invakid404 commented 1 month ago

While evaluating whether llama_parse would work for our use case, I noticed that it appeared to ignore a large portion of the text in my test document.

When I opened the document in LibreOffice Draw, I saw that the ignored text was in the form of Bézier curves rather than regular text.

Here is a minimal reproducible example: exhibit_a.pdf and exhibit_b.pdf

exhibit_a.pdf contains two regular text frames. exhibit_b.pdf contains one text frame converted to Bézier curves and one regular text frame.

Here is the output for exhibit_a.pdf: image

Here is the output for exhibit_b.pdf: image
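For context on why outlined text can go missing: in a PDF content stream, regular text is drawn with text-showing operators (`Tj`, `TJ`, `'`, `"`), while text converted to curves is drawn with path operators such as `c`, `v`, and `y`, so there are no characters left to read without OCR. Below is a rough sketch of that difference, not a description of how llama_parse works internally. The function name `count_operators` and both sample streams are hypothetical (they are not taken from the exhibits), and the tokenizer is deliberately naive: it assumes the stream is already decompressed and ignores nested parentheses, comments, and inline images.

```python
import re
from collections import Counter

# Text-showing operators defined by the PDF specification.
TEXT_OPS = {"Tj", "TJ", "'", '"'}
# Path-construction operators that append Bézier curve segments.
CURVE_OPS = {"c", "v", "y"}

# Literal strings "(...)" (nesting not handled) and hex strings "<...>".
_STRINGS = re.compile(r"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>")


def count_operators(content: str) -> Counter:
    """Count text-showing and curve operators in a decoded content stream."""
    # Blank out string operands so their contents are never read as operators.
    stripped = _STRINGS.sub(" ", content)
    # Treat array brackets as separators so "[...]TJ" still yields a "TJ" token.
    stripped = re.sub(r"[\[\]]", " ", stripped)
    counts = Counter(text=0, curves=0)
    for token in stripped.split():
        if token in TEXT_OPS:
            counts["text"] += 1
        elif token in CURVE_OPS:
            counts["curves"] += 1
    return counts


# Hypothetical stream for a regular text frame: one text-showing operator.
regular = "BT /F1 12 Tf 72 720 Td (Hello c world) Tj ET"
# Hypothetical stream for an outlined glyph: only path operators, no text.
outlined = "72 720 m 80 730 90 730 100 720 c 110 710 120 710 130 720 c h f"

print(count_operators(regular))
print(count_operators(outlined))
```

If a page in exhibit_b.pdf shows many curve operators and few or no text operators for the affected frame, that would be consistent with the text having been outlined, in which case only an OCR-style pass could recover it.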

BinaryBrain commented 1 month ago

Thanks for providing examples. We'll have a look at it!