run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

LlamaParse messes up ordering of two-column PDFs #146

Open qniksefat opened 6 months ago

qniksefat commented 6 months ago

Hey,

I'm having a hard time parsing PDF files that have two vertical columns of text. LlamaParse sometimes captures the right reading order, but often does not. I'm parsing into markdown.

For example, it takes one sentence from the left column and then one from the right column, so the two columns end up interleaved. It does not break in the middle of a sentence, though.

Thanks!
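To make the failure mode above concrete, here is a minimal sketch of why sorting text blocks strictly top-to-bottom interleaves two columns, and how splitting on a column boundary first restores the reading order. This is not LlamaParse's implementation; the `Block` type, the function names, and the "split at half the page width" rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block with the position of its top-left corner."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (smaller means higher on the page)


def naive_order(blocks):
    """Sort top-to-bottom, then left-to-right. Interleaves the columns."""
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def column_order(blocks, page_width):
    """Read the whole left column first, then the whole right column.

    Assumes exactly two columns separated at the middle of the page.
    """
    mid = page_width / 2
    left = [b for b in blocks if b.x0 < mid]
    right = [b for b in blocks if b.x0 >= mid]
    return sorted(left, key=lambda b: b.y0) + sorted(right, key=lambda b: b.y0)


blocks = [
    Block("L1", 50, 100),
    Block("R1", 320, 100),
    Block("L2", 50, 120),
    Block("R2", 320, 120),
]

naive = [b.text for b in naive_order(blocks)]
fixed = [b.text for b in column_order(blocks, page_width=600)]
print(naive)  # ['L1', 'R1', 'L2', 'R2']
print(fixed)  # ['L1', 'L2', 'R1', 'R2']
```

Real pages also contain full-width titles, figures, and footnotes, so a fixed midpoint split is only a starting point.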

ah3243 commented 6 months ago

Yep, me too. Parsing academic documents is really unreliable with any parser currently. If you're also working with academic documents, many conferences publish an HTML version as well, which is straightforward to use as input instead.
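As a rough sketch of the HTML route suggested above, the snippet below pulls plain text out of an HTML string in document order using only Python's standard library. The `TextExtractor` class, the `html_to_text` helper, and the sample markup are hypothetical; a real conference page would need its own handling for navigation, references, and math.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text in document order, skipping script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><body><h1>Title</h1><p>First paragraph.</p>"
    "<script>var x = 1;</script><p>Second paragraph.</p></body></html>"
)
print(html_to_text(sample))
```

Because HTML stores text in reading order rather than by page position, the column-ordering problem does not arise here.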

PowerOwner commented 6 months ago

10-K 2023, 09.30.2023-2023-11-02-08-16.json

10-K 2023, 09.30.2023-2023-11-02-08-16.pdf


There is a company whose tool can handle multiple vertical columns of text, and it works particularly well on tables. It is also very fast: 100 pages are processed in 5 seconds or less.

nam-ruto commented 1 month ago

@PowerOwner what is this tool, btw?