run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Poor performance on scanned PDFs with improperly rotated content #327

Open invakid404 opened 1 month ago

invakid404 commented 1 month ago

Describe the bug: When parsing a scanned PDF whose content is improperly rotated, LlamaParse consistently gets details such as numbers wrong, for example by swapping some digits for others or repeating digits.

The problem is consistently reproducible with caching disabled, and rotating the PDF to the correct orientation noticeably improves the accuracy of the output.

I built a reproduction by generating 50 random numbers, putting them in a PDF, converting it to an image, and then comparing the output when the PDF is correctly oriented with the output when it is not. In the first scenario, all numbers in the output are correct. In the second scenario, the output contains the correct count of numbers, but some of them are incorrect.
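The comparison step of a reproduction like this can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the attached files: the helper names (`generate_numbers`, `extract_numbers`, `compare`), the 8-digit width, and the fixed seed are all assumptions, and rendering the PDF and calling LlamaParse are left out. The `parsed_text` argument stands for whatever text the parser returned.

```python
import random
import re


def generate_numbers(count=50, digits=8, seed=0):
    """Return `count` random fixed-width digit strings (seeded so runs repeat)."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice("0123456789") for _ in range(digits))
        for _ in range(count)
    ]


def extract_numbers(parsed_text):
    """Pull every run of digits out of the parser's text output."""
    return re.findall(r"\d+", parsed_text)


def compare(expected, parsed_text):
    """Compare expected numbers with the parsed ones, position by position."""
    found = extract_numbers(parsed_text)
    mismatches = [
        (index, want, got)
        for index, (want, got) in enumerate(zip(expected, found))
        if want != got
    ]
    return {
        "expected_count": len(expected),
        "found_count": len(found),
        "mismatches": mismatches,
    }


if __name__ == "__main__":
    numbers = generate_numbers()
    # Simulated parser output (hypothetical): the real text would come from
    # parsing the normal or the rotated PDF.
    simulated_output = "\n".join(numbers)
    print(compare(numbers, simulated_output))
```

Running `compare` once on the output for the correctly oriented PDF and once on the output for the rotated one gives a per-position list of wrong values, which makes swapped or repeated digits easy to spot.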

Files: numbers_normal.pdf, numbers_rotated.pdf, the original numbers

Client: Please remove untested options:

Options: Multimodal with Claude 3.5 Sonnet

Additional context: I saw #32 before opening this issue, but my case seemed different enough not to be a duplicate. I also have specific reproduction steps that I thought were worth sharing.

hexapode commented 1 month ago

Thanks for reporting, we need to improve on rotated content.