run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

The reader self creates data out of thin air? #420

Open kinshukkaura opened 1 month ago

kinshukkaura commented 1 month ago

Describe the bug The reader creates data out of thin air for multiple pages. It produces tables containing information that does not appear anywhere on the corresponding PDF page.

Files ppfas-mf-factsheet-for-August-2024.pdf

Job ID 7f63cc55-1a75-450d-aea0-3a6aa3c648ba

Screenshots (two images, not reproduced here)

Client:

Options Using the accurate method, with all other fields default/empty.


hexapode commented 1 month ago

I had a look at your job, and you used our default mode (Accurate). This document works well with premium mode (see the attached markdown; the one exception is a chart misclassified as an image). ppfas-mf-factsheet-for-August-2024.pdf.md

However, premium mode is more expensive because it involves more compute. Alternatively, you can try our fast mode, which lays out the text in a way an LLM can understand but does not extract the tables.
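For readers following along, here is a minimal sketch of how switching between the three modes might look from the Python client. It assumes the `llama_parse` package exposes `premium_mode` and `fast_mode` keyword arguments on `LlamaParse` and reads the API key from the `LLAMA_CLOUD_API_KEY` environment variable; check the current client documentation before relying on these names. `mode_kwargs` and `parse_pdf` are hypothetical helpers written for this illustration, not part of the library.

```python
def mode_kwargs(mode: str) -> dict:
    """Map a mode name from this thread to LlamaParse constructor flags.

    Hypothetical helper. The flag names are assumptions about the client API.
    """
    modes = {
        "accurate": {},                     # default mode: no extra flag
        "premium": {"premium_mode": True},  # more compute, higher cost
        "fast": {"fast_mode": True},        # text layout only, no table extraction
    }
    if mode not in modes:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(modes)}")
    return dict(modes[mode])


def parse_pdf(path: str, mode: str = "premium", result_type: str = "markdown"):
    """Parse one file with the chosen mode (requires llama_parse and an API key)."""
    # Imported lazily so mode_kwargs can be used without the package installed.
    from llama_parse import LlamaParse

    parser = LlamaParse(result_type=result_type, **mode_kwargs(mode))
    return parser.load_data(path)
```

Calling `parse_pdf("ppfas-mf-factsheet-for-August-2024.pdf", mode="premium")` would then resubmit the file from this issue in premium mode, assuming the flags above match the installed client version.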

kinshukkaura commented 1 month ago

Thanks. I believe I will have to use premium mode. Is there any reason why the model hallucinates in the default mode (Accurate) and not in the other modes? Could adjusting the parsing instructions (prompt) help in any way?