run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Complex Pages Sometimes Lost When GPT-4o Enabled #242

Open adreichert opened 1 week ago

adreichert commented 1 week ago

Summary

We are investigating using Llama Parse with GPT-4o enabled. We observed that occasionally a non-blank page's contents were replaced by the string NO_CONTENT_HERE in the final result. Note that pages were not missing. If we sent in 10 pages, 10 were returned, but some would have been treated as if they were blank.

After some testing, we observed that this tended to occur with more complex pages such as those with large tables. One can recreate the issue with a page dense with random text. See below.
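For anyone who wants a similar stress page, here is a minimal sketch that generates dense filler text. How desk.pdf was actually produced is not stated in this issue, so the function name `dense_random_text` and its parameters are hypothetical, and turning the text into a PDF page is left to whatever tool you already use.

```python
import random
import string


def dense_random_text(lines: int = 60, words_per_line: int = 14, seed: int = 0) -> str:
    """Return `lines` lines of random lowercase 'words' to fill a page.

    Hypothetical helper for illustration; not part of llama_parse.
    """
    rng = random.Random(seed)  # seeded so the same page can be regenerated
    rows = []
    for _ in range(lines):
        words = [
            "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(3, 9)))
            for _ in range(words_per_line)
        ]
        rows.append(" ".join(words))
    return "\n".join(rows)


if __name__ == "__main__":
    # Paste or render this output onto a single page before parsing it.
    print(dense_random_text())
```

Raising `lines` and `words_per_line` makes the page denser, which may make the failure easier to trigger.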

There are two issues here:

Example

I parsed desk.pdf with the web UI.

[Screenshot: 2024-06-20 at 8:00:12 PM]

I downloaded the images of each page, and confirmed that the text was present.

Side Note

One can see a similar loss of information in your own examples. Please refer to this notebook. Click this link to see the slides. Note that slide 9 is missing from the output: CodeImage is the end of slide 8, and Data Ingestion is the top of slide 10.

print(str(response))

Codelmage
---
NO_CONTENT_HERE
---
|Data Ingestion / Parsing|Data Querying|
|---|---|
|Chunk| |
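A minimal sketch for spotting this automatically, assuming the parsed output is one string in which pages are separated by a line containing only `---`, as in the excerpt above. The helper name `find_lost_pages` is hypothetical, and the separator is inferred from this output rather than taken from the library's documentation.

```python
SENTINEL = "NO_CONTENT_HERE"  # placeholder string reported in this issue


def find_lost_pages(markdown: str, separator: str = "\n---\n") -> list[int]:
    """Return 1-based positions of chunks that consist only of the sentinel.

    Assumes pages are joined with `separator`. Table rule rows such as
    |---|---| do not match it because they start with a pipe character.
    """
    pages = markdown.split(separator)
    return [i for i, page in enumerate(pages, start=1) if page.strip() == SENTINEL]


if __name__ == "__main__":
    excerpt = (
        "Codelmage\n---\nNO_CONTENT_HERE\n---\n"
        "|Data Ingestion / Parsing|Data Querying|\n|---|---|\n|Chunk| |"
    )
    print(find_lost_pages(excerpt))
```

The positions are relative to the string passed in, so they match real page numbers only when the whole document output is checked at once.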

hexapode commented 1 week ago

In this case, NO_CONTENT_HERE means that GPT-4o was not able to detect the text on the page. We will add your document to our test cases and try to make it work.