Summary

We are investigating using Llama Parse with GPT-4o enabled. We observed that the contents of a non-blank page were occasionally replaced by the string NO_CONTENT_HERE in the final result. The pages themselves were not missing: if we sent in 10 pages, 10 were returned, but some were treated as if they were blank.

After some testing, we observed that this tended to occur with more complex pages, such as those with large tables. The issue can be reproduced with a page dense with random text; see the example below.

There are two issues here:

1. Regardless of the cause, we need a way to know when this has occurred. As it stands, pages of information are silently lost. The text NO_CONTENT_HERE is also used for blank pages, so without going through each PDF by hand we cannot tell whether a page was blank or failed to parse. A simple fix would be to emit a distinct marker, such as PARSE_FAILED, for failed pages.
2. There seems to be a limit on how much content per page Llama Parse can handle.
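Until a distinct failure marker exists, a client-side check can at least list which pages came back as the placeholder so they can be reviewed by hand. The following is a minimal sketch, not part of Llama Parse: it assumes the per-page results are already available as a list of strings (however they were obtained), and the function name `find_sentinel_pages` is hypothetical.

```python
from typing import List

# Placeholder string observed in the parse results for blank (or lost) pages.
SENTINEL = "NO_CONTENT_HERE"


def find_sentinel_pages(page_texts: List[str], sentinel: str = SENTINEL) -> List[int]:
    """Return the 1-based numbers of pages whose entire content is the sentinel.

    `page_texts` is assumed to hold one string per page, in page order.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if text.strip() == sentinel
    ]


if __name__ == "__main__":
    # Made-up page contents, for illustration only.
    pages = ["Intro text", "NO_CONTENT_HERE", "|a|b|", " NO_CONTENT_HERE\n"]
    print(find_sentinel_pages(pages))  # prints [2, 4]
```

This only narrows down which pages to inspect. It cannot tell a truly blank page from a failed one, which is the gap described in the first point.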

Example

I parsed desk.pdf with the web UI.

[Image: Screenshot 2024-06-20 at 8 00 12 PM]

I downloaded the images of each page, and confirmed that the text was present.
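The manual check above could be partly automated by comparing each placeholder page against an independent text extraction of the same PDF. The sketch below assumes two lists of per-page strings: `parsed` from Llama Parse and `reference` from some other extractor (for example, a PDF text-extraction library). The function name, labels, and `min_chars` threshold are all hypothetical choices for illustration. Pages without a text layer, such as scans, would yield an empty reference and be labelled blank, so this is a heuristic rather than a definitive test.

```python
from typing import Dict, List

# Placeholder string observed in the parse results.
SENTINEL = "NO_CONTENT_HERE"


def classify_pages(
    parsed: List[str], reference: List[str], min_chars: int = 20
) -> Dict[int, str]:
    """Label each page as 'parsed', 'blank', or 'suspected_failure'.

    A page is a suspected failure when the parser returned the sentinel
    but the independent extraction found at least `min_chars` characters.
    """
    if len(parsed) != len(reference):
        raise ValueError("parsed and reference must have the same page count")

    labels: Dict[int, str] = {}
    for number, (parsed_text, ref_text) in enumerate(zip(parsed, reference), start=1):
        if parsed_text.strip() != SENTINEL:
            labels[number] = "parsed"
        elif len(ref_text.strip()) >= min_chars:
            labels[number] = "suspected_failure"
        else:
            labels[number] = "blank"
    return labels


if __name__ == "__main__":
    # Made-up inputs, for illustration only.
    parsed = ["Some text", "NO_CONTENT_HERE", "NO_CONTENT_HERE"]
    reference = ["Some text", "x" * 50, ""]
    print(classify_pages(parsed, reference))
```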

Side Note

A similar loss of information can be seen in your own examples. Please refer to this notebook, and click this link to see the slides. Note that slide 9 is missing: CodeImage is the end of slide 8 and Data Ingestion is the top of slide 10.

print(str(response))

Codelmage
---
NO_CONTENT_HERE
---
|Data Ingestion / Parsing|Data Querying|
|---|---|
|Chunk| |

run-llama / llama_parse

Complex Pages Sometimes Lost When GPT-4o Enabled #242