We are investigating using Llama Parse with GPT-4o enabled. We observed that occasionally a non-blank page's contents were replaced by the string NO_CONTENT_HERE in the final result. Note that pages were not missing. If we sent in 10 pages, 10 were returned, but some would have been treated as if they were blank.
After some testing, we observed that this tended to occur with more complex pages such as those with large tables. One can recreate the issue with a page dense with random text. See below.
There are two issues here:
1. Regardless of cause, we need a way to know when this has occurred. Pages of information are silently lost. The text NO_CONTENT_HERE is also used for blank pages, so without going through each PDF by hand, we cannot tell whether a page was blank or failed to parse. A simple fix would be to return a distinct marker, such as PARSE_FAILED, for pages that failed.
2. There seems to be a limit on the amount of content per page that Llama Parse can handle.
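Until a distinct failure marker exists, one possible workaround for the first issue is to cross-check the parser output against an independent text extraction of the same pages. The sketch below is only an illustration under stated assumptions: it assumes the per-page results are already available as a list of strings, that a failed page consists solely of the NO_CONTENT_HERE marker, and that a second extractor of your choice supplies the reference text. The function names and the label values are hypothetical and not part of Llama Parse.

```python
from typing import List, Optional

SENTINEL = "NO_CONTENT_HERE"  # marker string reported in this issue


def classify_pages(parsed: List[str],
                   reference: Optional[List[str]] = None) -> List[str]:
    """Label each page as 'ok', 'blank', 'suspect', or 'unknown'.

    parsed:    per-page text from the parser (assumed input shape).
    reference: per-page text from an independent extractor, if available.
    """
    if reference is not None and len(reference) != len(parsed):
        raise ValueError("parsed and reference must have the same page count")

    labels = []
    for i, text in enumerate(parsed):
        if text.strip() != SENTINEL:
            labels.append("ok")        # parser returned real content
        elif reference is None:
            labels.append("unknown")   # cannot tell blank from failed
        elif reference[i].strip():
            labels.append("suspect")   # reference has text, parser did not
        else:
            labels.append("blank")     # both agree the page is empty
    return labels


def pages_to_review(labels: List[str]) -> List[int]:
    """Return 1-based page numbers that need a manual check."""
    return [n for n, label in enumerate(labels, start=1)
            if label in ("suspect", "unknown")]
```

For example, `pages_to_review(classify_pages(parsed, reference))` gives the pages worth opening by hand. Without a reference extraction, every marker page is labelled `unknown`, which is exactly the ambiguity described above.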
Side Note
One can see a similar loss of information in your own examples. Please refer to this notebook. Click this link to see the slides. Note that slide 9 is missing: CodeImage is the end of slide 8 and Data Ingestion is the top of slide 10.
NO_CONTENT_HERE means in this case that GPT-4o was not able to detect the text of the page. We will add your document to our test cases and try to make it work.
Example
I parsed desk.pdf with the web UI.
I downloaded the images of each page, and confirmed that the text was present.