Open krzischp opened 3 weeks ago
Hi!
Hi @hexapode, thanks for your answer!
It may be that the stamps are actually on a separate page in the PDF.
So the stamp is on the same page.
It may be that the delimiter \n---\n is actually returned by the parser (therefore accidentally creating a new page). For this one you may want to set the page_separator attribute to a new value that is unlikely to appear in the PDF.
After installing the main branch version of llamaparse, I tried another delimiter with split_by_page set to True (and with gpt-4o), but the parsed result pages still don't match the right PDF pages. For example, this time the 2nd, 3rd and 4th PDF pages all end up in the 2nd parsed page result (documents[1]).
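For reference, the delimiter suggestion quoted above can be illustrated without calling the parser at all. The sketch below only simulates parser output by joining two made-up pages. `make_page_separator` is a hypothetical helper, and the page texts are invented. It shows why a page containing "\n---\n" produces an extra split with the default delimiter and not with a unique one. As reported above, changing the delimiter did not fix the mismatch in this case, so this covers only that one hypothesis.

```python
import uuid


def make_page_separator() -> str:
    """Build a delimiter that is very unlikely to occur in document text.

    Hypothetical helper: the resulting string would be the value given to
    the page_separator attribute mentioned in the thread.
    """
    return f"\n<<<PAGE_BREAK_{uuid.uuid4().hex}>>>\n"


# Two invented pages; the first contains "\n---\n" in its own content
# (for example a horizontal rule emitted around a stamp).
pages = ["Clause 1\n---\nStamp text", "Clause 2"]

default_sep = "\n---\n"
custom_sep = make_page_separator()

# Simulate the parser's single-string output with each delimiter.
joined_default = default_sep.join(pages)
joined_custom = custom_sep.join(pages)

# Splitting on the default delimiter over-counts: 3 chunks for 2 pages.
print(len(joined_default.split(default_sep)))  # 3
# Splitting on the unique delimiter recovers the 2 original pages.
print(len(joined_custom.split(custom_sep)))    # 2
```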
It may be a bug on our part
And do you need me to run some tests or troubleshooting on my side and send you the results?
When using the llama-index PDFReader, for example, the pages match correctly, but the quality is obviously lower.
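One possible troubleshooting test, sketched under assumptions: take per-page text from a reader whose pages do match (such as the PDFReader output mentioned above) as the reference, and check which parsed page each reference page lands in. The function names and the word-overlap heuristic below are illustrative choices, not part of llama-parse or llama-index, and the sample strings are made up.

```python
from collections import defaultdict


def words(text: str) -> set[str]:
    """Lowercased set of whitespace-separated tokens in a page."""
    return set(text.lower().split())


def map_reference_to_parsed(reference_pages: list[str],
                            parsed_pages: list[str]) -> list[int]:
    """For each reference page, return the index of the parsed page
    that shares the most words with it (first index wins on ties)."""
    if not parsed_pages:
        raise ValueError("parsed_pages must not be empty")
    parsed_words = [words(p) for p in parsed_pages]
    mapping = []
    for ref in reference_pages:
        ref_words = words(ref)
        overlaps = [len(ref_words & pw) for pw in parsed_words]
        mapping.append(overlaps.index(max(overlaps)))
    return mapping


def merged_pages(mapping: list[int]) -> dict[int, list[int]]:
    """Parsed page index -> reference page indexes, kept only when
    several reference pages ended up in the same parsed page."""
    groups = defaultdict(list)
    for ref_index, parsed_index in enumerate(mapping):
        groups[parsed_index].append(ref_index)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Made-up data mimicking the report: reference pages 2-4 (0-based 1-3)
# all appear inside the second parsed page.
reference = ["alpha beta", "gamma delta", "epsilon zeta", "eta theta", "iota kappa"]
parsed = ["alpha beta", "gamma delta epsilon zeta eta theta", "iota kappa"]

mapping = map_reference_to_parsed(reference, parsed)
print(mapping)                # [0, 1, 1, 1, 2]
print(merged_pages(mapping))  # {1: [1, 2, 3]}
```

Word overlap is a crude measure and may mislead on pages that share a lot of boilerplate, so the output is a hint about where to look, not a proof.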
Hi!
Ideally could you share a similar document creating the same issue?
Alternatively, I will have a look at how the logic of split_by_page is implemented
Ideally could you share a similar document creating the same issue?
All these documents are confidential unfortunately...
Alternatively, I will have a look at how the logic of split_by_page is implemented
Thanks!
Hi @hexapode, this contract example is public: https://www.bndes.gov.br/arquivos/contratos-exportacao/2013.0237.pdf
It also seems to be parsed into more pages than the actual document when using gpt-4o: it was parsed into 48 pages when the document actually has 41 pages.
But when using llamaparse without gpt-4o (in text or markdown mode), it parses into the 41 pages correctly. So it might be an issue related to a gpt-4o post-processing step of the llama parser.
Hi @hexapode, do you have any update? I observed that the issue actually occurs with or without gpt-4o. It really seems that noise such as stamps, signatures, etc. on a contract PDF can confuse the parser about the page numbering.
Context
Hi, I'm parsing a PDF using version 0.4.4 of llama-parse and using gpt-4o:
I cannot share the PDF as it contains confidential information. It is a financial contract of 45 pages in Portuguese.
I implemented a utility function that uses the default delimiter "\n---\n" to split the document returned by llama-parse into pages.
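A minimal sketch of what such a utility might look like, assuming the parsed output is available as a single string. `split_into_pages` is a hypothetical name and the sample text is invented; this is not the function from the original post.

```python
DEFAULT_PAGE_SEPARATOR = "\n---\n"  # default delimiter named in this issue


def split_into_pages(parsed_text: str,
                     separator: str = DEFAULT_PAGE_SEPARATOR) -> list[str]:
    """Split the parser's single-string output into one string per page."""
    if not separator:
        raise ValueError("separator must be a non-empty string")
    return parsed_text.split(separator)


# Invented sample output with two delimiters, hence three pages.
sample_output = "Page one text\n---\nPage two text\n---\nPage three text"
sample_pages = split_into_pages(sample_output)
print(len(sample_pages))  # 3
```

With this approach every occurrence of the delimiter counts as a page break, so the number of resulting pages is always the number of delimiters in the output plus one.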
Issue
The resulting splits don't match the right pages.
I don't know how you delimit the PDF pages in the llamaparse functions, but it seems like a block of text in a smaller square (a stamp at the end of each page of the contract in my case) is identified as a separate page.
So for example, a PDF page with this stamp at the end will be split into 2 pages instead of one.