Issue with PDF page boundaries

run-llama / llama_parse

Parse files for optimal RAG

https://www.llamaindex.ai

Issue with PDF page boundaries #230

Open krzischp opened 3 weeks ago

krzischp commented 3 weeks ago

Context

Hi, I'm parsing a PDF with version 0.4.4 of llama-parse, with gpt-4o mode enabled:

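import os

from llama_parse import LlamaParse

max_timeout = 6000    # seconds to wait for the parsing job to finish
num_workers = 4       # number of parallel workers
check_interval = 10   # seconds between job status checks
gpt4o_mode = True     # assumed value: the run described here used gpt-4o

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
    language="pt",
    num_workers=num_workers,
    gpt4o_mode=gpt4o_mode,
    gpt4o_api_key=os.getenv('OPENAI_API_KEY'),
    max_timeout=max_timeout,
    check_interval=check_interval
)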

I cannot share the PDF as it contains confidential information. It is a financial contract of 45 pages in Portuguese.

I implemented a utility function that splits the document returned by llama-parse into pages, using the default delimiter "\n---\n".
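A minimal sketch of such a utility, assuming the parsed result is available as one string. The name split_pages is illustrative and not part of llama-parse:

```python
from __future__ import annotations

DEFAULT_PAGE_SEPARATOR = "\n---\n"  # the default delimiter mentioned above


def split_pages(parsed_text: str, separator: str = DEFAULT_PAGE_SEPARATOR) -> list[str]:
    """Split the parsed output into per-page chunks on the page separator."""
    if not separator:
        raise ValueError("separator must be a non-empty string")
    return [chunk.strip() for chunk in parsed_text.split(separator)]


sample = "Page one text\n---\nPage two text\n---\nPage three text"
pages = split_pages(sample)
print(len(pages))  # 3
```

Any "\n---\n" that appears inside a page is indistinguishable from a page break with this approach, so the number of chunks can exceed the number of PDF pages.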

Issue

The resulting splits don't match the actual PDF pages.

I don't know how llamaparse delimits the PDF pages, but it seems that a block of text inside a smaller box (in my case, a stamp at the bottom of each page of the contract) is treated as a page of its own.

So, for example, a PDF page with this stamp at the bottom is split into two pages instead of one.
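One way to check this hypothesis is to look for chunks that are much shorter than the rest, since a stamp split off as its own "page" would contain only a few words. The following is a rough heuristic sketch. The function name and the 0.2 ratio are assumptions chosen for illustration, not anything llama-parse provides:

```python
from statistics import median


def flag_short_chunks(chunks, ratio=0.2):
    """Return indices of chunks much shorter than the median chunk length."""
    lengths = [len(chunk.strip()) for chunk in chunks]
    if not lengths:
        return []
    threshold = ratio * median(lengths)
    return [i for i, n in enumerate(lengths) if n < threshold]


# Toy data: three full pages interleaved with two stamp-sized fragments.
chunks = ["x" * 1000, "stamp", "y" * 1200, "stamp 2", "z" * 900]
print(flag_short_chunks(chunks))  # [1, 3]
```

Flagged indices are only candidates: a genuinely short page (for example, a signature page) would be flagged as well.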

hexapode commented 3 weeks ago

Hi!

It may be a bug on our part :(
It may be that the stamps are actually on separate pages in the PDF.
It may be that the delimiter \n---\n is actually returned by the parser within a page (therefore accidentally creating a new page). For this one, you may want to set the page_separator attribute to a new value that is unlikely to appear in the PDF.
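To illustrate the last point, here is a small sketch of how a separator collision produces an extra page, and how a unique separator avoids it. The helper make_page_separator and the marker format are assumptions for illustration. Only the page_separator attribute comes from the suggestion above, and the actual parser call is left commented out because it needs llama-parse and an API key:

```python
import uuid


def make_page_separator() -> str:
    """Build a separator that is very unlikely to occur in any document."""
    return f"\n<<<PAGE_BREAK_{uuid.uuid4().hex}>>>\n"


separator = make_page_separator()

# Keyword arguments for the parser; page_separator is the attribute named above.
parser_kwargs = {
    "result_type": "markdown",
    "page_separator": separator,
}
# parser = LlamaParse(**parser_kwargs)  # requires llama-parse and an API key

# A Markdown horizontal rule inside a single page looks exactly like the
# default page delimiter, so splitting on the default yields two "pages".
one_page = "Contract text\n---\nStamp text"
print(len(one_page.split("\n---\n")))  # 2
print(len(one_page.split(separator)))  # 1
```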

krzischp commented 2 weeks ago

Hi @hexapode, thanks for your answer!

It may be that the stamps are actually on separate pages in the PDF.

No, the stamp is on the same page as the rest of the text.

It may be that the delimiter \n---\n is actually returned by the parser within a page (therefore accidentally creating a new page). For this one, you may want to set the page_separator attribute to a new value that is unlikely to appear in the PDF.

After installing llamaparse from the main branch, I tried another delimiter with split_by_page set to True (still with gpt-4o), but the parsed pages still don't match the PDF pages. For example, this time the 2nd, 3rd, and 4th PDF pages all end up in the 2nd parsed page (documents[1])...

It may be a bug on our part

Do you need me to run some tests or do some troubleshooting on my side and send you the results?

krzischp commented 2 weeks ago

When using the llama-index PDFReader, for example, the pages match correctly, but the extraction quality is obviously lower.
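Since a simpler reader gets the page boundaries right, its per-page text can serve as a reference to find out where each PDF page ended up in the parsed output. The following is a rough sketch that matches pages by word overlap. All function names are hypothetical, and both inputs are assumed to be plain lists of strings already extracted by the two tools:

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def best_matching_chunk(reference_page: str, parsed_chunks: list[str]) -> int:
    """Index of the parsed chunk sharing the most words with the reference page."""
    if not parsed_chunks:
        raise ValueError("parsed_chunks must not be empty")
    ref = tokens(reference_page)
    overlaps = [len(ref & tokens(chunk)) for chunk in parsed_chunks]
    return max(range(len(parsed_chunks)), key=overlaps.__getitem__)


def map_pages_to_chunks(reference_pages: list[str], parsed_chunks: list[str]) -> list[int]:
    """For each reference page, the index of the parsed chunk it most resembles."""
    return [best_matching_chunk(page, parsed_chunks) for page in reference_pages]


# Toy data: reference pages 2 and 3 were merged into one parsed chunk.
reference_pages = ["alpha beta gamma", "delta epsilon zeta", "eta theta iota"]
parsed_chunks = ["alpha beta gamma", "delta epsilon zeta eta theta iota"]
print(map_pages_to_chunks(reference_pages, parsed_chunks))  # [0, 1, 1]
```

Repeated indices in the result point to merged pages, and parsed chunks that no reference page maps to are candidates for spurious splits.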

hexapode commented 2 weeks ago

Hi!

Ideally, could you share a similar document that reproduces the issue?

Alternatively, I will have a look at how the logic of split_by_page is implemented.

krzischp commented 2 weeks ago

Ideally, could you share a similar document that reproduces the issue?

Unfortunately, all of these documents are confidential...

Alternatively, I will have a look at how the logic of split_by_page is implemented.

Thanks!

krzischp commented 1 week ago

Hi @hexapode, this example contract is public: https://www.bndes.gov.br/arquivos/contratos-exportacao/2013.0237.pdf

It also seems to be parsed into more pages than the actual document has when using gpt-4o: the output has 48 pages, while the document has 41.

But when using llamaparse without gpt-4o (in text or markdown mode), it is parsed into 41 pages correctly. So the issue might be related to a gpt-4o post-processing step in the parser.

krzischp commented 1 day ago

Hi @hexapode, do you have any update? I have since observed that the issue actually occurs both with and without gpt-4o. It really seems that noise such as stamps, signatures, etc. on a contract PDF can confuse the parser's page numbering.