run-llama / llama_parse

Unable to parse big single page pdf to md #233

Open laleet-avaiya opened 3 weeks ago

laleet-avaiya commented 3 weeks ago

I have an 8 MB PDF that has only one page, and parsing it to Markdown returns the error below.

2024-06-11 06:11:25,273 [INFO] HTTP Request: GET https://api.cloud.llamaindex.ai/api/parsing/job/9ff80260-7506-4c09-81e2-2758f720882a "HTTP/1.1 200 OK"
2024-06-11 06:11:28,834 [INFO] HTTP Request: GET https://api.cloud.llamaindex.ai/api/parsing/job/9ff80260-7506-4c09-81e2-2758f720882a "HTTP/1.1 200 OK"
2024-06-11 06:11:32,372 [INFO] HTTP Request: GET https://api.cloud.llamaindex.ai/api/parsing/job/9ff80260-7506-4c09-81e2-2758f720882a "HTTP/1.1 200 OK"
Error while parsing the file 'xxxxxxxxxxxxxxxxxxx': Failed to parse the file: 9ff80260-7506-4c09-81e2-2758f720882a, status: ERROR
2024-06-11 06:11:32,374 [WARNING] ⚠️ No content found for file: xxxxxxxxxxxxxxxxxxx

I am doing this in Python and am facing this issue.

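from llama_parse import LlamaParse

# api_key holds the LlamaCloud API key (defined elsewhere).
parser = LlamaParse(
    api_key=api_key,
    gpt4o_mode=True,  # sends an image of each page to GPT-4o
    result_type="markdown",
    num_workers=1,
    verbose=True,
    language="en",
)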

laleet_test_wp_image.pdf

hexapode commented 2 weeks ago

With GPT-4o mode we send an image of your page to the model. However, GPT-4o resizes every image to a maximum of 2048 px in width and height while preserving the aspect ratio. Because the page in your sample PDF is very long, the image gets resized to 2048 px in height, which results in an image where no text is legible.
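To make the effect of that resize concrete, here is a minimal sketch of an aspect-preserving fit into a 2048 px box. The function name `fit_within` and the 1200 x 40000 px page size are made up for illustration (the real dimensions of the attached PDF are not stated in this thread), and this is not the model's actual resizing code, only the arithmetic described above.

```python
def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Scale (width, height) down to fit inside a max_side x max_side box.

    The aspect ratio is preserved and images are never upscaled.
    """
    scale = min(max_side / width, max_side / height, 1.0)
    return round(width * scale), round(height * scale)


# Hypothetical render of a very long single page (assumed dimensions).
page_width, page_height = 1200, 40000

new_width, new_height = fit_within(page_width, page_height)
print(new_width, new_height)  # 61 2048
```

With these assumed numbers the whole page ends up about 61 px wide, so every line of text is squeezed into a few pixels and cannot be read.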

It seems that you generated this PDF with ImageMagick from an HTML file. Try sending the HTML to LlamaParse instead; it will get split into multiple 'pages', allowing GPT-4o to do its job. This may work.
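As a rough illustration of why splitting helps, the sketch below cuts one tall page into slices no taller than 2048 px, so that each slice can be sent without being shrunk. The helper `slice_ranges` and the 40000 px page height are hypothetical; how LlamaParse actually paginates HTML is not described in this thread.

```python
def slice_ranges(total_height: int, slice_height: int = 2048) -> list[tuple[int, int]]:
    """Return (top, bottom) pixel ranges that cover a page of total_height."""
    if total_height <= 0 or slice_height <= 0:
        raise ValueError("heights must be positive")
    return [
        (top, min(top + slice_height, total_height))
        for top in range(0, total_height, slice_height)
    ]


# Hypothetical 40000 px tall page (assumed height).
ranges = slice_ranges(40000)
print(len(ranges))   # 20
print(ranges[0])     # (0, 2048)
print(ranges[-1])    # (38912, 40000)
```

Each of the 20 slices is at most 2048 px tall, so text in it keeps its original size instead of being scaled down with the rest of the page.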