Parsing complex PDFs and outcomes

Hey there,

This thread is going to show my experience with LlamaParse, with and without the gpt-4o argument set.

Short context: I'm trying to build a RAG agent to answer user queries about natural medicines. For now I'm parsing a PDF to markdown, and the results are far from expected:

Sentences are sometimes reworded with different words of similar meaning; the sense is preserved but gets twisted a bit.
Despite a clear instruction and several attempts to improve it, a text block that starts on one page and continues onto the next gets lost.
Headlines at the top of the page are still read even though I instruct the parser to omit them.
Markdown formatting is inconsistent from one page to the next (you can see the plants and paragraphs on the 2nd page are marked differently).
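To make the first point measurable, here is a rough sketch (standard library only) that lists word-level replacements between a reference text and the parser output. The function name and the sample strings are made up for illustration; in practice the reference would be text copied by hand from the PDF, and this is only one possible way to check it.

```python
import difflib


def word_substitutions(reference: str, parsed: str) -> list[tuple[str, str]]:
    """Return (reference_words, parsed_words) pairs where the parser swapped wording.

    Both texts are split on whitespace and aligned word by word; only
    'replace' spans are reported, so pure insertions/deletions are ignored.
    """
    ref_words = reference.split()
    out_words = parsed.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=out_words, autojunk=False)

    swaps = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            swaps.append((" ".join(ref_words[i1:i2]), " ".join(out_words[j1:j2])))
    return swaps


# Hypothetical sample sentences, not taken from the actual PDF.
sample_reference = "the plant relieves pain quickly"
sample_parsed = "the plant eases pain quickly"
print(word_substitutions(sample_reference, sample_parsed))
```

Running the same check on the gpt-4o output and on the default output would show which mode rewrites the source more often.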

Source PDF: Encyclopedia of Herbal Medicine_part_4_p1-2_test_other_medicinal_plants.pdf
Markdown output with the gpt-4o option: Encyclopedia of Herbal Medicine_part_4_p1-2_gpt4o_other_medicinal_plants.md
Markdown output without gpt-4o: Encyclopedia of Herbal Medicine_part_4_p1-2_test_other_medicinal_plants.md

Also, here is the instruction for the parser:

The document is a collection of natural medicines knowledge. I want it to be parsed into markdown format by following this instruction: Omit botanical name in green rectangle at the top middle of a page. Read content from three columns; start from the left one and read to the bottom, then go to another column on the right. Some pages can start from the middle of text block from previous page so include it as it is. Ignore images, but include their captions. Save text blocks' headers with single hash mark and paragraphs' names with double hash mark. Don't include any HTML and CSS.
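For anyone who wants to reproduce the two runs, this is a rough sketch of how they could be set up, not necessarily the exact code used here. It assumes the llama_parse Python package with the `result_type`, `parsing_instruction`, and `gpt4o_mode` arguments as they were named around the time of this post, and an API key in the `LLAMA_CLOUD_API_KEY` environment variable; check these against your installed version. The helper names are hypothetical.

```python
import os

# Paste the full parser instruction quoted above into this string.
PARSING_INSTRUCTION = "The document is a collection of natural medicines knowledge."


def build_parser_kwargs(use_gpt4o: bool) -> dict:
    """Build keyword arguments for the parser; the only difference between
    the two runs is whether gpt4o_mode is switched on."""
    kwargs = {
        "result_type": "markdown",
        "parsing_instruction": PARSING_INSTRUCTION,
    }
    if use_gpt4o:
        kwargs["gpt4o_mode"] = True  # assumed argument name, verify locally
    return kwargs


def parse_pdf(path: str, use_gpt4o: bool):
    """Parse one PDF. The import is done lazily so the config helper above
    can be used without the package installed."""
    from llama_parse import LlamaParse

    parser = LlamaParse(
        api_key=os.environ["LLAMA_CLOUD_API_KEY"],
        **build_parser_kwargs(use_gpt4o),
    )
    return parser.load_data(path)
```

Keeping everything except `gpt4o_mode` identical makes it easier to attribute differences between the two markdown files to the mode itself.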

I'm sharing this because I hope it gets better and we eventually get a product that meets our needs. If AI is there to get a job done, it should at least follow our explicit instructions and not just interpret things for us. Feel free to analyze the files and draw your own conclusions.
