run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Parsing complex pdfs and outcomes #229

Open nyckoo opened 3 weeks ago

nyckoo commented 3 weeks ago

Hey there,

This thread documents my experience with LlamaParse, both with and without the gpt-4o option set.

Short context: I'm trying to build a RAG agent that answers user queries about natural medicines. For now I'm parsing a PDF to markdown, and the results are far from what I expected:

Source PDF: Encyclopedia of Herbal Medicine_part_4_p1-2_test_other_medicinal_plants.pdf

Markdown output with the gpt-4o option: Encyclopedia of Herbal Medicine_part_4_p1-2_gpt4o_other_medicinal_plants.md

Markdown output without gpt-4o: Encyclopedia of Herbal Medicine_part_4_p1-2_test_other_medicinal_plants.md

Here is the instruction I gave the parser:

The document is a collection of natural medicines knowledge. I want it to be parsed into markdown format by following this instruction: Omit botanical name in green rectangle at the top middle of a page. Read content from three columns; start from the left one and read to the bottom, then go to another column on the right. Some pages can start from the middle of text block from previous page so include it as it is. Ignore images, but include their captions. Save text blocks' headers with single hash mark and paragraphs' names with double hash mark. Don't include any HTML and CSS.
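For anyone who wants to reproduce the two runs, here is a minimal sketch of how the instruction could be passed to the parser. It assumes the `parsing_instruction`, `result_type`, and `gpt4o_mode` constructor arguments of `LlamaParse` and an API key in the `LLAMA_CLOUD_API_KEY` environment variable; argument names may differ between releases, and the helper names `build_parser_kwargs` and `parse_pdf` are made up for this example.

```python
from typing import Any


def build_parser_kwargs(instruction: str, use_gpt4o: bool) -> dict[str, Any]:
    """Collect constructor arguments for one parsing run."""
    kwargs: dict[str, Any] = {
        "result_type": "markdown",
        # Assumed argument name for the free-text instruction.
        "parsing_instruction": instruction,
    }
    if use_gpt4o:
        # Assumed flag name for the gpt-4o option mentioned in this thread.
        kwargs["gpt4o_mode"] = True
    return kwargs


def parse_pdf(pdf_path: str, instruction: str, use_gpt4o: bool) -> str:
    """Parse one PDF and return the concatenated markdown text."""
    # Imported lazily so build_parser_kwargs works without the package installed.
    from llama_parse import LlamaParse

    # With no api_key argument, the key is read from LLAMA_CLOUD_API_KEY.
    parser = LlamaParse(**build_parser_kwargs(instruction, use_gpt4o))
    documents = parser.load_data(pdf_path)
    return "\n\n".join(doc.text for doc in documents)
```

Calling `parse_pdf` twice on the same file, once with `use_gpt4o=True` and once with `use_gpt4o=False`, would give the two markdown variants to compare.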

I'm sharing this because I hope the parser gets better and we eventually get a product that meets our needs. If AI is there to get a job done, it should at least follow an explicit instruction rather than interpret things on its own. Feel free to analyze the files and draw your own conclusions.
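As a starting point for that analysis, here is a rough sketch of a checker for the parts of the instruction that can be tested mechanically: no HTML, no headings deeper than two hash marks, and no embedded images. The rules are only one reading of the instruction, and the function name `check_markdown` is hypothetical. It cannot verify column reading order or whether the botanical name was omitted.

```python
import re

HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")      # e.g. <br>, <div class="x">
HEADING = re.compile(r"^(#{1,6})\s")             # markdown ATX heading
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")      # ![alt](path)


def check_markdown(text: str) -> dict[str, list[int]]:
    """Return 1-based line numbers that break each checked rule."""
    issues: dict[str, list[int]] = {"html": [], "deep_heading": [], "image": []}
    for number, line in enumerate(text.splitlines(), start=1):
        if HTML_TAG.search(line):
            issues["html"].append(number)
        match = HEADING.match(line)
        # Only "#" and "##" are allowed by the instruction.
        if match and len(match.group(1)) > 2:
            issues["deep_heading"].append(number)
        if IMAGE.search(line):
            issues["image"].append(number)
    return issues
```

Running it on both markdown outputs and comparing the number of flagged lines per rule gives a quick, if partial, picture of which run stayed closer to the instruction.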