pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Unable to parse 2-column documents #41

rahul-dhir-0047 closed 4 months ago

rahul-dhir-0047 commented 5 months ago

Here is a preview of the document that I want to parse: Screenshot 2024-06-13 105053

Here is the parsed Markdown output that I am getting (same for Llama Docs): Screenshot 2024-06-13 105633

Facing the same issue.

JorjMcKie commented 5 months ago

Please always provide a reproducing document. An image is good for explaining things ... no more.

JorjMcKie commented 5 months ago

Duplicate of #40.

rahul-dhir-0047 commented 5 months ago

> Please always provide a reproducing document. An image is good for explaining things ... no more.

Sorry, I didn't attach it here. Please refer to the document below: Mediclassic-Individual-Insurance-Policy-v2.pdf

JorjMcKie commented 4 months ago

Partly solved in version 0.0.6.

The fix handles some cases, but as with table recognition, there will always be layouts that escape complete detection.
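For readers wondering why two-column pages are hard, here is a minimal sketch of column-aware reading order. It is not the algorithm PyMuPDF4LLM uses; it is a simplified illustration that assumes text blocks are already available as `(x0, y0, x1, y1, text)` tuples and that the page has exactly two equal-width columns. The function name `reading_order` and the sample coordinates are hypothetical.

```python
from typing import List, Tuple

# Hypothetical block format: (x0, y0, x1, y1, text), origin at top-left.
Block = Tuple[float, float, float, float, str]


def reading_order(blocks: List[Block], page_width: float) -> List[str]:
    """Order block texts for a simple two-column page.

    Assumption: the column gutter sits at the horizontal midline.
    Blocks crossing the midline (e.g. a title) are treated as full-width
    and split the page into bands; within a band, the left column is
    read completely before the right column.
    """
    mid = page_width / 2
    ordered: List[str] = []
    left: List[str] = []
    right: List[str] = []

    def flush() -> None:
        # Emit the pending band: left column first, then right column.
        ordered.extend(left)
        ordered.extend(right)
        left.clear()
        right.clear()

    # Walk blocks from top to bottom.
    for x0, y0, x1, y1, text in sorted(blocks, key=lambda b: b[1]):
        if x0 < mid < x1:        # full-width block crossing the gutter
            flush()
            ordered.append(text)
        elif x1 <= mid:          # entirely in the left column
            left.append(text)
        else:                    # entirely in the right column
            right.append(text)
    flush()
    return ordered


sample = [
    (310, 100, 550, 200, "R1"),
    (50, 40, 550, 70, "Title"),
    (50, 220, 290, 300, "L2"),
    (50, 100, 290, 200, "L1"),
    (310, 220, 550, 300, "R2"),
]
print(reading_order(sample, page_width=600))
```

A naive top-to-bottom, left-to-right sort would interleave the columns (`L1`, `R1`, `L2`, `R2`), which is the kind of scrambled output reported in this issue. Even this sketch breaks on uneven column widths, three or more columns, or sidebars, which is consistent with the remark that some layouts will always escape detection.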