pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Unable to parse double column pdf #40

Closed smallzhao closed 4 months ago

smallzhao commented 5 months ago

0B3168BDCDA63212BC25EDF6681AE1EF.pdf

src_pdf: [image]

dst_md: [image]

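import pathlib

import pymupdf4llm

# With page_chunks=True, to_markdown returns a list of per-page dicts, not a single string.
page_chunks = pymupdf4llm.to_markdown("/Users/zhaowenhua/Downloads/期刊/pdf/0B3168BDCDA63212BC25EDF6681AE1EF.pdf", page_chunks=True)

# Join the Markdown text of every page before writing the output file.
md_text = "\n".join(chunk["text"] for chunk in page_chunks)
pathlib.Path("/Users/zhaowenhua/Downloads/期刊/pdf/output.md").write_bytes(md_text.encode())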

I am using pymupdf4llm==0.0.5, and the output does not separate the two columns of the PDF. The code, the file, and the generated result are shown above. Is there a parameter I need to set so that the two columns are separated?

JorjMcKie commented 5 months ago

Your PDF was created in an unusual way: almost every line is a separate text block. This currently confuses the column identification algorithm.

I have developed a fix which will be published with the next version.
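
The "almost every line is a separate text block" pattern can be checked with a short diagnostic. This is only a sketch: the function name `single_line_block_ratio` and the sample data are hypothetical, and the input is assumed to have the shape of the "blocks" list that PyMuPDF returns from `page.get_text("dict")`, where text blocks have `"type": 0` and a `"lines"` list.

```python
def single_line_block_ratio(blocks):
    """Return the fraction of text blocks that contain exactly one line.

    `blocks` is assumed to look like page.get_text("dict")["blocks"] in
    PyMuPDF: text blocks have type 0 and carry a "lines" list.
    """
    text_blocks = [b for b in blocks if b.get("type") == 0]
    if not text_blocks:
        return 0.0
    single = sum(1 for b in text_blocks if len(b.get("lines", [])) == 1)
    return single / len(text_blocks)


# Hypothetical sample data standing in for the blocks of one page.
sample_blocks = [
    {"type": 0, "lines": [{"spans": []}]},
    {"type": 0, "lines": [{"spans": []}]},
    {"type": 0, "lines": [{"spans": []}, {"spans": []}]},
    {"type": 1},  # image block, ignored by the ratio
]
ratio = single_line_block_ratio(sample_blocks)
```

For a real file, the blocks of the first page would come from something like `pymupdf.open(path)[0].get_text("dict")["blocks"]`. A ratio close to 1.0 suggests the page has the structure described above.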

JorjMcKie commented 4 months ago

Partly solved in version 0.0.6.

The fix resolves some of the problems, but as with table recognition, there will always be cases that escape complete detection.
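
For pages that still escape detection, a manual fallback is one option. The sketch below is not the algorithm used by pymupdf4llm. It is a naive assumption-based approach that splits blocks at the horizontal middle of the page and reads the left column before the right. The function name `two_column_order` and the sample coordinates are hypothetical. The input tuples are a trimmed form of what PyMuPDF's `page.get_text("blocks")` returns, which starts with `(x0, y0, x1, y1, text, ...)`.

```python
def two_column_order(blocks, page_width):
    """Return block texts ordered left column first, then right column.

    Each block is a tuple (x0, y0, x1, y1, text). A block belongs to the
    left column if its horizontal center lies left of the page middle.
    """
    middle = page_width / 2
    left, right = [], []
    for block in blocks:
        x0, _y0, x1, _y1, _text = block
        center = (x0 + x1) / 2
        (left if center < middle else right).append(block)

    def top_to_bottom(block):
        # Sort by top edge first, then by left edge as a tie-breaker.
        return (block[1], block[0])

    ordered = sorted(left, key=top_to_bottom) + sorted(right, key=top_to_bottom)
    return [block[4] for block in ordered]


# Hypothetical single-line blocks from a two-column page, in row order.
sample = [
    (50, 100, 280, 112, "L1"),
    (320, 100, 550, 112, "R1"),
    (50, 114, 280, 126, "L2"),
    (320, 114, 550, 126, "R2"),
]
ordered = two_column_order(sample, page_width=600)
```

This handles pages where every line is its own block, because it never relies on block grouping. It will misplace full-width elements such as titles, footnotes, or wide tables, so it only suits pages with a clean two-column body.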