Unable to parse double column pdf

smallzhao commented 5 months ago

src_pdf: 0B3168BDCDA63212BC25EDF6681AE1EF.pdf

dst_md:

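import pathlib

import pymupdf4llm

# With page_chunks=True, to_markdown() returns a list of per-page dicts, not a single string.
chunks = pymupdf4llm.to_markdown("/Users/zhaowenhua/Downloads/期刊/pdf/0B3168BDCDA63212BC25EDF6681AE1EF.pdf", page_chunks=True)

# Join the Markdown text of every page before writing the file.
md_text = "\n".join(chunk["text"] for chunk in chunks)
pathlib.Path("/Users/zhaowenhua/Downloads/期刊/pdf/output.md").write_bytes(md_text.encode())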

I am using pymupdf4llm==0.0.5 and cannot get the two columns of the PDF separated. The code and file I used, along with the generated result, are shown above. Do I need to set other parameters to separate the two columns?

JorjMcKie commented 5 months ago

Your PDF is unusual in that almost every line is a separate text block, which currently confuses the column identification algorithm.
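One way to check whether a PDF has this property is to count how many of its text blocks contain only a single line. The following is a minimal sketch, assuming blocks in the tuple form that PyMuPDF's `page.get_text("blocks")` returns; the helper `single_line_ratio` is a hypothetical name and is not part of pymupdf4llm.

```python
def single_line_ratio(blocks):
    """Return the fraction of text blocks that hold exactly one line.

    `blocks` is assumed to be a list of tuples shaped like the items of
    PyMuPDF's page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type), where block_type 0 is text.
    """
    # Keep only non-empty text blocks (skip image blocks and blank text).
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]
    if not text_blocks:
        return 0.0
    single = sum(1 for b in text_blocks if len(b[4].strip().splitlines()) == 1)
    return single / len(text_blocks)


# Hand-made sample: two one-line blocks and one three-line block.
sample = [
    (50, 100, 290, 112, "First line\n", 0, 0),
    (50, 114, 290, 126, "Second line\n", 1, 0),
    (310, 100, 550, 140, "A normal\nmulti-line\nparagraph\n", 2, 0),
]
print(f"{single_line_ratio(sample):.2f}")
```

With a recent PyMuPDF installed, the blocks of the first page could be obtained with `pymupdf.open(path)[0].get_text("blocks")` (older versions use `import fitz`). A ratio close to 1 would match the situation described here.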

I have developed a fix which will be published with the next version.

JorjMcKie commented 4 months ago

Partly solved in version 0.0.6.

The fix solves some of the problems, but as with table recognition, there will always be cases that escape complete detection.
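For pages that still escape detection, a crude manual fallback is to sort the blocks yourself. The sketch below is a naive midline heuristic for illustration only, not the algorithm pymupdf4llm uses; the function name `two_column_order` and the `page_width` parameter are assumptions, and full-width elements such as titles or wide tables would be misplaced by it.

```python
def two_column_order(blocks, page_width):
    """Order text blocks for a plain two-column page.

    Naive heuristic (not pymupdf4llm's algorithm): a block belongs to the
    left column if its horizontal center lies left of the page midline.
    `blocks` are tuples starting with (x0, y0, x1, y1, text, ...).
    """
    midline = page_width / 2
    left = [b for b in blocks if (b[0] + b[2]) / 2 < midline]
    right = [b for b in blocks if (b[0] + b[2]) / 2 >= midline]

    def reading_key(b):
        # Within a column, read top to bottom (y0), then left to right (x0).
        return (b[1], b[0])

    return sorted(left, key=reading_key) + sorted(right, key=reading_key)


# Hand-made sample in raw top-to-bottom order, where one-line blocks
# from the two columns interleave.
sample = [
    (50, 100, 290, 112, "L1"),
    (310, 100, 550, 112, "R1"),
    (50, 114, 290, 126, "L2"),
    (310, 114, 550, 126, "R2"),
]
ordered = [b[4] for b in two_column_order(sample, page_width=600)]
print(ordered)  # ['L1', 'L2', 'R1', 'R2']
```

The page width is passed in explicitly so the function stays independent of any PDF library; with PyMuPDF it could be taken from `page.rect.width`.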

pymupdf / RAG, issue #40