pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF

https://pymupdf.readthedocs.io/en/latest/pymupdf4llm

GNU Affero General Public License v3.0

Problem with multiple columns in simple text #135

Closed by pascucg 2 weeks ago

pascucg commented 2 weeks ago

Hello, in some cases multi-column text is not processed correctly. The output jumps from one column to the other, so the extracted text ends up out of order.

Here is an example PDF where this occurs: ejemplo.pdf

The problem is shown below: column_error_example

If I process the PDF with pymupdf, the columns come out in the correct order: column_correct_example

JorjMcKie commented 2 weeks ago

I found the problem: the joining of the original text blocks happens too aggressively. The page number at the bottom gets joined with its neighbors, and this recursively causes all the text on the page to be joined into one big block. The result is nonsense output.

As a quick fix, you can use margins=(0, 0, 0, 72) to ignore the page number block.
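To illustrate what that margins value means, here is a minimal sketch. It assumes the (left, top, right, bottom) ordering described in the pymupdf4llm documentation, page coordinates in points with the origin at the top left, and a "block must lie fully inside" rule. The helper names (normalize_margins, content_rect, is_inside) and the block coordinates are hypothetical and are not part of the library, which may apply its margins differently internally.

```python
def normalize_margins(margins):
    """Expand a margins value to (left, top, right, bottom) in points.

    Assumption: a single number applies to all four sides, and a pair
    means (top, bottom) with zero left/right margins.
    """
    if isinstance(margins, (int, float)):
        return (float(margins),) * 4
    values = tuple(float(v) for v in margins)
    if len(values) == 2:
        top, bottom = values
        return (0.0, top, 0.0, bottom)
    if len(values) == 4:
        return values
    raise ValueError("margins must be a number or a sequence of 2 or 4 numbers")


def content_rect(page_width, page_height, margins):
    """Return the (x0, y0, x1, y1) area left after removing the margins."""
    left, top, right, bottom = normalize_margins(margins)
    return (left, top, page_width - right, page_height - bottom)


def is_inside(bbox, rect):
    """True if the block bbox lies completely inside rect."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = rect
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


# A4 page (595 x 842 points); 72 points is one inch.
rect = content_rect(595, 842, (0, 0, 0, 72))

body_block = (56, 100, 290, 700)          # hypothetical column text
page_number_block = (290, 800, 305, 815)  # hypothetical footer page number

print(is_inside(body_block, rect))         # True: kept
print(is_inside(page_number_block, rect))  # False: ignored
```

With a 72-point bottom margin, anything in the last inch of the page falls outside the considered area, which is why the page number block no longer takes part in the joining.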

JorjMcKie commented 2 weeks ago

Fixed in version 0.0.15.

pascucg commented 2 weeks ago

Hello @JorjMcKie,

Thank you for the fix. I have tested it in version 0.0.16 and it works correctly.

While running more tests I found another problem: in some documents the first line of the page is omitted.

Here is an example: ejemplo.pdf

ejemplo

JorjMcKie commented 2 weeks ago

No, this works for margins=0:

pascucg commented 2 weeks ago

You are right. Using

doc = pymupdf4llm.to_markdown('ejemplo.pdf', page_chunks=True, margins=0)

it works correctly.

Thank you
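For anyone who wants to check their own files for the missing first line, a small sketch follows. It assumes that with page_chunks=True each page comes back as a dict with a "text" key, as described in the pymupdf4llm documentation. The helper names first_lines and check_pdf are hypothetical.

```python
def first_lines(chunks):
    """Return the first non-empty line of text for each page chunk."""
    result = []
    for chunk in chunks:
        lines = [ln for ln in chunk["text"].splitlines() if ln.strip()]
        result.append(lines[0] if lines else "")
    return result


def check_pdf(path):
    """Convert a PDF with margins=0 and report the first line of each page."""
    # Imported here so first_lines() can be used without the library installed.
    import pymupdf4llm

    chunks = pymupdf4llm.to_markdown(path, page_chunks=True, margins=0)
    return first_lines(chunks)
```

Calling check_pdf('ejemplo.pdf') and comparing the result against the PDF shows quickly whether the top line of each page made it into the output.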