pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

to_markdown() for two-column pdf #8

Closed MahtabF closed 4 months ago

MahtabF commented 4 months ago

Thanks for this useful library! pymupdf4llm.to_markdown() does a great job of reading complex table formats properly. However, I noticed that when I use it on a two-column PDF file, it mixes paragraphs from the left column with those from the right column.

Also, I don't see this problem when I simply open the PDF file with the fitz library. So the problem arises somewhere after the PDF file is read!

MahtabF commented 4 months ago

Well, actually, in the write_text function within pymupdf4llm.to_markdown(), I set sort to False and it works properly:

blocks = page.get_text(
    "dict",
    clip=clip,
    flags=fitz.TEXTFLAGS_TEXT,
    sort=False,
)["blocks"]

What is the use case for setting sort to True?
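
To make the effect concrete, here is a minimal sketch in plain Python (no PyMuPDF required). It assumes that sort=True orders the blocks by vertical and then horizontal position, which is how the PyMuPDF documentation describes the option. The block tuples and their coordinates are made up for illustration.

```python
# Hypothetical text blocks on a two-column page: (x0, y0, text).
# Here the storage order happens to be the reading order:
# left column first, then right column.
blocks = [
    (50, 100, "left paragraph 1"),
    (50, 200, "left paragraph 2"),
    (320, 100, "right paragraph 1"),
    (320, 200, "right paragraph 2"),
]


def sort_like_page(blocks):
    """Order blocks top-to-bottom, then left-to-right.

    This mimics the assumed behaviour of sort=True; it is not PyMuPDF code.
    """
    return sorted(blocks, key=lambda b: (b[1], b[0]))


unsorted_text = [b[2] for b in blocks]
sorted_text = [b[2] for b in sort_like_page(blocks)]

print(unsorted_text)
print(sorted_text)
```

With these made-up coordinates, the sorted result alternates between the two columns, which matches the mixing described above.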

JorjMcKie commented 4 months ago

There is no way to predict the sequence in which the text objects are physically stored in a page's appearance source code (yes: it is a special language, similar to PostScript).

The PDF creator could have been mean and have used an arbitrary permutation out of the N! alternatives to arrange the N characters appearing on a page. Not frequent - but it happens!

For an example, compare the plain text extraction of the following two identical-looking PDFs: file 1, file 2.

Much more often, page headers and footers are added to already existing pages. Or think of tables: their cells may have been filled in some sequence determined by when the appropriate content became available.
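
As a small illustration of the header/footer case, the sketch below uses invented block positions in plain Python. It only assumes that sorting means "top-to-bottom, then left-to-right"; nothing here is taken from the library's code.

```python
# Hypothetical blocks in storage order: (x0, y0, text).
# The body was written first; header and footer were stamped on afterwards.
stored = [
    (72, 120, "First body paragraph"),
    (72, 300, "Second body paragraph"),
    (72, 40, "Page header"),
    (72, 760, "Page footer"),
]

# Reading the blocks as stored puts the header after the body text.
storage_order = [text for _, _, text in stored]

# Sorting by vertical, then horizontal position restores the visual order.
visual_order = [text for _, _, text in sorted(stored, key=lambda b: (b[1], b[0]))]

print(storage_order)
print(visual_order)
```

This is the kind of page where sorting helps: the text comes out in the order a reader sees it, regardless of when each piece was written to the page.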

Or look at the following diagram. Let's assume we have correctly identified the table and now want to put the 4 blocks into some sequence. Which is the right one: block1, block2, block3, block4? Or rather block1, block3, block2, block4?

+-----------+   +------------+
|   block1  |   |   block2   |
+-----------+   +------------+

+----------------------------+
|       table                |
+----------------------------+

+-----------+   +------------+
|   block3  |   |   block4   |
+-----------+   +------------+
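
Both candidate sequences fall out of a simple sort, depending on which coordinate is compared first. The sketch below uses made-up coordinates for the four blocks of the diagram (the table is left out) and is only meant to show that neither ordering is "wrong" from a purely geometric point of view.

```python
# Made-up top-left corners for the four text blocks: name -> (x0, y0).
blocks = {
    "block1": (50, 100),
    "block2": (320, 100),
    "block3": (50, 500),
    "block4": (320, 500),
}

# Row-wise: compare the vertical position first, then the horizontal one.
row_wise = sorted(blocks, key=lambda name: (blocks[name][1], blocks[name][0]))

# Column-wise: compare the horizontal position first, then the vertical one.
column_wise = sorted(blocks, key=lambda name: (blocks[name][0], blocks[name][1]))

print(row_wise)
print(column_wise)
```

Which of the two is the intended reading order depends on whether the table separates two independent sections or merely interrupts two continuous columns, and that cannot be read off the coordinates alone.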

Here is another example from a science magazine page, apparently having 3 columns (really!?). I have extracted the text blocks and wrapped them in red rectangles. The sequence in which they are stored is written at each block's top-left point.

image

Not sorting the blocks will deliver the nonsense sequence 0, 1, 2, ... 10. Sorting them will still not be good, but at least somewhat better:

image

We probably can agree that a solution to this problem is not simple.

JorjMcKie commented 4 months ago

I have to correct myself. The latest version 0.0.2 solves the above problem. It automatically identifies the page columns and processes them as follows. image
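
The column detection itself lives inside pymupdf4llm and is not reproduced here. As a rough illustration of the general idea only, the following sketch groups blocks into columns by horizontal overlap and then reads each column top-to-bottom. The function name order_by_columns, the block tuples and the coordinates are all hypothetical, and this naive grouping is an assumption for illustration, not the library's algorithm.

```python
def order_by_columns(blocks):
    """Return blocks column by column (left to right), each column top to bottom.

    Blocks are (x0, y0, x1, y1, text) tuples. A block joins the current column
    if it starts to the left of that column's right edge; otherwise it opens
    a new column.
    """
    columns = []  # each entry: [left edge, right edge, list of member blocks]
    for block in sorted(blocks, key=lambda b: b[0]):
        x0, _, x1, _, _ = block
        if columns and x0 < columns[-1][1]:
            # Overlaps the current column horizontally: extend that column.
            columns[-1][1] = max(columns[-1][1], x1)
            columns[-1][2].append(block)
        else:
            columns.append([x0, x1, [block]])

    ordered = []
    for _, _, members in columns:
        ordered.extend(sorted(members, key=lambda b: b[1]))
    return ordered


# A two-column page whose blocks are stored in an arbitrary sequence.
page = [
    (320, 100, 560, 180, "right 1"),
    (50, 200, 290, 280, "left 2"),
    (320, 200, 560, 280, "right 2"),
    (50, 100, 290, 180, "left 1"),
]
reading_order = [b[4] for b in order_by_columns(page)]
print(reading_order)
```

Note that a full-width element, such as a table spanning both columns, would merge everything into a single column under this simple rule, so a real implementation needs to handle such elements separately.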