pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.6k stars, 523 forks

Allow table extraction to handle merged cells #4030

Open leorouxx opened 6 days ago

leorouxx commented 6 days ago

My Problem: I mainly use the pymupdf4llm framework, but I believe the root problem comes from how table extraction is performed in pymupdf. I have PDFs with tables that contain horizontally and/or vertically merged cells. A basic example was attached as an image.

For the example above (assuming it is stored as text in the PDF), the typical output would be:

Class Student Test Grades
A 1 96
87
2 83
85

Information is thus lost for the rows that do not themselves carry the content of the merged cell.

The request: It would be nice to be able to retrieve the cell organization of a table in some form (maybe as a table extraction strategy?).

Existing Alternatives: A paid, closed-source alternative is Azure Document Intelligence (with layout extraction). The output for an extracted table is formatted JSON in which each cell contains its coordinates and the span it occupies in the table. For example, the cell containing the "A" would be (partially) represented as: { "coordinates" : [1,0], "content" : "A", "row_span" : 4 }

Even with such raw data it would be possible to recreate the actual structure of the table.
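As a rough illustration of that last point, here is a minimal sketch that rebuilds a dense grid from cell records shaped like the JSON fragment above. The function name `expand_cells`, the `column_span` key, and the reading of `coordinates` as `[row, column]` are assumptions made for this example, not the actual Azure schema.

```python
def expand_cells(cells, n_rows, n_cols):
    """Rebuild a dense grid from cells that carry a position and spans.

    Hypothetical helper: assumes "coordinates" means [row, column] and
    that a missing "row_span" / "column_span" key means a span of 1.
    """
    grid = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        row, col = cell["coordinates"]
        for r in range(row, row + cell.get("row_span", 1)):
            for c in range(col, col + cell.get("column_span", 1)):
                grid[r][c] = cell["content"]
    return grid


# Cell records for the example table from the first comment.
cells = [
    {"coordinates": [0, 0], "content": "Class"},
    {"coordinates": [0, 1], "content": "Student"},
    {"coordinates": [0, 2], "content": "Test Grades"},
    {"coordinates": [1, 0], "content": "A", "row_span": 4},
    {"coordinates": [1, 1], "content": "1", "row_span": 2},
    {"coordinates": [1, 2], "content": "96"},
    {"coordinates": [2, 2], "content": "87"},
    {"coordinates": [3, 1], "content": "2", "row_span": 2},
    {"coordinates": [3, 2], "content": "83"},
    {"coordinates": [4, 2], "content": "85"},
]

grid = expand_cells(cells, n_rows=5, n_cols=3)
for row in grid:
    print(row)
```

Because the spans are explicit, every row can be filled without guessing which neighbour a blank cell belongs to.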

JorjMcKie commented 4 days ago

Specifically for Markdown output (PyMuPDF4LLM, or Table.to_markdown()), we cannot map complex table structures: this is not supported by the Markdown syntax itself. And we will certainly not pollute MD output by mixing HTML syntax into it (even though MD renderers may support this).

Whether we may one day support HTML or JSON table output (in PyMuPDF only) remains to be decided. Currently, there are no such plans.

All we could potentially do in MD is duplicate the values column-wise or row-wise. That is, given a table with merged cells (pictured in the original comment), instead of returning this:

|Class|Student|Test Grades|
|---|---|---|
|A|1|95|
|||93|
||2|90|
|||85|
|B|Two cells||

We may consider returning the following:

|Class|Student|Test Grades|
|---|---|---|
|A|1|95|
|A|1|93|
|A|2|90|
|A|2|85|
|B|Two cells|TWO cells|


However:

This is error-prone and ambiguous. The original data extraction looks like this:

['Class', 'Student', 'Test Grades']
['A', '1', '95']
[None, None, '93']
[None, '2', '90']
[None, None, '85']
['B', 'Two cells', None]

While replicating data top-down in columns 1 and 2 seems obvious and unavoidable (non-present cells are represented by None), the bottom-right None is less clear: should it receive the value "Two cells" or "85"?
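The ambiguity can be made concrete with a minimal sketch that applies two equally plausible fill strategies to the extracted rows above. The helper names `fill_down` and `fill_right` are hypothetical and not part of PyMuPDF; they only illustrate the two candidate interpretations.

```python
rows = [
    ["Class", "Student", "Test Grades"],
    ["A", "1", "95"],
    [None, None, "93"],
    [None, "2", "90"],
    [None, None, "85"],
    ["B", "Two cells", None],
]


def fill_down(rows):
    """Replace each None with the nearest value above it in the same column."""
    out, last = [], [None] * len(rows[0])
    for row in rows:
        last = [cell if cell is not None else prev for cell, prev in zip(row, last)]
        out.append(last)
    return out


def fill_right(rows):
    """Replace each None with the nearest value to its left in the same row."""
    out = []
    for row in rows:
        filled, prev = [], None
        for cell in row:
            prev = cell if cell is not None else prev
            filled.append(prev)
        out.append(filled)
    return out


down = fill_down(rows)
right = fill_right(rows)
print(down[-1])   # ['B', 'Two cells', '85']
print(right[-1])  # ['B', 'Two cells', 'Two cells']
```

Both strategies agree on the rows for class A when filling downward, but they disagree on the last row, and nothing in the extracted data says which reading is correct.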

leorouxx commented 2 days ago

Hello, thank you for your answer.

To be clear, my suggestion is not about a small tweak to partially recover the lost information in Markdown. As you mentioned, Markdown has limitations that would make such a process ambiguous or wrong. I believe the core of the problem lies in the table extraction algorithm itself. If an improvement is to be made, I feel it should be made at that level.

JorjMcKie commented 1 day ago

OK, got you. I have one correction to your wording, if you don't mind: the table finder does not "forget" row or column spanning information. That would imply that this sort of information is actually there and just wasn't discovered.

This is not the case! PDF tables are not tables: what we see and interpret as a table (lines, text, matching left or right borders, shaded backgrounds, what have you) is, in a PDF, literally nothing but an amorphous collection of characters and vector graphic atoms. So the table finder does not actually locate the table; it synthesizes the mentioned components and tries to impose a "meaning" or structure that leads to a successful interpretation as a table. Column or row spans would therefore also have to be the outcome of such an interpretation. Again, they are nothing that can be "found". So the None values in certain cells are all that we will ever have: it won't get better than that.
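To illustrate what "the outcome of an interpretation" means in practice, here is a minimal sketch of a heuristic that turns runs of None below a value into a guessed row span, using the extracted rows shown earlier in this thread. The function `guess_row_spans` and its output format are hypothetical and not part of PyMuPDF; the point is that the spans are inferred, not read from the file.

```python
rows = [
    ["Class", "Student", "Test Grades"],
    ["A", "1", "95"],
    [None, None, "93"],
    [None, "2", "90"],
    [None, None, "85"],
    ["B", "Two cells", None],
]


def guess_row_spans(rows):
    """Interpret each run of None below a value as a vertical merge.

    This is a guess: a None may just as well be an empty cell or part
    of a horizontal merge.
    """
    cells = []
    for c in range(len(rows[0])):
        r = 0
        while r < len(rows):
            if rows[r][c] is None:
                r += 1  # leading None with no value above it: skip
                continue
            span = 1
            while r + span < len(rows) and rows[r + span][c] is None:
                span += 1
            cells.append(
                {"coordinates": [r, c], "content": rows[r][c], "row_span": span}
            )
            r += span
    return cells


cells = guess_row_spans(rows)
spans = {tuple(cell["coordinates"]): cell["row_span"] for cell in cells}
print(spans[(1, 0)])  # 4: "A" is guessed to span four rows
print(spans[(4, 2)])  # 2: "85" is guessed to span two rows
```

The guess for "A" matches what a human reader sees, but the same rule also claims that "85" spans two rows, which is exactly the bottom-right ambiguity discussed above.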