jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

how to get colspan or rowspan info in the table? #927

Closed tujinshu closed 1 year ago

tujinshu commented 1 year ago

Describe the bug

test_span.pdf


In the table, columns 1 to 4 of row 0 are merged into a single cell at row 0, column 1. When the table is extracted, columns 2, 3, and 4 come back as None:

[['type', 'cost', None, None, None, 'cost', None, None, None],
 ['poc', 'before', None, None, None, 'after', None, None, None],
 [None, 'deploy', 'dev', 'test', 'support', 'deploy', 'dev', 'test', 'support'],
 [None, '100', '200', '50', '50', '30', '200', '40', '20'],
 ['uat', '200', '400', '555', '666', '201', '401', '557', '668'],
 ['prod', '300', '600', '700', '900', '301', '601', '701', '901']]

Can pdfplumber produce HTML-like output that shows which cells are merged?

type cost  cost 
poc before after
deploy dev test support deploy dev test support

The HTML for the header rows is:

<html>
<head>
</head>
<table border="1 " width="900">
    <tr>
        <td width="11% "> type </td>
        <td width="11% " colspan="4"> cost  </td>
        <td width="11% " colspan="4"> cost  </td>
    </tr>
    <tr>
        <td width="11% " rowspan="2" > poc </td>
        <td width="11% " colspan="4"> before </td>
        <td width="11% " colspan="4"> after </td>
    </tr>
    <tr>
        <td width="11% "> deploy </td>
        <td width="11% "> dev </td>
        <td width="11% "> test </td>
        <td width="11% "> support </td>
        <td width="11% "> deploy </td>
        <td width="11% "> dev </td>
        <td width="11% "> test </td>
        <td width="11% "> support </td>
    </tr>
</table>
</html>

Code to reproduce the problem

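import pdfplumber

pdf = pdfplumber.open("./data/test_span.pdf")
p0 = pdf.pages[0]
# keep_visible_lines is a user-defined filter function; its definition was not included in this report
p0 = p0.filter(keep_visible_lines)
im = p0.to_image()
im.debug_tablefinder()
tables = p0.find_tables()
tables[0].extract()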

PDF file

test_span.pdf

Expected behavior

pdfplumber should expose the colspan/rowspan relationships between cells, as the HTML above does.

Actual behavior

The positions covered by a merged cell are filled with None instead.


cmdlineluser commented 1 year ago

I know that you can use .find_tables() to get the table objects.

You can look at their .rows, .cells, etc.

>>> table = page.find_tables()[0]
>>> rows = table.rows
>>> rows
[<pdfplumber.table.Row at 0x1437ab400>,
 <pdfplumber.table.Row at 0x1375c0a00>,
 <pdfplumber.table.Row at 0x137216590>,
 <pdfplumber.table.Row at 0x137214e50>,
 <pdfplumber.table.Row at 0x137214430>,
 <pdfplumber.table.Row at 0x137216200>]

im.reset().draw_rect(rows[1].cells[0], stroke_width=5)


im.reset().draw_rect(rows[1].cells[1], stroke_width=5)


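# "type" cell: height relative to the neighbouring cell in the same row
>>> (rows[0].cells[0][-1] - rows[0].cells[0][1]) / (rows[0].cells[1][-1] - rows[0].cells[1][1])
1.0
# "poc" cell: about three times as tall as its neighbour, so it spans three rows
>>> (rows[1].cells[0][-1] - rows[1].cells[0][1]) / (rows[1].cells[1][-1] - rows[1].cells[1][1])
3.0103686635944618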

Not sure if pdfplumber attempts to use this information or not.
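
The height ratios above suggest the spans can be recovered from cell geometry alone. Here is a minimal sketch of one way to do that, assuming each cell is an (x0, top, x1, bottom) tuple as in table.cells: collect the distinct cell boundaries on each axis, then count how many grid intervals each cell covers. The helper names (grid_edges, nearest_index, cell_span), the tol parameter, and the sample coordinates are illustrative and not part of pdfplumber.

```python
def grid_edges(cells, axis, tol=1.0):
    """Collect distinct cell boundaries along one axis (0 = x, 1 = y).

    Boundaries closer together than `tol` are treated as the same grid line.
    """
    values = sorted(v for c in cells if c for v in (c[axis], c[axis + 2]))
    edges = []
    for v in values:
        if not edges or v - edges[-1] > tol:
            edges.append(v)
    return edges


def nearest_index(edges, value):
    """Index of the grid line closest to `value`."""
    return min(range(len(edges)), key=lambda i: abs(edges[i] - value))


def cell_span(cell, x_edges, y_edges):
    """Return (row, col, rowspan, colspan) for one (x0, top, x1, bottom) bbox."""
    c0, c1 = nearest_index(x_edges, cell[0]), nearest_index(x_edges, cell[2])
    r0, r1 = nearest_index(y_edges, cell[1]), nearest_index(y_edges, cell[3])
    return r0, c0, r1 - r0, c1 - c0


# Made-up coordinates: three columns of unequal width (50, 100, 50) and three rows.
cells = [
    (0, 0, 50, 10),     # row 0, col 0
    (50, 0, 200, 10),   # row 0, merged across cols 1-2
    (0, 10, 50, 30),    # merged across rows 1-2
    (50, 10, 150, 20),
    (150, 10, 200, 20),
    (50, 20, 150, 30),
    (150, 20, 200, 30),
]

x_edges = grid_edges(cells, axis=0)
y_edges = grid_edges(cells, axis=1)

for cell in cells:
    print(cell, cell_span(cell, x_edges, y_edges))
```

Counting grid lines does not rely on every column having the same width, so it should hold up in cases where dividing by the smallest cell size would overestimate a span.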


Update: Perhaps you could do something like this:

import pdfplumber

pdf = pdfplumber.open("Downloads/test_span.pdf")
page = pdf.pages[0]

table = page.find_tables()[0]

# size of smallest col and row for reference
col_unit = min(int(cell[2] - cell[0]) for cell in table.cells if cell)
row_unit = min(int(cell[3] - cell[1]) for cell in table.cells if cell)

cells = {}
# Process in reverse order so we can modify
for row_nr in range(len(table.rows) - 1, -1, -1):
    row = table.rows[row_nr]

    for col_nr in range(len(row.cells) - 1, -1, -1):
        cell = row.cells[col_nr]

        text = None

        if cell is not None:
            colspan = int(cell[2] - cell[0]) // col_unit
            rowspan = int(cell[3] - cell[1]) // row_unit

            text = page.crop(cell).extract_text()

            # forward_fill column
            for new_col in range(colspan):
                cells[row_nr, col_nr + new_col] = text

            # forward_fill row
            for new_row in range(rowspan):
                cells[row_nr + new_row, col_nr] = text

        cells[row_nr, col_nr] = text

num_rows = range(len(table.rows))
num_cols = range(len(table.rows[0].cells))

for row_nr in num_rows:
    row = [cells[row_nr, col_nr] for col_nr in num_cols]
    print(row)

Output:

['type', 'cost', 'cost', 'cost', 'cost', 'cost', 'cost', 'cost', 'cost']
['poc', 'before', 'before', 'before', 'before', 'after', 'after', 'after', 'after']
['poc', 'deploy', 'dev', 'test', 'support', 'deploy', 'dev', 'test', 'support']
['poc', '100', '200', '50', '50', '30', '200', '40', '20']
['uat', '200', '400', '555', '666', '201', '401', '557', '668']
['prod', '300', '600', '700', '900', '301', '601', '701', '901']
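
To get HTML-like output as originally requested, the span information can be written out as colspan/rowspan attributes instead of being forward-filled. This is a rough sketch that assumes the spans have already been worked out as (row, col, rowspan, colspan, text) tuples; the function name table_to_html and the tuple layout are made up for illustration and are not a pdfplumber API. The sample data is the first four rows of the table in this issue, with "poc" given a rowspan of 3 to match the height ratio of about 3.01 measured earlier.

```python
from html import escape


def table_to_html(spans):
    """Render (row, col, rowspan, colspan, text) tuples as an HTML table."""
    rows = {}
    # Sort by grid position so cells are emitted left to right, top to bottom.
    for row, col, rowspan, colspan, text in sorted(spans, key=lambda s: s[:2]):
        attrs = ""
        if rowspan > 1:
            attrs += f' rowspan="{rowspan}"'
        if colspan > 1:
            attrs += f' colspan="{colspan}"'
        rows.setdefault(row, []).append(f"<td{attrs}>{escape(text or '')}</td>")

    n_rows = max(row + rowspan for row, _, rowspan, _, _ in spans)
    lines = ["<table>"]
    for r in range(n_rows):
        lines.append("  <tr>" + "".join(rows.get(r, [])) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Only the top-left cell of each merged region is listed.
spans = [
    (0, 0, 1, 1, "type"), (0, 1, 1, 4, "cost"), (0, 5, 1, 4, "cost"),
    (1, 0, 3, 1, "poc"), (1, 1, 1, 4, "before"), (1, 5, 1, 4, "after"),
]
plain_rows = [
    (2, ["deploy", "dev", "test", "support"] * 2),
    (3, ["100", "200", "50", "50", "30", "200", "40", "20"]),
]
for r, values in plain_rows:
    spans += [(r, c + 1, 1, 1, v) for c, v in enumerate(values)]

markup = table_to_html(spans)
print(markup)
```

With pdfplumber, the text for each tuple could come from page.crop(cell).extract_text() as in the snippet above, skipping the None entries in each row since those positions are covered by a neighbouring merged cell.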