how to get colspan or rowspan info in the table?

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License


Describe the bug

test_span.pdf

In the table, row 0, columns 1 to 4 are merged into a single cell at row 0, column 1. When the table is extracted, columns 2, 3 and 4 come back as None:

[['type', 'cost', None, None, None, 'cost', None, None, None],
 ['poc', 'before', None, None, None, 'after', None, None, None],
 [None, 'deploy', 'dev', 'test', 'support', 'deploy', 'dev', 'test', 'support'],
 [None, '100', '200', '50', '50', '30', '200', '40', '20'],
 ['uat', '200', '400', '555', '666', '201', '401', '557', '668'],
 ['prod', '300', '600', '700', '900', '301', '601', '701', '901']]

Can pdfplumber produce output like HTML, which shows how the merged cells relate to each other?

type | cost (colspan 4) | cost (colspan 4)
poc (rowspan 2) | before (colspan 4) | after (colspan 4)
deploy | dev | test | support | deploy | dev | test | support

The HTML code is:

<html>
<head>
</head>
<table border="1" width="900">
  <tr>
    <td width="11%">type</td>
    <td width="11%" colspan="4">cost</td>
    <td width="11%" colspan="4">cost</td>
  </tr>
  <tr>
    <td width="11%" rowspan="2">poc</td>
    <td width="11%" colspan="4">before</td>
    <td width="11%" colspan="4">after</td>
  </tr>
  <tr>
    <td width="11%">deploy</td>
    <td width="11%">dev</td>
    <td width="11%">test</td>
    <td width="11%">support</td>
    <td width="11%">deploy</td>
    <td width="11%">dev</td>
    <td width="11%">test</td>
    <td width="11%">support</td>
  </tr>
</table>
</html>

import pdfplumber

pdf = pdfplumber.open("./data/test_span.pdf")
p0 = pdf.pages[0]
# keep_visible_lines is a user-defined filter function; its definition is not shown in this issue
p0 = p0.filter(keep_visible_lines)
im = p0.to_image()
im.debug_tablefinder()
table = p0.find_tables()
table[0].extract()

I know that you can use .find_tables() to get the table objects.

You can look at their .rows, .cells, etc.

>>> table = page.find_tables()[0]
>>> rows = table.rows
>>> rows
[<pdfplumber.table.Row at 0x1437ab400>,
 <pdfplumber.table.Row at 0x1375c0a00>,
 <pdfplumber.table.Row at 0x137216590>,
 <pdfplumber.table.Row at 0x137214e50>,
 <pdfplumber.table.Row at 0x137214430>,
 <pdfplumber.table.Row at 0x137216200>]

im.reset().draw_rect(rows[1].cells[0], stroke_width=5)

im.reset().draw_rect(rows[1].cells[1], stroke_width=5)

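# Each cell is a bounding box (x0, top, x1, bottom), so [-1] - [1] is its height.
# type rowspan: height of cell 0 divided by height of cell 1 in row 0
>>> (rows[0].cells[0][-1] - rows[0].cells[0][1]) / (rows[0].cells[1][-1] - rows[0].cells[1][1])
1.0
# poc rowspan: height of cell 0 divided by height of cell 1 in row 1
>>> (rows[1].cells[0][-1] - rows[1].cells[0][1]) / (rows[1].cells[1][-1] - rows[1].cells[1][1])
3.0103686635944618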

Not sure if pdfplumber attempts to use this information or not.
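The height ratio above works when all ordinary rows are the same height. A possibly more robust variant is to count how many grid lines a cell crosses. The following is only a sketch of that idea in plain Python: `grid_edges`, `nearest_index`, `cell_spans` and the `tol` tolerance are hypothetical helper names, not part of pdfplumber, and the cells are assumed to be `(x0, top, x1, bottom)` tuples like the ones in `table.cells`.

```python
def grid_edges(values, tol=1.0):
    """Collapse nearly equal coordinates into a sorted list of distinct edges.

    tol is an assumed tolerance in PDF points; tune it for the document.
    """
    edges = []
    for v in sorted(values):
        if not edges or v - edges[-1] > tol:
            edges.append(v)
    return edges


def nearest_index(edges, value):
    """Index of the edge closest to value."""
    return min(range(len(edges)), key=lambda i: abs(edges[i] - value))


def cell_spans(cells):
    """Map each bbox to {(row, col): (rowspan, colspan)} using grid positions."""
    cells = [c for c in cells if c is not None]
    xs = grid_edges([c[0] for c in cells] + [c[2] for c in cells])
    ys = grid_edges([c[1] for c in cells] + [c[3] for c in cells])

    spans = {}
    for x0, top, x1, bottom in cells:
        col, col_end = nearest_index(xs, x0), nearest_index(xs, x1)
        row, row_end = nearest_index(ys, top), nearest_index(ys, bottom)
        spans[(row, col)] = (row_end - row, col_end - col)
    return spans


# Toy grid: one header cell spanning two columns above two ordinary cells.
toy_cells = [(0, 0, 20, 10), None, (0, 10, 10, 20), (10, 10, 20, 20)]
toy_spans = cell_spans(toy_cells)
```

With a real page this would presumably be called as `cell_spans(table.cells)`. Because the spans come from edge positions rather than a size ratio, rows of unequal height should not be mistaken for merged cells.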

Update: Perhaps you could do something like this:

import pdfplumber

pdf = pdfplumber.open("Downloads/test_span.pdf")
page = pdf.pages[0]

table = page.find_tables()[0]

# Size of the smallest column and row, used as the unit for a span of 1.
# Each cell is a bounding box (x0, top, x1, bottom).
col_unit = min(int(cell[2] - cell[0]) for cell in table.cells if cell)
row_unit = min(int(cell[3] - cell[1]) for cell in table.cells if cell)

cells = {}
# Process in reverse order so that a spanning cell, visited after the cells
# it covers, overwrites their None placeholders
for row_nr in range(len(table.rows) - 1, -1, -1):
    row = table.rows[row_nr]

    for col_nr in range(len(row.cells) - 1, -1, -1):
        cell = row.cells[col_nr]

        text = None

        if cell is not None:
            colspan = int(cell[2] - cell[0]) // col_unit
            rowspan = int(cell[3] - cell[1]) // row_unit

            text = page.crop(cell).extract_text()

            # forward-fill the whole rectangle covered by the cell, so a cell
            # with both colspan > 1 and rowspan > 1 is filled completely
            for new_row in range(rowspan):
                for new_col in range(colspan):
                    cells[row_nr + new_row, col_nr + new_col] = text

        cells[row_nr, col_nr] = text

num_rows = range(len(table.rows))
num_cols = range(len(table.rows[0].cells))

for row_nr in num_rows:
    row = [cells[row_nr, col_nr] for col_nr in num_cols]
    print(row)

['type', 'cost', 'cost', 'cost', 'cost', 'cost', 'cost', 'cost', 'cost']
['poc', 'before', 'before', 'before', 'before', 'after', 'after', 'after', 'after']
['poc', 'deploy', 'dev', 'test', 'support', 'deploy', 'dev', 'test', 'support']
['poc', '100', '200', '50', '50', '30', '200', '40', '20']
['uat', '200', '400', '555', '666', '201', '401', '557', '668']
['prod', '300', '600', '700', '900', '301', '601', '701', '901']
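The forward-filled rows lose the span information that the question asks for. If the goal is HTML with `colspan` and `rowspan` attributes, the spans can be kept and written out instead. This is a minimal sketch, not something pdfplumber provides in this thread: `spans_to_html` is a hypothetical helper, and it assumes the spans have already been worked out (for example from the cell bounding boxes) and stored as `(row, col) -> (rowspan, colspan, text)` for the top-left position of each cell.

```python
from html import escape


def spans_to_html(cells, n_rows):
    """Render {(row, col): (rowspan, colspan, text)} as an HTML table string."""
    lines = ['<table border="1">']
    for r in range(n_rows):
        lines.append("  <tr>")
        # Emit the cells that start in this row, left to right.
        for key in sorted(k for k in cells if k[0] == r):
            rowspan, colspan, text = cells[key]
            attrs = ""
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            lines.append(f"    <td{attrs}>{escape(text or '')}</td>")
        lines.append("  </tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Header rows of the example table, with spans entered by hand for illustration.
example_cells = {
    (0, 0): (1, 1, "type"),
    (0, 1): (1, 4, "cost"),
    (0, 5): (1, 4, "cost"),
    (1, 0): (2, 1, "poc"),
    (1, 1): (1, 4, "before"),
    (1, 5): (1, 4, "after"),
    (2, 1): (1, 1, "deploy"),
    (2, 2): (1, 1, "dev"),
}
example_html = spans_to_html(example_cells, 3)
print(example_html)
```

Positions covered by a merged cell are simply absent from the dictionary, which matches how HTML expects spanned cells to be omitted from later rows.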


how to get colspan or rowspan info in the table? #927
