jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

How can I extract table without left and right vertical border correctly #307

Closed guo1017138 closed 3 years ago

guo1017138 commented 3 years ago

What are you trying to do?

I'm trying to extract a table from a PDF. The table has full horizontal lines but vertical lines only in the middle of the table; it has no left or right border. The table is not extracted correctly: two columns are missing.

What code are you using to do it?

With the default table settings, the first table is extracted correctly, but the second table is missing two columns. See the image below.

import pdfplumber

pdf = pdfplumber.open("../pdfs/badtable.pdf")
p0 = pdf.pages[0]
im = p0.to_image()  # `im` must be created before calling debug_tablefinder
im.debug_tablefinder()

image

With vertical_strategy set to "text", the two tables are recognized as a single table, which is worse.

import pdfplumber

pdf = pdfplumber.open("../pdfs/badtable.pdf")
p0 = pdf.pages[0]
im = p0.to_image()  # `im` must be created before calling debug_tablefinder
im.debug_tablefinder({"vertical_strategy": "text"})

image

PDF file

badtable.pdf

Expected behavior

The second table should not be missing its first and last columns.

Actual behavior

The first and last columns of the second table are missing.

samkit-jain commented 3 years ago

Hi @guo1017138, I appreciate your interest in the library. For the PDF you shared, I believe it would be best to crop the page and parse the two tables separately, because they have different structures. You can use the following code to do so.

import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]

# First crop the top 1/3rd of the page.
cropped = p.crop((0, 0, p.width, 0.33*float(p.height)))
tables = cropped.extract_tables(table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"})
for table in tables:
    for row in table:
        print(row)
# ['a', 'vTest a', None]
# ['dfdfdddddddddffljlllllllllllllllllf\nfdfffff', '6', 'b']
# ['', '33', None]
# ['4', '8', '9.57']
# [None, '6', None]
# ['6', 'Happy', None]
# ['90', None, None]

# Then crop the remaining page.
cropped = p.crop((0, 0.33*float(p.height), p.width, p.height))
tables = cropped.extract_tables(table_settings={"vertical_strategy": "text", "horizontal_strategy": "lines"})
for table in tables:
    for row in table:
        print(row)
# ['34', '5', 't', 'f']
# ['j', 'z', 'f', 's']
# ['f', '', 's', '3']
# ['l', 'y', 'i', '0']

Table debug output for the top crop image

Table debug output for the bottom crop image
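
The fixed 0.33 split above fits this particular page. As a rough sketch, the same idea can be written as a small helper that turns split points into crop boxes in the (x0, top, x1, bottom) order that crop expects. band_bboxes is a hypothetical name, not part of pdfplumber, and it assumes the page's coordinates start at (0, 0); the split fractions still have to be chosen by eye for each page.

```python
def band_bboxes(width, height, splits):
    """Return (x0, top, x1, bottom) boxes for full-width horizontal bands.

    `splits` are fractions of the page height (between 0 and 1) at which
    to cut. Hypothetical helper for illustration; not a pdfplumber function.
    Assumes the page origin is at (0, 0).
    """
    cuts = [0.0] + sorted(float(s) for s in splits) + [1.0]
    return [
        (0, top * float(height), float(width), bottom * float(height))
        for top, bottom in zip(cuts, cuts[1:])
    ]
```

Each box can then be passed to p.crop(...) and parsed with whichever vertical_strategy suits that band.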

guo1017138 commented 3 years ago

@samkit-jain, awesome, man! I really appreciate it. It solves my problem, and I made an enhancement based on your code. It works for any number of tables on the page (just change the code to loop over the tables).

import pdfplumber
from operator import itemgetter

pdf = pdfplumber.open("../pdfs/badtable.pdf")
p0 = pdf.pages[0]
tables = p0.find_tables()
table = tables[1]
croppage = p0.crop((0, table.bbox[1], p0.width, table.bbox[3]))
# Use the outermost ends of the horizontal lines as the missing left and right borders.
edgel = sorted(croppage.horizontal_edges, key=itemgetter("x0"))[0]
edger = sorted(croppage.horizontal_edges, key=itemgetter("x1"))[-1]
print(croppage.extract_table({"vertical_strategy": "lines", "explicit_vertical_lines": [edgel["x0"], edger["x1"]]}))
im = croppage.to_image()  # `im` must be created before calling debug_tablefinder
im.debug_tablefinder({"vertical_strategy": "lines", "explicit_vertical_lines": [edgel["x0"], edger["x1"]]})

[['34', '5', 't', 'f'], ['j', 'z', 'f', 's'], ['f', '', 's', '3'], ['l', 'y', 'i', '0']]

image
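
A minimal sketch of the "loop over the tables" variant mentioned above, assuming every table on the page has horizontal lines that span its full width. outer_x_bounds and extract_borderless_tables are hypothetical names introduced here for illustration; the pdfplumber calls are the same ones used in the snippet above, and the result has not been checked against PDFs other than the one in this thread.

```python
def outer_x_bounds(edges):
    """Return the leftmost x0 and rightmost x1 among a list of edge dicts."""
    return min(e["x0"] for e in edges), max(e["x1"] for e in edges)


def extract_borderless_tables(path, page_number=0):
    """Extract every table on one page, supplying the missing outer borders.

    Sketch only: assumes each table's horizontal lines span its full width.
    """
    import pdfplumber  # imported lazily so outer_x_bounds has no dependency

    results = []
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        for table in page.find_tables():
            # Crop a full-width band covering this table's vertical extent.
            band = page.crop((0, table.bbox[1], page.width, table.bbox[3]))
            edges = band.horizontal_edges
            if not edges:
                continue  # nothing to derive the borders from
            left, right = outer_x_bounds(edges)
            results.append(band.extract_table({
                "vertical_strategy": "lines",
                "explicit_vertical_lines": [left, right],
            }))
    return results
```

Calling extract_borderless_tables("../pdfs/badtable.pdf") would return one list of rows per detected table. For a table that already has outer borders, the explicit lines should coincide with the existing ones, though that is an assumption worth checking with debug_tablefinder.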