Closed guo1017138 closed 3 years ago
Hi @guo1017138, I appreciate your interest in the library. For the PDF you shared, I believe it would be best to crop the page and parse the tables separately, because they have different structures. You can use the following code to do so.
import pdfplumber
pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]
# First crop the top 1/3rd of the page.
cropped = p.crop((0, 0, p.width, 0.33*float(p.height)))
tables = cropped.extract_tables(table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"})
for table in tables:
    for row in table:
        print(row)
# ['a', 'vTest a', None]
# ['dfdfdddddddddffljlllllllllllllllllf\nfdfffff', '6', 'b']
# ['', '33', None]
# ['4', '8', '9.57']
# [None, '6', None]
# ['6', 'Happy', None]
# ['90', None, None]
# Then crop the remaining page.
cropped = p.crop((0, 0.33*float(p.height), p.width, p.height))
tables = cropped.extract_tables(table_settings={"vertical_strategy": "text", "horizontal_strategy": "lines"})
for table in tables:
    for row in table:
        print(row)
# ['34', '5', 't', 'f']
# ['j', 'z', 'f', 's']
# ['f', '', 's', '3']
# ['l', 'y', 'i', '0']
Table debug output for the top crop
Table debug output for the bottom crop
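The two crop boxes above follow the same pattern, so the arithmetic can be pulled into a small helper. This is only a sketch: `split_page_bboxes` is a hypothetical name, not part of pdfplumber, and it assumes the tables can be separated by a single horizontal cut at a known fraction of the page height.

```python
def split_page_bboxes(width, height, fraction):
    """Return (top_bbox, bottom_bbox) for a page cut at `fraction` of its height.

    Each bbox is (x0, top, x1, bottom), the order used in the crop calls above.
    Hypothetical helper, not a pdfplumber function.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    # float() mirrors the float(p.height) conversion in the snippet above.
    width, height = float(width), float(height)
    split_y = fraction * height
    top_bbox = (0.0, 0.0, width, split_y)
    bottom_bbox = (0.0, split_y, width, height)
    return top_bbox, bottom_bbox


# Possible usage with the page object `p` from the snippet above:
# top_bbox, bottom_bbox = split_page_bboxes(p.width, p.height, 0.33)
# top_tables = p.crop(top_bbox).extract_tables(
#     table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"})
# bottom_tables = p.crop(bottom_bbox).extract_tables(
#     table_settings={"vertical_strategy": "text", "horizontal_strategy": "lines"})
```

The cut position (0.33 here) still has to be chosen by looking at the page, so this only tidies up the bookkeeping.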
@samkit-jain, awesome, man! I really appreciate it. It solves my problem, and I made some enhancements based on your code. It works for any number of tables on the page (just change the code to loop over the tables).
import pdfplumber
from operator import itemgetter
pdf = pdfplumber.open("../pdfs/badtable.pdf")
p0 = pdf.pages[0]
tables = p0.find_tables()
table = tables[1]
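# Crop a full-width band covering the second table's vertical extent.
croppage = p0.crop((0, table.bbox[1], p0.width, table.bbox[3]))
# The leftmost and rightmost ends of the horizontal lines stand in for the missing borders.
edgel = sorted(croppage.horizontal_edges, key=itemgetter("x0"))[0]
edger = sorted(croppage.horizontal_edges, key=itemgetter("x1"))[-1]
print(croppage.extract_table({"vertical_strategy": "lines", "explicit_vertical_lines": [edgel["x0"], edger["x1"]]}))
# Render the crop so the table finder's result can be inspected visually.
im = croppage.to_image()
im.debug_tablefinder({"vertical_strategy": "lines", "explicit_vertical_lines": [edgel["x0"], edger["x1"]]})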
[['34', '5', 't', 'f'], ['j', 'z', 'f', 's'], ['f', '', 's', '3'], ['l', 'y', 'i', '0']]
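The "loop over the tables" variant mentioned above might look roughly like the sketch below. The function names `outer_vertical_lines` and `extract_tables_with_outer_borders` are hypothetical; the pdfplumber calls are the same ones used in the snippet above (`find_tables`, `crop`, `horizontal_edges`, `extract_table`). It assumes every table on the page has full-width horizontal lines, and it has not been run against badtable.pdf.

```python
def outer_vertical_lines(horizontal_edges):
    """Return [leftmost x0, rightmost x1] over a list of edge dicts.

    These two x positions act as the left and right borders a table lacks.
    Returns an empty list when there are no horizontal edges.
    """
    if not horizontal_edges:
        return []
    left = min(edge["x0"] for edge in horizontal_edges)
    right = max(edge["x1"] for edge in horizontal_edges)
    return [left, right]


def extract_tables_with_outer_borders(page):
    """Re-extract every table found on `page`, adding explicit outer borders.

    `page` is expected to be a pdfplumber page; this wrapper is a sketch,
    not part of the library.
    """
    results = []
    for table in page.find_tables():
        _, top, _, bottom = table.bbox
        # Full-width band over the table's vertical extent, as in the snippet above.
        band = page.crop((0, top, page.width, bottom))
        settings = {
            "vertical_strategy": "lines",
            "explicit_vertical_lines": outer_vertical_lines(band.horizontal_edges),
        }
        results.append(band.extract_table(settings))
    return results


# Possible usage:
# import pdfplumber
# with pdfplumber.open("../pdfs/badtable.pdf") as pdf:
#     for rows in extract_tables_with_outer_borders(pdf.pages[0]):
#         print(rows)
```

Using `min` and `max` gives the same two positions as sorting the edges and taking the first and last entries.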
What are you trying to do?
A clear and concise description of your goals. I'm trying to extract a table from a PDF. The table has full horizontal lines, but its vertical lines are only in the middle of the table; it has no left or right border. The table is not extracted correctly: two columns are missing.
What code are you using to do it?
Paste it here, or attach a Python file. With the default table settings, the first table is correct, but the second table is missing two columns. See image below.
With vertical_strategy="text", the two tables are recognized as one table, which is worse.
PDF file
Please attach the PDFs used in the code. badtable.pdf
If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.
Expected behavior
What did you expect the result should have been? The second table should not be missing its first and last columns.
Actual behavior
What actually happened, instead? The first and last columns are missing from the second table.