Pdfplumber not reading last row of table

ffreller commented 4 years ago

What are you trying to do?

I'm trying to extract tables from official gazettes of Pará, one of Brazil's states.

What code are you using to do it?

import pdfplumber

res2 = []  # rendered page images with the detected table cells outlined
with pdfplumber.open(path1 + 'DOE_2020-02-20.pdf') as pdf:
    page = pdf.pages[43]
    tb = page.find_tables()
    for t in tb:
        res2.append(page.to_image().draw_rects(t.cells))

res2[0]

PDF file

DOE_2020-02-20.pdf

Expected behavior

I expect it to correctly find all tables on the page and all of their rows.

Actual behavior

No matter what gazette (of Pará) and page I read, the code returns the tables without the last row. I've tried to change the settings, but I've had no success in overcoming this issue.

Screenshots

res[0] res[1]

Environment

pdfplumber version: 0.5.23
Python version: Python 3.8.5
OS: Windows 10

Additional context

I don't know if there is a fundamental problem in the way these documents are created. I would appreciate any suggestions on extracting their tables. Thank you.

samkit-jain commented 4 years ago

Hi @ffreller, thanks for your interest in the library. I can reproduce this with the following table settings:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines"
}

I can capture the correct tables with the following table settings:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "explicit",
    "explicit_horizontal_lines": page.curves + page.edges
}
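
For reference, here is a minimal sketch of how these settings could be passed to the page. It assumes the same file and page index as in the report above. The helper names explicit_line_settings and extract_page_tables are hypothetical and not part of pdfplumber; page.curves, page.edges and the table_settings argument of extract_tables are.

```python
def explicit_line_settings(page):
    """Build table settings that offer every curve and edge on the page
    as a candidate horizontal line (hypothetical helper)."""
    return {
        "vertical_strategy": "lines",
        "horizontal_strategy": "explicit",
        "explicit_horizontal_lines": page.curves + page.edges,
    }


def extract_page_tables(pdf_path, page_index):
    """Open the PDF and extract the tables of one page using the
    settings above (hypothetical helper)."""
    # Imported here so the settings helper can be used on its own.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        settings = explicit_line_settings(page)
        return page.extract_tables(table_settings=settings)
```

Calling extract_page_tables(path1 + 'DOE_2020-02-20.pdf', 43) should then return one list of rows per table. The same settings dict can also be passed to page.find_tables(table_settings=...) if you want to keep drawing the cells as in your snippet.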

Does this solve your issue?

ffreller commented 4 years ago

It does! Thank you very much!
