jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Pdfplumber not reading last row of table #273

Closed ffreller closed 4 years ago

ffreller commented 4 years ago

What are you trying to do?

I'm trying to extract tables from the official gazettes of Pará, one of Brazil's states.

What code are you using to do it?

import pdfplumber

res2 = []
with pdfplumber.open(path1 + 'DOE_2020-02-20.pdf') as pdf:
    page = pdf.pages[43]
    tb = page.find_tables()
    for t in tb:
        # Draw the detected cells of each table on a page image
        res2.append(page.to_image().draw_rects(t.cells))

res2[0]

PDF file

DOE_2020-02-20.pdf

Expected behavior

I expect it to find every table on the page, including all of each table's rows.

Actual behavior

No matter which gazette (of Pará) or page I read, the code returns the tables without their last row. I've tried changing the table settings, but I haven't managed to get around the issue.

Screenshots

res[0] image res[1] image

Environment

Additional context

I don't know if there is a fundamental problem in the way these documents are created. I would appreciate any suggestions on extracting their tables. Thank you.

samkit-jain commented 4 years ago

Hi @ffreller, thanks for your interest in the library. I can reproduce this with the following table settings:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines"
}

I can capture the correct tables with the following table settings instead:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "explicit",
    "explicit_horizontal_lines": page.curves + page.edges
}

image
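For anyone landing here later, this is a minimal sketch of how the second set of settings might be wired into a script. The helper names `explicit_horizontal_settings` and `extract_tables_with_curves` are made up for illustration and are not part of pdfplumber; the sketch only assumes that `page.curves`, `page.edges`, and `page.extract_tables(table_settings)` behave as documented in your installed version.

```python
def explicit_horizontal_settings(page):
    """Build table settings that use every curve and edge on the page
    as candidate horizontal lines (hypothetical helper, not a pdfplumber API)."""
    return {
        "vertical_strategy": "lines",
        "horizontal_strategy": "explicit",
        # Passing curves as well lets borders that are not plain lines
        # take part in row detection.
        "explicit_horizontal_lines": page.curves + page.edges,
    }


def extract_tables_with_curves(pdf_path, page_index):
    """Open a PDF and extract the tables on one page with the settings above
    (hypothetical helper; pdf_path and page_index are supplied by the caller)."""
    import pdfplumber  # imported here so the settings helper works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_tables(explicit_horizontal_settings(page))
```

With the file from this issue, the call would presumably look like `extract_tables_with_curves(path1 + 'DOE_2020-02-20.pdf', 43)`. The same settings dictionary can be passed to `page.find_tables(...)` if you want the cell geometry for drawing rather than the text.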

Does this solve your issue?

ffreller commented 4 years ago

It does! Thank you very much!