Closed by ffreller 4 years ago
Hi @ffreller, thanks for your interest in the library. I can indeed reproduce this with the following table settings:
{
"vertical_strategy": "lines",
"horizontal_strategy": "lines"
}
I can capture the correct tables by using the following table settings instead:
{
"vertical_strategy": "lines",
"horizontal_strategy": "explicit",
"explicit_horizontal_lines": page.curves + page.edges
}
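For reference, here is a minimal sketch of how those settings could be passed to pdfplumber. The helper names `build_table_settings` and `find_tables_with_explicit_lines` are hypothetical, made up for illustration; the sketch assumes the page's curves and edges together mark every row boundary, which may not hold for every gazette.

```python
def build_table_settings(page):
    """Build table settings that treat the page's curves and edges
    as explicit horizontal lines (hypothetical helper)."""
    return {
        "vertical_strategy": "lines",
        "horizontal_strategy": "explicit",
        # Both attributes are lists of line-like objects, so they concatenate.
        "explicit_horizontal_lines": page.curves + page.edges,
    }


def find_tables_with_explicit_lines(pdf_path, page_index):
    """Extract the tables on one page using the settings above
    (hypothetical helper)."""
    # Imported here so build_table_settings works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_tables(table_settings=build_table_settings(page))
```

Calling `find_tables_with_explicit_lines(path1 + 'DOE_2020-02-20.pdf', 43)` should then return the tables on the page from the report as lists of rows, assuming `path1` points to the folder holding the file.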
Does this solve your issue?
It does! Thank you very much!
What are you trying to do?
I'm trying to extract tables from official gazettes of Pará, one of Brazil's districts.
What code are you using to do it?
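import pdfplumber

res2 = []  # one annotated page image per detected table
with pdfplumber.open(path1 + 'DOE_2020-02-20.pdf') as pdf:
    page = pdf.pages[43]
    tb = page.find_tables()
    for t in tb:
        # Draw the cells of each detected table on an image of the page
        res2.append(page.to_image().draw_rects(t.cells))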
res2[0]
PDF file
DOE_2020-02-20.pdf
Expected behavior
I expect it would correctly find all tables in the page and all of its rows.
Actual behavior
No matter what gazette (of Pará) and page I read, the code returns the tables without the last row. I've tried to change the settings, but I've had no success in overcoming this issue.
Screenshots
res[0] res[1]
Environment
Additional context
I don't know if there is a fundamental problem in the way these documents are created. I would appreciate any suggestions on extracting their tables. Thank you.