amouro closed this issue 4 years ago
Hi @amouro, thanks for your interest in the library. When using the settings
{
    "vertical_strategy": "text",
    "horizontal_strategy": "lines"
}
and saving the image with .debug_tablefinder(), you'll notice that in the bottom portion the vertical and horizontal lines do not quite meet.
To get around this, you would need to use intersection_tolerance (here, its y-axis variant). Try using the following:
{
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
    "intersection_y_tolerance": 15,
}
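To make the effect of the tolerance concrete, here is a minimal sketch of the idea in plain Python. It is a simplified model for illustration, not pdfplumber's actual code: the function `edges_intersect`, the edge dictionaries, and the default value of 3 are all assumptions made for this example.

```python
def edges_intersect(v_edge, h_edge, x_tolerance=3, y_tolerance=3):
    """Return True if a vertical and a horizontal edge meet within the tolerances.

    Simplified model for illustration only (hypothetical helper).
    "top"/"bottom" grow downward, "x0"/"x1" grow rightward.
    """
    # The vertical edge must span the horizontal edge's y position,
    # give or take y_tolerance.
    reaches_vertically = (
        v_edge["top"] <= h_edge["top"] + y_tolerance
        and v_edge["bottom"] >= h_edge["top"] - y_tolerance
    )
    # The vertical edge's x position must fall within the horizontal
    # edge's extent, give or take x_tolerance.
    reaches_horizontally = (
        h_edge["x0"] - x_tolerance <= v_edge["x0"] <= h_edge["x1"] + x_tolerance
    )
    return reaches_vertically and reaches_horizontally


# Hypothetical edges: a text-derived column boundary that stops 10 points
# above the table's bottom rule.
column_edge = {"x0": 50, "top": 100, "bottom": 190}
bottom_rule = {"x0": 40, "x1": 300, "top": 200}

print(edges_intersect(column_edge, bottom_rule))                  # False
print(edges_intersect(column_edge, bottom_rule, y_tolerance=15))  # True
```

In this model the 10-point gap is too wide for a small tolerance, so no intersection (and therefore no cell) is formed at the bottom rule; a tolerance of 15 bridges it. With pdfplumber itself, the settings dictionary above can be passed to page.extract_table(table_settings).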
Thank you @samkit-jain, it does resolve my problem.
I am curious what code you used to create the .debug_tablefinder() image. I wasn't able to save an image showing any of the lines in your screenshot.
Here you go,
import pdfplumber

table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
    "intersection_y_tolerance": 15,
}

pdf = pdfplumber.open("file.pdf")
p0 = pdf.pages[0]
im = p0.to_image()
# Clear any earlier drawings, then overlay what the table finder detects
im.reset().debug_tablefinder(table_settings)
im.save("image.png", format="PNG")
What are you trying to do?
Extract data from a table with only horizontal lines.
What code are you using to do it?
PDF file
PDF
Expected behavior
All table content is extracted.
Actual behavior
The first and last rows in the bbox are not extracted.
Screenshots
Environment
Additional context
Tried other table settings with explicit horizontal lines. Still cannot extract the first and last rows.
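For reference, an explicit-lines attempt of the kind described might look like the sketch below. The helper names `explicit_line_settings` and `extract_with_explicit_lines`, the y-coordinates, and the choice to combine explicit lines with intersection_y_tolerance are assumptions for illustration; the settings actually tried in this issue are not shown in the thread.

```python
def explicit_line_settings(row_boundaries, y_tolerance=15):
    """Build a pdfplumber table-settings dict from row boundary y-coordinates.

    Hypothetical helper: row_boundaries are y positions (in page points)
    where horizontal rules should be assumed.
    """
    return {
        "vertical_strategy": "text",
        # Use only the lines listed below as horizontal edges.
        "horizontal_strategy": "explicit",
        "explicit_horizontal_lines": sorted(row_boundaries),
        # Let text-derived vertical edges that stop short still intersect.
        "intersection_y_tolerance": y_tolerance,
    }


def extract_with_explicit_lines(pdf_path, row_boundaries, page_number=0):
    """Extract one table from a page using the settings built above."""
    import pdfplumber  # imported here so the settings helper works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        return page.extract_table(explicit_line_settings(row_boundaries))


# Made-up boundaries; real values would come from inspecting the page.
settings = explicit_line_settings([300, 100, 200])
print(settings["explicit_horizontal_lines"])
```

If the first and last rows are still missing with explicit lines, the vertical edges found by the text strategy may simply not reach the outermost lines, which is the situation the intersection tolerance is meant to cover.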