jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

First and last rows of a table with only horizontal lines cannot be extracted #247

Closed amouro closed 4 years ago

amouro commented 4 years ago

What are you trying to do?

Extract data from a table that has only horizontal lines.

What code are you using to do it?

tables = page_crop.find_tables({
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
})
page_crop.to_image(resolution=200).debug_table(tables[0]).save("./deb.png", format="PNG")

PDF file

PDF

Expected behavior

All table content is extracted.

Actual behavior

The first and last rows in the bbox are not extracted.

Screenshots

table

table_crop

Environment

Additional context

I also tried table settings with explicit horizontal lines, but still could not extract the first and last rows.

tables = page_crop.find_tables({
    "vertical_strategy": "text",
    "horizontal_strategy": "explicit",
    "explicit_horizontal_lines": rects_to_edges(page.rects)
})

samkit-jain commented 4 years ago

Hi @amouro, thanks for your interest in the library. When using the settings

{
    "vertical_strategy": "text", 
    "horizontal_strategy": "lines"
}

and saving the image with .debug_tablefinder(), you'll notice that in the bottom portion the vertical lines and the horizontal lines do not quite line up.

image

To get around this, you would need to raise the intersection tolerance, here via intersection_y_tolerance. Try the following:

{
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
    "intersection_y_tolerance": 15,
}
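
To illustrate why a larger tolerance helps, here is a minimal sketch of an intersection check between one vertical and one horizontal edge. It is a simplified illustration only, not pdfplumber's actual implementation: the function name `intersects`, the edge dictionaries, and the coordinates are all hypothetical, and the default of 3 is only assumed to mirror the library's default intersection tolerance.

```python
# Simplified illustration of an intersection tolerance check.
# NOT pdfplumber's actual code; the function name, the edge layout,
# and the coordinates below are assumptions made for this example.

def intersects(v_edge, h_edge, x_tolerance=3, y_tolerance=3):
    """Return True if a vertical and a horizontal edge meet within tolerance."""
    # The horizontal edge's y position must fall inside the vertical
    # edge's span, widened by y_tolerance at both ends.
    within_y = (
        (v_edge["top"] - y_tolerance)
        <= h_edge["top"]
        <= (v_edge["bottom"] + y_tolerance)
    )
    # The vertical edge's x position must fall inside the horizontal
    # edge's span, widened by x_tolerance at both ends.
    within_x = (
        (h_edge["x0"] - x_tolerance)
        <= v_edge["x0"]
        <= (h_edge["x1"] + x_tolerance)
    )
    return within_x and within_y


# Hypothetical case: a text-derived vertical edge ends 10 points
# above the table's bottom rule.
v = {"x0": 100, "top": 50, "bottom": 190}
h_bottom = {"x0": 40, "x1": 500, "top": 200}

print(intersects(v, h_bottom, y_tolerance=3))   # False: the 10-point gap exceeds 3
print(intersects(v, h_bottom, y_tolerance=15))  # True: the gap is within 15
```

In this sketch, an edge pair that fails the check produces no intersection, so no cell can be built from it. That would be consistent with the first and last rows going missing until the tolerance is raised.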

amouro commented 4 years ago

Thank you @samkit-jain, it does resolve my problem.

I am curious what code you used to create the .debug_tablefinder() image. I wasn't able to save an image showing any of the lines in your screenshot.

samkit-jain commented 4 years ago

Here you go,

import pdfplumber

# Same settings as above, with a larger vertical intersection tolerance.
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
    "intersection_y_tolerance": 15,
}

pdf = pdfplumber.open("file.pdf")
p0 = pdf.pages[0]
im = p0.to_image()
# Overlay the table finder's debug output on the page image.
im.reset().debug_tablefinder(table_settings)
im.save("image.png", format="PNG")