jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Extracting tables with `explicit_vertical_lines` returns more columns than expected #820

Closed erip closed 1 year ago

erip commented 1 year ago

Describe the bug

If I extract a table with a length-N list of explicit_vertical_lines, i.e. [col1_left_line, col1_right_line, col2_right_line, ..., colNminus2_right_line, colNminus1_right_line], I often get many more than N-1 columns.

Code to reproduce the problem

#!/usr/bin/env python3

import pdfplumber
import pandas as pd

def extract_table(page):
    # determined by inspecting known-good extraction bboxes
    vertical_lines = [65.46, 186.62, 293.08, 413.10, 524.52]
    table = page.extract_table(table_settings={"explicit_vertical_lines": vertical_lines})
    df = pd.DataFrame(table)
    assert len(df.columns) == len(vertical_lines) - 1, f"Num extracted columns ({len(df.columns)}) != Expected num columns ({len(vertical_lines)-1})"

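if __name__ == "__main__":
    # open via a context manager so the file is closed when the loop finishes
    with pdfplumber.open('dico-karmous.pdf') as pdf:
        for page_no in range(2, 106):
            try:
                extract_table(pdf.pages[page_no])
            except Exception as e:
                # include the page index so mismatches can be traced to a page
                print(f"page index {page_no}: {e}")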

PDF file

Attached dico-karmous.pdf

Expected behavior

When the vertical lines are passed explicitly, the number of extracted columns should be deterministic.

What did you expect the result should have been?

The number of extracted columns should match for identically formatted tables across pages.

Actual behavior

The number of extracted columns differs from page to page, and it varies widely.

Additional context

I've tried to "collapse" columns which are completely empty as an artifact of extraction, but it seems like many of the troublemaking columns have some bleedover from previous columns (maybe from stray tabs in the underlying data or similar?); not exactly sure.
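
For reference, the "collapse" step described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used in the report; the helper name `drop_empty_columns` is hypothetical, and it assumes the table is the list-of-rows structure that `extract_table` returns. It only removes columns that are empty in every row, so a column containing bleedover text in even one row is kept, which is exactly the limitation mentioned above.

```python
def drop_empty_columns(table):
    """Remove columns whose cells are all None or whitespace-only.

    `table` is a list of rows (lists of cell values), as returned by
    pdfplumber's extract_table. Hypothetical helper, not pdfplumber API.
    """
    if not table:
        return []
    width = max(len(row) for row in table)
    # Keep a column index if at least one row has non-blank content there.
    keep = [
        i
        for i in range(width)
        if any(
            i < len(row) and row[i] is not None and str(row[i]).strip()
            for row in table
        )
    ]
    # Rebuild every row from the kept column indices only.
    return [[row[i] if i < len(row) else None for i in keep] for row in table]
```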

Environment

erip commented 1 year ago

As a side-note: if I remove the table settings, I actually get many more extractions (whose quality is currently unknown).

samkit-jain commented 1 year ago

Hi @erip, I appreciate your interest in the library. What you are seeing is expected, because the default value of vertical_strategy is "lines". Your table settings are therefore effectively

{
    "vertical_strategy": "lines",
    "explicit_vertical_lines": [65.46, 186.62, 293.08, 413.10, 524.52]
}

and apart from the explicit lines you have specified, the "lines" strategy also includes the vertical lines that pdfplumber itself identifies on the page. Because of this, the number of columns varies from page to page and the output looks non-deterministic.

To make it deterministic, set vertical_strategy to "explicit" so that only the lines you have specified are used.

{
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": [65.46, 186.62, 293.08, 413.10, 524.52]
}
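
Applied to the reproduction script, the settings above could be used roughly as in the sketch below. The helper names `explicit_table_settings` and `extract_fixed_columns` are hypothetical and not part of pdfplumber; `page` is assumed to be a pdfplumber page object, and only `vertical_strategy` is changed, so the horizontal strategy keeps its default.

```python
def explicit_table_settings(vertical_lines):
    """Build table settings that use only the caller-supplied vertical lines."""
    return {
        # "explicit" means: do not add vertical lines detected on the page
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": list(vertical_lines),
    }


def extract_fixed_columns(page, vertical_lines):
    """Extract a table from a pdfplumber page using fixed column edges.

    Hypothetical helper: raises ValueError if any row does not have
    len(vertical_lines) - 1 cells, so a mismatch is reported loudly.
    """
    settings = explicit_table_settings(vertical_lines)
    table = page.extract_table(table_settings=settings)
    if table is None:  # extract_table returns None when no table is found
        return []
    expected = len(vertical_lines) - 1
    for row in table:
        if len(row) != expected:
            raise ValueError(f"expected {expected} columns, got {len(row)}")
    return table
```

In the original script this would replace the body of `extract_table`, for example `extract_fixed_columns(pdf.pages[page_no], [65.46, 186.62, 293.08, 413.10, 524.52])`.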

erip commented 1 year ago

Love an easy fix. Thanks, @samkit-jain!

erip commented 1 year ago

@samkit-jain do you think it makes sense to log a warning if explicit_X_lines is not empty and the X_strategy isn't "explicit"? Maybe this could prevent these types of things in the future and it should be a pretty quick fix. If you think so, I can send a PR.
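
A rough sketch of the kind of check being proposed, written as a user-side helper rather than as pdfplumber code. The name `check_table_settings` is hypothetical, and the fallback to "lines" assumes the default strategy described earlier in this thread.

```python
import warnings


def check_table_settings(table_settings):
    """Warn when explicit lines are combined with a non-"explicit" strategy.

    Returns the list of axes ("vertical", "horizontal") that triggered a
    warning. Hypothetical helper, not part of pdfplumber.
    """
    mixed = []
    for axis in ("vertical", "horizontal"):
        explicit_lines = table_settings.get(f"explicit_{axis}_lines") or []
        # Assumption: an unset strategy falls back to the default, "lines".
        strategy = table_settings.get(f"{axis}_strategy", "lines")
        if explicit_lines and strategy != "explicit":
            mixed.append(axis)
            warnings.warn(
                f"explicit_{axis}_lines is set but {axis}_strategy is "
                f"{strategy!r}; detected lines will be used as well",
                stacklevel=2,
            )
    return mixed
```

Calling `check_table_settings({"explicit_vertical_lines": [65.46, 186.62]})` would warn about the vertical axis, while adding `"vertical_strategy": "explicit"` would silence it.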

samkit-jain commented 1 year ago

I can understand why this behaviour might be unintuitive. However, I have always considered explicit_... to be an extension of the X_strategy. The same is also mentioned in the documentation. If we gave out a warning, it would imply that this usage is not ideal, which is not the case here.