jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

extract_table omits the last row that is incomplete #1024

Closed chengtie closed 8 months ago

chengtie commented 8 months ago

I have the following PDF file, and I would like to extract its tables and rows. Note that the last row (numbered 6) on the first page is incomplete; the rest of it is on the second page. I notice that the extract_table function omits this last row of the first page.

Is there a way to extract that incomplete row from the first page as well, so that I can combine it with the first row of the second page?

SAMPLE PDF.pdf

Here is my code:

import os

import pdfplumber


def process_pdf(file_path, output_dir):
    # Extract file name (without extension) from the file path
    file_name_prefix = os.path.splitext(os.path.basename(file_path))[0]

    headers = []
    prev_row = None  # Variable to store the previous row
    with pdfplumber.open(file_path) as pdf:
        # Loop through all the pages in the PDF
        for page_number, page in enumerate(pdf.pages):
            # Extract the table from the current page
            table = page.extract_table()
            if table is None:  # extract_table() returns None when no table is found
                continue

            if not headers:  # If headers are not yet extracted
                for row in table:
                    if None not in row:  # Change this condition based on your criteria
                        headers = row
                        start_row = table.index(row) + 1
                        break

            # If headers are still not extracted (i.e., a valid header row wasn't found), skip this page
            if not headers:
                continue

            # Loop through the rows in the table starting from the appropriate row
            for row_number, row in enumerate(table[start_row:], start=start_row):
                print(row[0])
                print(row[1])
                print(row[2])
                if row[0] is None or row[0] == '':  # Check if the first cell is empty
                    if prev_row:  # If there is a previous row to concatenate with
                        row = [p if r is None else r for p, r in zip(prev_row, row)]  # Concatenate the rows
                else:
                    prev_row = row  # Save the current row as previous row

                process_row(row_number, row, headers, page_number, output_dir, file_name_prefix)

cmdlineluser commented 8 months ago

Hi @chengtie

As I understand it, the row is omitted because there is no "bottom" line closing off the final row.

What I've done in cases like this is to use the .find_*() methods to return the table objects.

We can then take the top and bottom edges of each row (from its bbox) to build a set of explicit_horizontal_lines to pass to the .extract_*() methods.

Using the position of the "lowest" character on the page is one approach for generating your own "bottom" line.

from operator import itemgetter

[...]

explicit_horizontal_lines = set().union(*((row.bbox[1], row.bbox[3]) for row in page.find_table().rows))
explicit_horizontal_lines.add(max(page.chars, key=itemgetter("bottom"))["bottom"])

table = page.extract_table(dict(
    explicit_horizontal_lines=explicit_horizontal_lines,
    horizontal_strategy="explicit",
))

# Last row
print(table[-1])
['6',
 'GEN',
 '10/23/2023',
 '9/18/2023',
 'QUIT CLAIM\nDEED',
 'LASTNAME, FIRST',
 'LASTNAME, FIRST',
 'Section:34\nTownship:80\nRange:21 Qtr\nSection:SE Qtr\nQtr Section:SE',
 '2023-\n00005028',
 '',
 '2023-\n00004572',
 '',
 '3']
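
Once the incomplete row is recovered, it still has to be combined with its continuation on the next page. Below is a minimal sketch of one way to do that, not something pdfplumber provides: merge_split_row is a hypothetical helper name, and it assumes both halves have the same number of columns and that you have already decided the first row of the second page is a continuation (for example, because its first cell is empty).

```python
from typing import List, Optional

Row = List[Optional[str]]


def merge_split_row(head: Row, tail: Row, sep: str = "\n") -> List[str]:
    """Join two halves of a table row split across a page break, cell by cell.

    Hypothetical helper: `head` is the last row of one page and `tail` is the
    first row of the next page. Empty or None cells contribute nothing.
    """
    if len(head) != len(tail):
        raise ValueError("row halves have different column counts")

    merged = []
    for top, bottom in zip(head, tail):
        # Keep only the parts that actually contain text.
        parts = [part for part in (top, bottom) if part]
        merged.append(sep.join(parts))
    return merged


# Made-up cell values, only to show the shape of the inputs.
last_row_page_1 = ["6", "QUIT CLAIM", None]
first_row_page_2 = ["", "DEED", "3"]
combined = merge_split_row(last_row_page_1, first_row_page_2)
```

The halves are joined with a newline to match how extract_table already represents line breaks inside a cell; pass sep=" " if you would rather have a single line of text.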