jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Two tables on the same page are extracted as one #871

Closed filips123 closed 1 year ago

filips123 commented 1 year ago

Describe the bug

When a PDF page contains multiple tables, they are extracted as a single table. This produces misaligned and empty columns.

Code to reproduce the problem

import pdfplumber
import pandas as pd

file = pdfplumber.open("example document.pdf")

page = file.pages[0]
tables = page.extract_tables()
# The first row of the first extracted table holds the column headers.
df = pd.DataFrame(tables[0][1:], columns=tables[0][0])

PDF file

example document.pdf

This is an example document with redacted information. Real documents have a similar structure, but with more pages, sometimes with multiple and different tables.

Expected behavior

There should be two tables:

The text between the two tables (MENJAVA UR) should not be part of either table.

Actual behavior


The tables are extracted as one. This produces many extra, misaligned columns filled with None:

actual table

Screenshots

table finder debug

Environment

Additional context

The PDFs that I need to parse all share a similar layout, with around 5 different table formats. However, not all files use all of those formats, and the tables do not all have the same height, so it is impossible to simply crop the page to fixed regions before extracting tables.
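
Since fixed crop boxes are not an option, one possible workaround is to split the merged result after extraction. The following is only a minimal sketch, not part of pdfplumber: the helper names (`is_separator`, `drop_empty_columns`, `split_merged_table`) are hypothetical, and it assumes that the text between the tables (such as MENJAVA UR) ends up in a row where at most one cell is filled. A genuine data row with a single filled cell would also trigger a split, so the heuristic may need adjusting for real documents.

```python
from typing import List, Optional

Row = List[Optional[str]]


def is_empty(cell: Optional[str]) -> bool:
    """Treat None and blank strings as empty cells."""
    return cell is None or cell.strip() == ""


def is_separator(row: Row) -> bool:
    """Assumption: a row with at most one filled cell separates two tables."""
    return sum(not is_empty(cell) for cell in row) <= 1


def drop_empty_columns(rows: List[Row]) -> List[Row]:
    """Remove columns that are empty in every row of one table."""
    width = max(len(row) for row in rows)
    keep = [
        i for i in range(width)
        if any(i < len(row) and not is_empty(row[i]) for row in rows)
    ]
    return [[row[i] if i < len(row) else None for i in keep] for row in rows]


def split_merged_table(rows: List[Row]) -> List[List[Row]]:
    """Split one merged table into several at separator rows."""
    groups: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        if is_separator(row):
            # Close the table collected so far; the separator row is discarded.
            if current:
                groups.append(current)
                current = []
        else:
            current.append(row)
    if current:
        groups.append(current)
    return [drop_empty_columns(group) for group in groups]
```

With the reproduction code above, this would be called as `split_merged_table(tables[0])`, and each resulting list of rows could then be passed to `pd.DataFrame` separately.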

Maybe this can be solved by adding an option to only extract tables where there are visible lines around cells?
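
For reference, pdfplumber's table settings already include a `"lines_strict"` strategy, which uses the page's graphical lines as cell borders but ignores the sides of rectangle objects. The sketch below shows how it could be tried together with `page.find_tables()`. It has not been run against the example document, so whether it actually separates the two tables there is an open question; the function name `extract_ruled_tables` is hypothetical.

```python
# Settings that restrict table detection to explicitly drawn lines.
STRICT_LINE_SETTINGS = {
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines_strict",
}


def extract_ruled_tables(page, table_settings=None):
    """Return the rows of each table found on a pdfplumber page, top to bottom.

    `page` is expected to be a pdfplumber Page; this helper is a hypothetical
    wrapper, not part of the library.
    """
    settings = STRICT_LINE_SETTINGS if table_settings is None else table_settings
    found = page.find_tables(table_settings=settings)
    # Table.bbox is (x0, top, x1, bottom); sort by the top edge.
    ordered = sorted(found, key=lambda table: table.bbox[1])
    return [table.extract() for table in ordered]


# Possible usage (requires pdfplumber and the PDF file):
#
#     import pdfplumber
#     with pdfplumber.open("example document.pdf") as pdf:
#         for rows in extract_ruled_tables(pdf.pages[0]):
#             print(rows[0])
```

If the tables in a document are drawn with rectangles rather than lines, `"lines_strict"` may find nothing at all, in which case the default `"lines"` strategy combined with post-processing is probably the better starting point.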