jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.1k stars 625 forks

Multiple Tables of banded shaded rows with varying number of lines in row #882

Closed ramakrse closed 1 year ago

ramakrse commented 1 year ago

Describe the bug

The PDF has multiple tables throughout the document. The tables have shaded (banded) rows, and the number of text lines per row varies.

Code to reproduce the problem

Load the PDF file with pdfplumber

import pdfplumber

plumber_file = pdfplumber.open(pdf_file)
pdf_page = plumber_file.pages[29 - 1]  #127 #67
im = pdf_page.to_image()

Table settings.

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "intersection_tolerance": 32,
}
im.debug_tablefinder(ts)

PDF file

Using the publicly available PDF: https://www.mtu-solutions.com/content/dam/mtu/technical-information/operating-instructions/diesel/mtu-series-1600/marine/MS15029_01E.pdf/_jcr_content/renditions/original./MS15029_01E.pdf

Expected behavior

Each table on a page should be detected as a separate table. The page used here contains two tables.

Actual behavior

When I increase intersection_tolerance to handle rows with more lines, pdfplumber detects only one table, and the space between the two tables is also treated as a row. I am not able to get the two tables detected separately.
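
One workaround that might be worth trying is to split the page into vertical regions first and run the table finder on each region separately. The sketch below is only an illustration under assumptions: it assumes the shaded bands show up as rect-like dicts with "x0", "x1", "top" and "bottom" keys (the keys pdfplumber uses for page.rects), and that the space between two tables is larger than any gap between bands inside one table. The helper name group_rects_into_tables and the max_gap parameter are hypothetical and not part of pdfplumber.

```python
def group_rects_into_tables(rects, max_gap=10.0):
    """Cluster rect-like dicts into vertically contiguous groups.

    Hypothetical helper, not a pdfplumber API. Each rect needs
    "x0", "x1", "top" and "bottom" keys. Returns one bounding box
    (x0, top, x1, bottom) per group, ordered from top to bottom.
    """
    groups = []
    for r in sorted(rects, key=lambda r: r["top"]):
        # Extend the current group if this rect starts within max_gap
        # of the group's bottom edge; otherwise start a new group.
        if groups and r["top"] - groups[-1]["bottom"] <= max_gap:
            g = groups[-1]
            g["x0"] = min(g["x0"], r["x0"])
            g["x1"] = max(g["x1"], r["x1"])
            g["bottom"] = max(g["bottom"], r["bottom"])
        else:
            groups.append(
                {"x0": r["x0"], "x1": r["x1"], "top": r["top"], "bottom": r["bottom"]}
            )
    return [(g["x0"], g["top"], g["x1"], g["bottom"]) for g in groups]


# Made-up band coordinates: three rows, a 50-point gap, then two more rows.
rects = [
    {"x0": 50, "x1": 500, "top": 100, "bottom": 120},
    {"x0": 50, "x1": 500, "top": 120, "bottom": 150},
    {"x0": 50, "x1": 500, "top": 150, "bottom": 170},
    {"x0": 50, "x1": 500, "top": 220, "bottom": 240},
    {"x0": 50, "x1": 500, "top": 240, "bottom": 280},
]
bboxes = group_rects_into_tables(rects, max_gap=10)
print(bboxes)  # [(50, 100, 500, 170), (50, 220, 500, 280)]
```

Each resulting bounding box could then be passed to pdf_page.crop(bbox), and extract_table(ts) called on the cropped page, so that the gap between the tables never reaches the table finder. Whether this works for this PDF depends on how the bands are actually drawn, so max_gap would need tuning per document.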

Screenshots

image
