jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.5k stars 658 forks

Image in PDF is recognized as Table #881

Closed ramakrse closed 1 year ago

ramakrse commented 1 year ago

Describe the bug

The PDF contains an image, a table, and text. When extracting the table, pdfplumber also recognizes the image as a table.

Code to reproduce the problem

Load the PDF file with pdfplumber

import pdfplumber

plumber_file = pdfplumber.open(pdf_file)
pdf_page = plumber_file.pages[67 - 1]  # page 67; pages are zero-indexed
im = pdf_page.to_image()

Table settings.

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "intersection_tolerance": 32,
}
im.debug_tablefinder(ts)

PDF file

Please attach any PDFs necessary to reproduce the problem.

https://www.mtu-solutions.com/content/dam/mtu/technical-information/operating-instructions/diesel/mtu-series-1600/marine/MS15029_01E.pdf/_jcr_content/renditions/original./MS15029_01E.pdf

See page 67.

Expected behavior

It should not take image as Table.

Actual behavior

It recognizes the image as a table, in addition to the actual table. It also treats the gap between the image and the table as a row, which is incorrect.

Screenshots

image

Environment

Python 3.10.11, pdfplumber (latest), Google Colab

Additional context

In this document, page 127 has an image inside the table. It is also more complex: the number of lines per row varies. Any suggestions on how to handle that?

jsvine commented 1 year ago

Hi @ramakrse, there are a couple of things going on here. The first and most pertinent is that this PDF has actual lines drawn around the image (rather than pdfplumber incorrectly using the image itself as part of the table-finding algorithm):

page = pdf.pages[66]
im = page.to_image()
im.draw_lines(page.lines, stroke_width=5)

image
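One way to confirm which drawn lines belong to the image frame and which to the table is to tally the line objects by their `linewidth` attribute. The following is a minimal sketch, not something taken from the thread: `linewidth_counts` is a hypothetical helper, and the sample dictionaries are made-up stand-ins for entries of `page.lines`, so the real values on this page will differ.

```python
from collections import Counter


def linewidth_counts(objects):
    """Count objects by linewidth, treating a missing or None value as 0."""
    return Counter(round(obj.get("linewidth") or 0, 2) for obj in objects)


# Made-up stand-ins for entries of page.lines; real values will differ.
sample_lines = [
    {"object_type": "line", "linewidth": 0},
    {"object_type": "line", "linewidth": 0},
    {"object_type": "line", "linewidth": 0.5},
    {"object_type": "line", "linewidth": 0.5},
    {"object_type": "line", "linewidth": 0.5},
    {"object_type": "line", "linewidth": 0.5},
]

counts = linewidth_counts(sample_lines)
print(counts)

# With a real page, the same helper would be called as:
#   linewidth_counts(page.lines)
```

If the tally splits cleanly into two groups, the width separating them is a candidate threshold for filtering.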

Second, those lines around the image are getting incorporated into the table because of the 'intersection_tolerance': 32 in the settings.
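To illustrate why a large tolerance pulls in nearby lines, here is a simplified sketch of an intersection test between a vertical and a horizontal edge. It is an approximation for illustration, not pdfplumber's actual implementation: `edges_intersect` is a hypothetical function and the coordinates are made up. It compares a tolerance of 3 (assumed here to be the library's default) with the 32 used in the settings.

```python
def edges_intersect(v, h, tolerance):
    """Return True if vertical edge v and horizontal edge h meet within tolerance."""
    return (
        v["top"] <= h["top"] + tolerance
        and v["bottom"] >= h["top"] - tolerance
        and h["x0"] - tolerance <= v["x0"] <= h["x1"] + tolerance
    )


# Made-up coordinates: the vertical edge ends 20 points above the horizontal edge.
v_edge = {"x0": 100, "top": 50, "bottom": 200}
h_edge = {"x0": 100, "x1": 400, "top": 220}

with_small_tolerance = edges_intersect(v_edge, h_edge, tolerance=3)
with_large_tolerance = edges_intersect(v_edge, h_edge, tolerance=32)
print(with_small_tolerance, with_large_tolerance)  # False True
```

In this toy case the 20-point gap is ignored once the tolerance reaches 32, which is the same effect that joins the image frame to the table.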

I'm assuming that you've used that setting because of the gap between the literal rects in the table. In that case, in order to extract the table correctly, you'll want first to filter/crop out the lines that surround the image. Depending on your broader use-case, different approaches for that may work better than others. Here's what the filtering might look like, taking advantage of the difference in the linewidth attributes for the image-surrounding (vs. table-relevant) lines:

filtered = page.filter(lambda obj: not (
  obj["object_type"] == "line" 
  and obj["linewidth"] > 0
))
filtered.extract_table(ts)
[['Item', 'Type of shielding', 'Comments'],
 ['1', 'Shield A5', 'At union of return line to fuel\npump (HP stage)'],
 ['2', 'Shield A5', 'At union of supply line from fuel\npump (LP stage)'],
 ['3', 'Shield A5', 'At union of supply line to fuel filter'],
 ['4', 'Shield A5', 'At union of supply line from fuel\nfilter']]
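The cropping alternative mentioned above could be sketched along these lines. This is an untested assumption rather than a confirmed fix for this PDF: it supposes the table cells are drawn as `page.rects` and the image frame lies outside their combined bounding box. `union_bbox` is a hypothetical helper and the sample rectangles are made up.

```python
def union_bbox(objects):
    """Return (x0, top, x1, bottom) enclosing all of the given objects."""
    return (
        min(obj["x0"] for obj in objects),
        min(obj["top"] for obj in objects),
        max(obj["x1"] for obj in objects),
        max(obj["bottom"] for obj in objects),
    )


# Made-up stand-ins for entries of page.rects; real values will differ.
sample_rects = [
    {"x0": 70, "top": 400, "x1": 200, "bottom": 420},
    {"x0": 205, "top": 400, "x1": 520, "bottom": 420},
    {"x0": 70, "top": 425, "x1": 200, "bottom": 460},
]

table_bbox = union_bbox(sample_rects)
print(table_bbox)  # (70, 400, 520, 460)

# With a real page (assumption: the table is drawn with page.rects):
#   cropped = page.crop(union_bbox(page.rects))
#   cropped.extract_table(ts)
```

Cropping keeps the decision geometric, which may help on pages where the image frame and the table lines share the same linewidth.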
ramakrse commented 1 year ago

Simply super, @jsvine. It is working fine. I will try it on page 127 as well.