Closed ramakrse closed 1 year ago
Hi @ramakrse, there are a couple of things going on here. The first and most pertinent is that this PDF has actual lines drawn around the image (rather than pdfplumber
incorrectly using the image itself as part of the table-finding algorithm):
page = pdf.pages[66]
im = page.to_image()
im.draw_lines(page.lines, stroke_width=5)
Second, those lines around the image are getting incorporated into the table because of the 'intersection_tolerance': 32 in the settings.
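To see which lines belong to the image border and which to the table, it can help to tally the line objects by their linewidth. The following is a minimal sketch: `tally_linewidths` is a hypothetical helper (not part of pdfplumber), and the sample dicts only stand in for `page.lines` with made-up linewidth values.

```python
from collections import Counter

def tally_linewidths(lines):
    """Count line objects by linewidth (rounded to 2 decimals)."""
    return Counter(round(line["linewidth"], 2) for line in lines)

# Hypothetical stand-ins for the dicts in page.lines; the values are
# illustrative only and are not taken from the PDF in this issue.
sample_lines = [
    {"object_type": "line", "linewidth": 0},
    {"object_type": "line", "linewidth": 0},
    {"object_type": "line", "linewidth": 0.5},
]

print(tally_linewidths(sample_lines))
```

With a real page, the call would be `tally_linewidths(page.lines)`; if the counts fall into two distinct groups, that suggests a threshold for filtering.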
I'm assuming that you've used that setting because of the gap between the literal rects in the table. In that case, in order to extract the table correctly, you'll want first to filter/crop out the lines that surround the image. Depending on your broader use-case, different approaches for that may work better than others. Here's what the filtering might look like, taking advantage of the difference in the linewidth attributes for the image-surrounding (vs. table-relevant) lines:
filtered = page.filter(lambda obj: not (
    obj["object_type"] == "line"
    and obj["linewidth"] > 0
))
filtered.extract_table(ts)
[['Item', 'Type of shielding', 'Comments'],
['1', 'Shield A5', 'At union of return line to fuel\npump (HP stage)'],
['2', 'Shield A5', 'At union of supply line from fuel\npump (LP stage)'],
['3', 'Shield A5', 'At union of supply line to fuel filter'],
['4', 'Shield A5', 'At union of supply line from fuel\nfilter']]
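If the same filter is needed on several pages, the lambda above could be factored into a named predicate with an adjustable threshold. This is only a sketch: `drop_lines_wider_than` is a hypothetical helper name, and the right threshold for other pages is an assumption that would need checking against each page's lines.

```python
def drop_lines_wider_than(threshold=0):
    """Build a predicate for page.filter() that rejects line objects
    whose linewidth exceeds `threshold` and keeps everything else."""
    def keep(obj):
        is_wide_line = (
            obj["object_type"] == "line"
            and obj["linewidth"] > threshold
        )
        return not is_wide_line
    return keep

# Hypothetical object dicts standing in for pdfplumber page objects.
keep = drop_lines_wider_than(0)
print(keep({"object_type": "line", "linewidth": 0}))    # zero-width line is kept
print(keep({"object_type": "line", "linewidth": 1.0}))  # wider line is dropped
print(keep({"object_type": "char"}))                    # non-line objects are kept
```

Usage would then be `page.filter(drop_lines_wider_than(0)).extract_table(ts)`, which matches the filter shown above when the threshold is 0.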
Simply super, @jsvine. It is working fine. I will try it on page 127 as well.
Describe the bug
The PDF contains an image, a table, and text. When extracting the table, pdfplumber also recognizes the image as a table.
Code to reproduce the problem
Load the PDF file with pdfplumber
plumber_file = pdfplumber.open(pdf_file)
pdf_page = plumber_file.pages[67-1]
im = pdf_page.to_image()
Table settings.
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "intersection_tolerance": 32,
}
im.debug_tablefinder(ts)
PDF file
Please attach any PDFs necessary to reproduce the problem.
https://www.mtu-solutions.com/content/dam/mtu/technical-information/operating-instructions/diesel/mtu-series-1600/marine/MS15029_01E.pdf/_jcr_content/renditions/original./MS15029_01E.pdf
Refer to page 67.
Expected behavior
It should not treat the image as a table.
Actual behavior
It recognizes the image as a table, in addition to the actual table. It also treats the gap between the image and the table as a row, which is not correct.
Screenshots
Environment
Python 3.10.11, pdfplumber (latest), Google Colab
Additional context
In the same document, page 127 has an image inside the table. It is also more complex, because the number of lines per row varies. Any suggestions on how to handle that case?