jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

[DISCUSSION] Handling out-of-page rect objects #267

Closed: samkit-jain closed this issue 4 years ago

samkit-jain commented 4 years ago

Prologue: This may read like a story, and it raises a lot of open-ended questions that may be worth discussing.


v0.25.3 Table settings in use:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines"
}

PDF


Output of .debug_tablefinder() on full page: image

One thing I noticed was that the outer rectangle was missing connection dots at the bottom, but I ignored it since the table was correctly captured.

If I crop the page from the top with (0, 120, page.width, page.height) and then run .debug_tablefinder(), the output is: image

This time the output is a bit different because the content outside the table is also captured.

How did the left column get selected in the table extraction after cropping the page, even though it is well outside the intersection tolerance?

While trying to find the cause, I noticed that this time the outer rectangle had connection dots at both the top and the bottom. The behaviour persisted no matter how much of the top or bottom I cropped. The red lines at the top and bottom edges of the output imply a rect edge there, but the top one should not exist: the crop cut off the top portion of the outer rectangle, so there should be no line at the crop edge. And why does the red line at the bottom edge appear only after cropping? After some debugging, I found that certain objects had negative coordinate values. Here's a screengrab of the PyCharm debugger (notice the negative value in "y0"): image

The negative "y0" value meant that the bottom red line should appear even with a zero crop, (0, 0, page.width, page.height), which the following .debug_tablefinder() output confirms: image

Result of drawing those negative value rect objects: image

The negative values are not caused by a bug in pdfplumber or pdfminer.six; they come from the PDF itself. To verify, I ran pdftk input.pdf output uncompressed.pdf uncompress, opened uncompressed.pdf in a text editor, and found 8 negative coordinate values in it. Not sure what purpose they serve ¯\_(ツ)_/¯

Should we treat negative coordinate values differently? Adding

if any(obj[key] < 0 for key in ["x0", "y0", "x1", "y1"]):
    return None

at https://github.com/jsvine/pdfplumber/blob/3c5041a20b142adb7505d845790fb1ba17132de0/pdfplumber/utils.py#L381 drops them, but it only takes effect when the page is cropped.

print(len(page.objects["rect"]))
# 230
orig = page.objects["rect"]
page = page.crop((0, 0, page.width, page.height))
print(len(page.objects["rect"]))
# 222

.debug_tablefinder() output: image
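The same check can also be tried without patching utils.py. The sketch below is only an illustration on plain dicts: has_negative_coord and drop_negative_rects are hypothetical helper names, and the sample rects are made up, not taken from the PDF in question.

```python
def has_negative_coord(obj, keys=("x0", "y0", "x1", "y1")):
    """Return True if any of the listed coordinates of obj is below zero."""
    return any(obj[key] < 0 for key in keys)


def drop_negative_rects(rects):
    """Keep only the rects whose coordinates are all non-negative."""
    return [r for r in rects if not has_negative_coord(r)]


# Hypothetical rect dicts, shaped like pdfplumber objects but trimmed down.
rects = [
    {"x0": 10, "y0": 20, "x1": 100, "y1": 200},
    {"x0": 10, "y0": -5, "x1": 100, "y1": 200},  # pokes out below the page
]
kept = drop_negative_rects(rects)
print(len(rects), len(kept))
# 2 1
```

On a real page, the predicate could probably be passed to page.filter(lambda obj: not has_negative_coord(obj)), assuming the pdfplumber version in use provides .filter(); note that this would apply to every object type, not just rects.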

Or, if the rects are to be kept because they hold information that the library ideally should not remove, changes to get_bbox_overlap() or clip_obj() might be required. I also created a dummy PDF with a similar layout; in that one, if I crop the page and run the table finder, the left column is not picked up (as expected).

Uncropped: image

Cropped: image
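To make the clipping question concrete, here is a simplified sketch of intersecting a rect with a crop box. It is not pdfplumber's actual get_bbox_overlap() or clip_obj(); the function names, the (x0, top, x1, bottom) tuple layout, and the numbers are assumptions chosen for illustration. It shows how a rect that extends past the page ends up with sides lying exactly on the crop boundary.

```python
def bbox_overlap(a, b):
    """Intersect two (x0, top, x1, bottom) boxes; return None if they miss."""
    x0, top = max(a[0], b[0]), max(a[1], b[1])
    x1, bottom = min(a[2], b[2]), min(a[3], b[3])
    if x1 < x0 or bottom < top:
        return None
    return (x0, top, x1, bottom)


PAGE_HEIGHT = 800  # hypothetical page size: 600 x 800 points

# A rect whose bottom lies 5 points below the page.
rect = (50, 100, 550, 805)
y0 = PAGE_HEIGHT - rect[3]  # bottom-up coordinate, negative here

crop = (0, 120, 600, PAGE_HEIGHT)
clipped = bbox_overlap(rect, crop)

print(y0)       # -5
print(clipped)  # (50, 120, 550, 800)
```

In this sketch the clipped rect's top sits at the crop line (120) and its bottom at the page edge (800), which is where the extra red lines would be drawn.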


Code

import pdfplumber

pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines"
}

# 1. Get the first screenshot
im = page.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# 2. Get the second screenshot
cropped = page.crop((0, 120, page.width, page.height))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# 3. Third screenshot from PyCharm

# 4. Get the fourth screenshot
cropped = page.crop((0, 0, page.width, page.height))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# 5. Get the fifth screenshot
# Collect the rects that have at least one negative coordinate
orig = page.objects["rect"]
new = [
    obj for obj in orig
    if any(obj[key] < 0 for key in ["x0", "y0", "x1", "y1"])
]

im = page.to_image(resolution=150)
im.draw_rects(new)
im.save("out.png", format="PNG")

# Now add:
# if any(obj[key] < 0 for key in ["x0", "y0", "x1", "y1"]):
#     return None
# under ``clip_obj()`` in ``utils.py``

# 6. Get the sixth screenshot
cropped = page.crop((0, 0, page.width, page.height))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

samkit-jain commented 4 years ago

Just realised that the dummy PDF I shared above is not a good replication example, because it is built from line objects while the original PDF uses rect objects.

samkit-jain commented 4 years ago

After working more on this issue and trying out a bunch of things, I have come to understand that the results are the expected behaviour; I was confusing the behaviour of rect objects with that of line objects.
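One way to picture the rect versus line distinction, as a rough sketch rather than pdfplumber's own code: a rect contributes four edges, so a clipped rect still has a horizontal edge at the crop boundary, whereas a clipped vertical line only gets shorter. The helper names and coordinates below are hypothetical.

```python
def clip(box, crop):
    """Clip an (x0, top, x1, bottom) box to a crop box (assumes they overlap)."""
    return (max(box[0], crop[0]), max(box[1], crop[1]),
            min(box[2], crop[2]), min(box[3], crop[3]))


def rect_to_edges(rect):
    """Split a rect into two horizontal and two vertical edges."""
    x0, top, x1, bottom = rect
    return [
        ("h", x0, top, x1, top),
        ("h", x0, bottom, x1, bottom),
        ("v", x0, top, x0, bottom),
        ("v", x1, top, x1, bottom),
    ]


crop = (0, 120, 600, 800)

# A rect whose top is cut off by the crop still yields a top edge at 120.
rect_edges = rect_to_edges(clip((50, 100, 550, 700), crop))
horizontal_from_rect = [e for e in rect_edges if e[0] == "h"]

# A vertical line cut by the same crop is just shorter; no new horizontal edge.
clipped_line = clip((50, 100, 50, 700), crop)

print(horizontal_from_rect[0])  # ('h', 50, 120, 550, 120)
print(clipped_line)             # (50, 120, 50, 700)
```

Under this reading, a dummy PDF drawn with line objects would not show the extra horizontal edges after cropping, while one drawn with rect objects would.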