jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks source link

Fewer horizontal lines when using text strategy #265

Closed samkit-jain closed 3 years ago

samkit-jain commented 4 years ago

Describe the bug

In https://github.com/jsvine/pdfplumber/commit/d224202bdf488cf83141415282b8727913e9c989, the logic for finding horizontal and vertical lines connecting n number of words was simplified. When finding the horizontal lines, the logic was updated to keep the "top" of all line rects and the bottom of only the last line rect. This is causing problems with table detection as the final number of horizontal lines has reduced and when the gap between 2 rows is big, it can provide inconsistent results when used together with snap_tolerance.

The height of the line is also not in sync with the height of text it possesses.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open("issue-67-example.pdf")
p = pdf.pages[20]
ts = {"vertical_strategy": "lines", "horizontal_strategy": "text"}
im = p.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

tables = p.extract_tables(table_settings=ts)
for table in tables:
    for row in table:
        print(row)

PDF file

The PDF file can be found here.

Expected behavior

On versions before v0.5.23 Last row of the table: ['金', '', ''] image

Actual behavior

On v0.5.23 Last row of the table: ['支付其他与投资活动有关的现', '', ''] image

Environment

Additional context

Causes trouble when the vertical spacing between 2 words is big.

To some context, the change makes sense as a horizontal row of text is sandwiched between 2 lines and there are no consecutive empty lines (as can be found in the Expected Behavior's screenshot). What could be debated is where to put the line between 2 rows of text? Should it be in the middle (purple line)? Top of the bottom row (green line)? Bottom of the top row (orange line)? The current implementation has picked the top of the bottom row. image

I would prefer that it is reverted to the older approach in which both the top of the bottom row and bottom of the top row were kept and leave it up to the user to filter since there is no one-filter-suits-all. What are your thoughts @jsvine ?

TODO (from: @samkit-jain ): I looked at the horizontal edges but perhaps it affected vertical edges as well and I should test that out as well.

samkit-jain commented 4 years ago

https://github.com/jsvine/pdfplumber/issues/265#issue-690400685 has been edited with table extraction related issue info.

jsvine commented 4 years ago

Thank you for the beautifully detailed bug report, @samkit-jain! My hunch is that the problem stems from the "fixes" (😬) in https://github.com/jsvine/pdfplumber/commit/d224202, which were released as part of 0.5.23. I'll see if I can pinpoint the specific mechanism that is causing the unexpected behavior.

samkit-jain commented 4 years ago

Yes, if I undo the changes that were made to words_to_edges_h(), it works fine for horizontal rows. Another possible solution apart from just reverting the commit could be to update the current method by adding a check to add a new imaginary horizontal line at the bottom of the top row when the difference between the bottom of the top row and top of the bottom row crosses a certain threshold. The tricky part in this would be selecting a global threshold value.

jsvine commented 4 years ago

Thanks for confirming that, @samkit-jain. I'll try seeing if it's possible to correct the fixes (rather than just reverting the commit), so that they retain the simplicity. Hopefully, it's just a matter of squashing a bug :)

jsvine commented 4 years ago

Having spent a little bit more time looking at this ... I've come to believe that the behavior in v0.5.23 isn't wrong, per se — although it is, as you note, unexpected.

In this particular case, the issue seems to stem from the page number at the bottom of of the page (21) getting lumped in with the rest of the table when, in fact, the vertical lines don't extend that far down.

If you crop the page first (and increase the intersection tolerance), the table parses as expected:

import pdfplumber

pdf = pdfplumber.open("issue-67-example.pdf")
p = pdf.pages[20]

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "intersection_x_tolerance": 10,
}
cropped = p.crop((0, 120, p.width, p.height - 70))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)

image

(Last row of the table is, as expected: ['金', '', ''])

Even so, there is — as you note — probably a more robust way for pdfplumber to handle situations like these. Reverting the d224202 changes is one option, but I wonder whether there's a way still to improve it. I'm going to give it another thinking session, but am open to suggestions as well.

What could be debated is where to put the line between 2 rows of text? Should it be in the middle (purple line)? Top of the bottom row (green line)? Bottom of the top row (orange line)?

I've been puzzling over this and, unfortunately, I don't think there's an answer that will satisfy all/most cases. Even in the example in this issue, just putting the line halfway between the two rows would not solve the problem, since the 21 is so far from the vertical lines of the table.

samkit-jain commented 3 years ago

Another option could also be to revert the change and introduce 2 new table settings parameters snap_x_tolerance and snap_y_tolerance and let the user use snap_x_tolerance to combine the horizontal lines. The same can also be said for the vertical lines.

jsvine commented 3 years ago

Thanks, @samkit-jain! I think giving users the option to set snap_x_tolerance and snap_y_tolerance independently is a good idea, regardless. I'll add it to my todo list. I'm having trouble, though, visualizing how these options would apply to this particular issue. Could you expand the idea slightly?

jsvine commented 3 years ago

Closed core of this issue via #466 and #467 — though adding snap_x_tolerance and snap_y_tolerance still seems like a good idea. Thanks again @samkit-jain!