Closed samkit-jain closed 3 years ago
https://github.com/jsvine/pdfplumber/issues/265#issue-690400685 has been edited with table extraction related issue info.
Thank you for the beautifully detailed bug report, @samkit-jain! My hunch is that the problem stems from the "fixes" (😬) in https://github.com/jsvine/pdfplumber/commit/d224202, which were released as part of 0.5.23
. I'll see if I can pinpoint the specific mechanism that is causing the unexpected behavior.
Yes, if I undo the changes that were made to words_to_edges_h()
, it works fine for horizontal rows. Another possible solution apart from just reverting the commit could be to update the current method by adding a check to add a new imaginary horizontal line at the bottom of the top row when the difference between the bottom of the top row and top of the bottom row crosses a certain threshold. The tricky part in this would be selecting a global threshold value.
Thanks for confirming that, @samkit-jain. I'll try seeing if it's possible to correct the fixes (rather than just reverting the commit), so that they retain the simplicity. Hopefully, it's just a matter of squashing a bug :)
Having spent a little bit more time looking at this ... I've come to believe that the behavior in v0.5.23
isn't wrong, per se — although it is, as you note, unexpected.
In this particular case, the issue seems to stem from the page number at the bottom of of the page (21
) getting lumped in with the rest of the table when, in fact, the vertical lines don't extend that far down.
If you crop the page first (and increase the intersection tolerance), the table parses as expected:
import pdfplumber
pdf = pdfplumber.open("issue-67-example.pdf")
p = pdf.pages[20]
ts = {
"vertical_strategy": "lines",
"horizontal_strategy": "text",
"intersection_x_tolerance": 10,
}
cropped = p.crop((0, 120, p.width, p.height - 70))
im = cropped.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
(Last row of the table is, as expected: ['金', '', '']
)
Even so, there is — as you note — probably a more robust way for pdfplumber
to handle situations like these. Reverting the d224202 changes is one option, but I wonder whether there's a way still to improve it. I'm going to give it another thinking session, but am open to suggestions as well.
What could be debated is where to put the line between 2 rows of text? Should it be in the middle (purple line)? Top of the bottom row (green line)? Bottom of the top row (orange line)?
I've been puzzling over this and, unfortunately, I don't think there's an answer that will satisfy all/most cases. Even in the example in this issue, just putting the line halfway between the two rows would not solve the problem, since the 21
is so far from the vertical lines of the table.
Another option could also be to revert the change and introduce 2 new table settings parameters snap_x_tolerance
and snap_y_tolerance
and let the user use snap_x_tolerance
to combine the horizontal lines. The same can also be said for the vertical lines.
Thanks, @samkit-jain! I think giving users the option to set snap_x_tolerance
and snap_y_tolerance
independently is a good idea, regardless. I'll add it to my todo list. I'm having trouble, though, visualizing how these options would apply to this particular issue. Could you expand the idea slightly?
Closed core of this issue via #466 and #467 — though adding snap_x_tolerance
and snap_y_tolerance
still seems like a good idea. Thanks again @samkit-jain!
Describe the bug
In https://github.com/jsvine/pdfplumber/commit/d224202bdf488cf83141415282b8727913e9c989, the logic for finding horizontal and vertical lines connecting
n
number of words was simplified. When finding the horizontal lines, the logic was updated to keep the "top" of all line rects and the bottom of only the last line rect. This is causing problems with table detection as the final number of horizontal lines has reduced and when the gap between 2 rows is big, it can provide inconsistent results when used together withsnap_tolerance
.The height of the line is also not in sync with the height of text it possesses.
Code to reproduce the problem
PDF file
The PDF file can be found here.
Expected behavior
On versions before v0.5.23 Last row of the table:
['金', '', '']
Actual behavior
On v0.5.23 Last row of the table:
['支付其他与投资活动有关的现', '', '']
Environment
Additional context
Causes trouble when the vertical spacing between 2 words is big.
To some context, the change makes sense as a horizontal row of text is sandwiched between 2 lines and there are no consecutive empty lines (as can be found in the Expected Behavior's screenshot). What could be debated is where to put the line between 2 rows of text? Should it be in the middle (purple line)? Top of the bottom row (green line)? Bottom of the top row (orange line)? The current implementation has picked the top of the bottom row.
I would prefer that it is reverted to the older approach in which both the top of the bottom row and bottom of the top row were kept and leave it up to the user to filter since there is no one-filter-suits-all. What are your thoughts @jsvine ?
TODO (from: @samkit-jain ): I looked at the horizontal edges but perhaps it affected vertical edges as well and I should test that out as well.