Closed chengtie closed 8 months ago
Hi @chengtie
As I understand it, the last row is omitted because there is no "bottom" line closing it off.
What I've done in cases like this is to use the .find_*() methods to return the table objects. We can then take the top and bottom edges of each Row to build a set of explicit_horizontal_lines to pass to the .extract_*() method.
Using the position of the "lowest" character on the page is one approach for generating your own "bottom" line.
from operator import itemgetter
[...]
explicit_horizontal_lines = set().union(*((row.bbox[1], row.bbox[3]) for row in page.find_table().rows))
explicit_horizontal_lines.add(max(page.chars, key=itemgetter("bottom"))["bottom"])
table = page.extract_table(dict(
    explicit_horizontal_lines=explicit_horizontal_lines,
    horizontal_strategy="explicit"
))
# Last row
print(table[-1])
['6',
'GEN',
'10/23/2023',
'9/18/2023',
'QUIT CLAIM\nDEED',
'LASTNAME, FIRST',
'LASTNAME, FIRST',
'Section:34\nTownship:80\nRange:21 Qtr\nSection:SE Qtr\nQtr Section:SE',
'2023-\n00005028',
'',
'2023-\n00004572',
'',
'3']
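Once the cut-off row is recovered, joining it with the first row of the next page could look something like the sketch below. Note that merge_split_row is a hypothetical helper, not part of pdfplumber. It assumes both rows have the same number of columns and that a continued cell should be joined to its first half with a newline. The cell values are made up for illustration.

```python
def merge_split_row(last_row, first_row, sep="\n"):
    """Join a row cut off at a page break with its continuation, cell by cell.

    Hypothetical helper: assumes both rows have the same columns in the
    same order, which may not hold for every PDF.
    """
    if len(last_row) != len(first_row):
        raise ValueError("rows must have the same number of columns")
    merged = []
    for top, bottom in zip(last_row, first_row):
        # Empty cells may come back as None or ""; skip both when joining.
        parts = [part for part in (top or "", bottom or "") if part]
        merged.append(sep.join(parts))
    return merged


# Made-up cell values, for illustration only.
page1_last = ["6", "GEN", "QUIT CLAIM\nDEED", "2023-", ""]
page2_first = ["", None, "", "00005028", "3"]
print(merge_split_row(page1_last, page2_first))
```

In practice the two arguments would be table[-1] from the first page and the first row returned by extract_table on the second page. Whether a newline is the right separator depends on how the cell text wraps in your document.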
I have the following PDF file, and I would like to extract the tables and rows. Note that the last row (numbered 6) on the first page is not finished; the rest of it is on the second page. I notice that the extract_table function omits the last row on the first page. Is there a way to extract that incomplete row from the first page as well, so that I can combine it with the first row of the second page?
SAMPLE PDF.pdf
Here is my code: