jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

When a cell text in a table breaks a line, it will be parsed into two rows #488

Closed hopepanwei closed 3 years ago

hopepanwei commented 3 years ago

When a cell text in a table breaks a line, it will be parsed into two rows

002560.pdf

Simple parsing code:

    import pdfplumber

    pdf = pdfplumber.open(path)
    p50 = pdf.pages[0]
    tables = p50.extract_tables()
    print(tables)

Expected:

['信用减值损失(损失以“-”号填列)', '', ''], [None, '-5,237,613.31', '-36,590,747.55']

Actual:

['信用减值损失(损失以“-”号填', '', ''], [None, '-5,237,613.31', '-36,590,747.55'], ['列)', None, None]

I adjusted vertical_strategy and horizontal_strategy, but that did not help. Is there any other way to get the result I want? Thanks.

samkit-jain commented 3 years ago

Hi @hopepanwei, I appreciate your interest in the library. You are getting this unexpected result because there are some hidden horizontal lines on the page, as you can see below (result of debugging the table using the lines-lines strategy):

image

To remove those, you can use the .filter() method and keep only the visible lines. Sample code:

import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]

def keep_visible_lines(obj):
    """
    If the object is a ``rect`` type, keep it only if the lines are visible.

    A visible line is the one having ``non_stroking_color`` as 0.
    """
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] == 0
    return True

# Filter out hidden lines.
p = p.filter(keep_visible_lines)

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}

# Extract the table.
tables = p.extract_tables(table_settings=ts)

for table in tables:
    print()
    for row in table:
        print(row)

# Save the table image.
im = p.to_image(resolution=200)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

Output image

['“-”号填列)', '', '']
['信用减值损失(损失以“-”号填\n列)', '-5,237,613.31', '-36,590,747.55']
['资产减值损失(损失以“-”号填\n列)', '286,747.04', '144,816.32']
['资产处置收益(损失以“-”号填\n列)', '-558,921.70', '42,106,204.80']
['三、营业利润(亏损以“-”号填列)', '52,907,764.67', '95,922,158.60']
['加:营业外收入', '70,918.18', '47,425.10']
['减:营业外支出', '489,072.73', '657,567.05']
['四、利润总额(亏损总额以“-”号填列)', '52,489,610.12', '95,312,016.65']
['减:所得税费用', '5,200,515.37', '9,062,954.02']
['五、净利润(净亏损以“-”号填列)', '47,289,094.75', '86,249,062.63']
['(一)按经营持续性分类', '', '']
['1.持续经营净利润(净亏损以“-”\n号填列)', '47,289,094.75', '86,249,062.63']
['2.终止经营净利润(净亏损以“-”\n号填列)', '', '']
['(二)按所有权归属分类', '', '']
['1.归属于母公司所有者的净利润', '47,615,098.03', '94,372,673.60']
['2.少数股东损益', '-326,003.28', '-8,123,610.97']
['六、其他综合收益的税后净额', '-4,066,527.50', '4,089,116.25']
['归属母公司所有者的其他综合收益\n的税后净额', '-4,066,527.50', '4,089,116.25']
['(一)不能重分类进损益的其他综\n合收益', '', '']
['1.重新计量设定受益计划变\n动额', '', '']
['2.权益法下不能转损益的其\n他综合收益', '', '']
['3.其他权益工具投资公允价\n值变动', '', '']
['4.企业自身信用风险公允价\n值变动', '', '']
['5.其他', '', '']
['(二)将重分类进损益的其他综合\n收益', '-4,066,527.50', '4,089,116.25']
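The extracted cells above still contain "\n" where the text wraps inside a cell, whereas the expected result in the original question had the text joined. A minimal post-processing sketch, assuming the table is a list of rows as returned by extract_tables(); the name `clean_table` and its `joiner` parameter are hypothetical helpers, not part of pdfplumber:

```python
def clean_table(table, joiner=""):
    """Join wrapped lines inside each cell; leave None cells untouched.

    `joiner` is "" by default, which suits Chinese text. For
    space-separated languages, " " is probably the better choice.
    """
    return [
        [cell.replace("\n", joiner) if isinstance(cell, str) else cell for cell in row]
        for row in table
    ]


rows = [["信用减值损失(损失以“-”号填\n列)", "-5,237,613.31", None]]
print(clean_table(rows))
```

This only tidies cell text after extraction. It does not change how rows are detected, so the filter is still needed to stop hidden lines from splitting a row.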

BrianCKLu commented 2 years ago

Hi @samkit-jain, thanks for the solution you provided. It works for the file below.

But after applying it to the file below, it can't even read a normal table.

Could you give me a hint or a solution for this issue? Many thanks!

Note: I tried the following:

import pdfplumber

def keep_visible_lines(obj):
    """
    If the object is a ``rect`` type, keep it only if the lines are visible.

    A visible line is the one having ``non_stroking_color`` as 0.
    """
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] == 0
    return True

pdf = pdfplumber.open("file.pdf")

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}

p = pdf.pages[0]
p_f = p.filter(keep_visible_lines)

# Try the filtered page first.
tables = p_f.extract_tables(table_settings=ts)

# Fall back to the unfiltered page if filtering removed every table.
if not tables:
    tables = p.extract_tables(table_settings=ts)

for table in tables:
    print()
    for row in table:
        print(row)

# Save the table image.
im = p.to_image(resolution=200)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

This works around the issue, but I don't think it fixes the root cause, lol.

What if the table in the second PDF also has the line-break issue? Then the code above will fail.

jsvine commented 2 years ago

Hi @BrianCKLu, I've now looked into this. With the new PDF you've shared, the default table extraction works fine:

import pdfplumber
pdf = pdfplumber.open("U5501_r.pdf")

page = pdf.pages[0]
im = page.to_image()

im.debug_tablefinder()

... produces this:

image

The custom code you're using runs into problems because there are multiple ways to define "black" in a PDF's graphics state, for instance in non_stroking_color. One is, as you've accounted for above, simply the integer 0. But your second PDF uses a different representation: [0, 0, 0] (each item in the list stands for an RGB value). You can see this by running the following code:

for rect in page.rects:
    print(rect["non_stroking_color"])

... which will print:

[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
...

You could adjust your custom code to change this line:

return obj['non_stroking_color'] == 0

... to this:

return obj['non_stroking_color'] in (0, [0, 0, 0])

That would handle this particular case. But, just so you're aware: in other PDFs, you might still run into other "visible" lines that either aren't strictly black (e.g., just very dark gray) or use yet another representation (see Section 4.5, "Color Spaces", of the PDF Reference).
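To cover several of these representations at once, the color check could be pulled into a small helper. The following is a rough sketch, not a pdfplumber API: `is_black` and `tolerance` are hypothetical names, color components are assumed to lie in the 0..1 range, and the CMYK conversion is a naive approximation.

```python
from numbers import Real


def is_black(color, tolerance=0.0):
    """Return True if `color` looks like black in gray, RGB, or CMYK form."""
    if isinstance(color, Real) and not isinstance(color, bool):
        components = (float(color),)
    elif isinstance(color, (list, tuple)) and all(
        isinstance(c, Real) and not isinstance(c, bool) for c in color
    ):
        components = tuple(float(c) for c in color)
    else:
        return False  # None, pattern names, or anything unexpected

    if len(components) == 1:    # grayscale: 0 is black
        rgb = components * 3
    elif len(components) == 3:  # RGB: all zeros is black
        rgb = components
    elif len(components) == 4:  # CMYK: naive conversion to RGB
        c, m, y, k = components
        rgb = ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    else:
        return False
    return all(value <= tolerance for value in rgb)


def keep_visible_lines(obj, tolerance=0.0):
    """Keep non-rect objects; keep rects only if their fill looks black."""
    if obj["object_type"] == "rect":
        return is_black(obj.get("non_stroking_color"), tolerance)
    return True
```

With the default tolerance this accepts 0, (0,), [0, 0, 0], and CMYK [0, 0, 0, 1]. Raising `tolerance` slightly (for example to 0.1) would also accept very dark grays. Whether "black fill" is the right definition of "visible" still depends on the PDF in question.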

HKAFITGlitter commented 9 months ago

(Quoting @samkit-jain's reply above in full: the keep_visible_lines filter, the "lines" table settings, and the sample output.)

It doesn't work now.

Versions of related packages:

    pdf2image      1.16.3      pypi_0    pypi
    pdfminer-six   20221105    pypi_0    pypi
    pdfplumber     0.10.3      pypi_0    pypi
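
One possible explanation, offered as an assumption rather than a confirmed diagnosis: newer pdfplumber releases may report colors as tuples such as (0,) rather than a bare 0, in which case the comparison `== 0` never matches and every rect is filtered out. A quick way to check is to tally the values actually present on the page. The helper name `summarize_rect_colors` is hypothetical:

```python
from collections import Counter


def summarize_rect_colors(rects):
    """Count how often each non_stroking_color value (and its type) occurs."""
    return Counter(
        (type(r.get("non_stroking_color")).__name__, repr(r.get("non_stroking_color")))
        for r in rects
    )


# Stand-in data; with a real page, pass `page.rects` instead.
sample = [
    {"non_stroking_color": (0,)},
    {"non_stroking_color": (0,)},
    {"non_stroking_color": 1},
]
for (type_name, value), count in summarize_rect_colors(sample).items():
    print(type_name, value, count)
```

If the tally shows tuples, the filter needs to compare against those tuple values (or use a membership test that includes them) instead of the integer 0.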