jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

Use the extract_table() method to parse out such a table #268

Closed: wuliKingQin closed this issue 4 years ago

wuliKingQin commented 4 years ago

When I parse the PDF with the extract_table() method, I cannot read the complete table, whichever table settings I pass (for example { "vertical_strategy": "lines", "horizontal_strategy": "lines" }). Neither the "lines" nor the "text" strategy works. The PDF in question is: error_dpf_3.pdf

With "lines", the table is not detected at all; with "text", the parsed result is wrong:

平安银行股份有限公司
资产负债表
2019年12月31日(除特别注明外,金额单位均为人民币百万元)

Can you help me work out how to parse the table in a PDF like this?

samkit-jain commented 4 years ago

Hi @CuteyBoy, could you confirm which version of pdfplumber you are using?

I am running 0.5.23 with the following code:

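import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]

# Infer column and row boundaries from text alignment,
# because the "lines" strategy does not detect this table.
ts = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

# Save a debug image showing what the table finder detects.
im = p.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

# Extract every table on the page and print it row by row.
tables = p.extract_tables(table_settings=ts)

for table in tables:
    for row in table:
        print(row)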

I get the following output, which appears to be almost correct:

['平安银行股份有限公司', '', '', '']
['资产负债表', '', '', '']
['2019年12月31日', '', '', '']
['(除特别注明外,金额单位均为人民币百万元)', '', '', '']
['', '附注三', '2019年12月31日', '2018年12月31日']
['资产', '', '', '']
['现金及存放中央银行款项', '1', '252,230', '278,528']
['存放同业款项', '2', '85,684', '85,098']
['贵金属', '', '51,191', '56,835']
['拆出资金', '3', '79,369', '72,934']
['衍生金融资产', '4', '18,500', '21,460']
['买入返售金融资产', '5', '62,216', '36,985']
['发放贷款和垫款', '6', '2,259,349', '1,949,757']
['金融投资:', '', '', '']
['交易性金融资产', '7', '206,682', '148,768']
['债权投资', '8', '656,290', '629,366']
['其他债权投资', '9', '182,264', '70,664']
['其他权益工具投资', '10', '1,844', '1,519']
['投资性房地产', '11', '247', '194']
['固定资产', '12', '11,092', '10,899']
['使用权资产', '13', '7,517', '-']
['无形资产', '14', '4,361', '4,771']
['商誉', '15', '7,568', '7,568']
['递延所得税资产', '16', '34,725', '29,468']
['其他资产', '17', '17,941', '13,778']

The bottom row of the table is missing, which is related to #265.

As a workaround, you can crop off the bottom portion of the page (p = p.crop((0, 0, p.width, p.height-100))) and then rerun the extraction, which gives the following output:

['平安银行股份有限公司', '', '', '']
['资产负债表', '', '', '']
['2019年12月31日', '', '', '']
['(除特别注明外,金额单位均为人民币百万元)', '', '', '']
['', '附注三', '2019年12月31日', '2018年12月31日']
['资产', '', '', '']
['现金及存放中央银行款项', '1', '252,230', '278,528']
['存放同业款项', '2', '85,684', '85,098']
['贵金属', '', '51,191', '56,835']
['拆出资金', '3', '79,369', '72,934']
['衍生金融资产', '4', '18,500', '21,460']
['买入返售金融资产', '5', '62,216', '36,985']
['发放贷款和垫款', '6', '2,259,349', '1,949,757']
['金融投资:', '', '', '']
['交易性金融资产', '7', '206,682', '148,768']
['债权投资', '8', '656,290', '629,366']
['其他债权投资', '9', '182,264', '70,664']
['其他权益工具投资', '10', '1,844', '1,519']
['投资性房地产', '11', '247', '194']
['固定资产', '12', '11,092', '10,899']
['使用权资产', '13', '7,517', '-']
['无形资产', '14', '4,361', '4,771']
['商誉', '15', '7,568', '7,568']
['递延所得税资产', '16', '34,725', '29,468']
['其他资产', '17', '17,941', '13,778']
['资产总计', '', '3,939,070', '3,418,592']
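
The crop workaround can be wrapped in a small helper. This is only a minimal sketch, not part of pdfplumber itself: the names bottom_cropped_bbox and extract_rows_without_footer are hypothetical, the 100-point margin is simply the value used above and may need adjusting for other pages, and the page's bounding box is assumed to start at (0, 0).

```python
def bottom_cropped_bbox(width, height, margin=100):
    """Return an (x0, top, x1, bottom) box that drops `margin` points from the page bottom."""
    if not 0 <= margin < height:
        raise ValueError("margin must be non-negative and smaller than the page height")
    return (0, 0, width, height - margin)


def extract_rows_without_footer(path, table_settings, margin=100, page_index=0):
    """Crop the bottom of one page, then extract its tables as lists of rows.

    Hypothetical helper: it only combines Page.crop() and
    Page.extract_tables(), the same calls used earlier in this thread.
    """
    import pdfplumber  # third-party; imported here so the helper above stays dependency-free

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        cropped = page.crop(bottom_cropped_bbox(page.width, page.height, margin))
        return cropped.extract_tables(table_settings=table_settings)
```

Calling extract_rows_without_footer("file.pdf", ts) with the ts settings from the earlier snippet would repeat the rerun described above.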

Does this resolve your issue?

wuliKingQin commented 4 years ago

Thank you very much!