Use the extract_table() method to parse out such a table

wuliKingQin commented 4 years ago

When I parse the PDF with the extract_table() method, neither the "lines" nor the "text" strategy reads the complete table, whether or not I pass table settings such as { "vertical_strategy": "lines", "horizontal_strategy": "lines" }. The PDF in question is error_dpf_3.pdf.

With "lines", the table is not detected at all; with "text", the parsed result is wrong:

平安银行股份有限公司
资产负债表
2019年12月31日(除特别注明外，金额单位均为人民币百万元)

Could you help me work out how to parse a table like this from the PDF?

samkit-jain commented 4 years ago

Hi @CuteyBoy, could you confirm which version of pdfplumber you are using?

I am running 0.5.23 and using the following code:

import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]

# Infer both column and row boundaries from text alignment
ts = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

# Save a debug image showing what the table finder detects with these settings
im = p.to_image(resolution=150)
im.reset().debug_tablefinder(ts)
im.save("out.png", format="PNG")

tables = p.extract_tables(table_settings=ts)

for table in tables:
    for row in table:
        print(row)

I get the following output, which appears to be almost correct:

['平安银行股份有限公司', '', '', '']
['资产负债表', '', '', '']
['2019年12月31日', '', '', '']
['(除特别注明外，金额单位均为人民币百万元)', '', '', '']
['', '附注三', '2019年12月31日', '2018年12月31日']
['资产', '', '', '']
['现金及存放中央银行款项', '1', '252,230', '278,528']
['存放同业款项', '2', '85,684', '85,098']
['贵金属', '', '51,191', '56,835']
['拆出资金', '3', '79,369', '72,934']
['衍生金融资产', '4', '18,500', '21,460']
['买入返售金融资产', '5', '62,216', '36,985']
['发放贷款和垫款', '6', '2,259,349', '1,949,757']
['金融投资：', '', '', '']
['交易性金融资产', '7', '206,682', '148,768']
['债权投资', '8', '656,290', '629,366']
['其他债权投资', '9', '182,264', '70,664']
['其他权益工具投资', '10', '1,844', '1,519']
['投资性房地产', '11', '247', '194']
['固定资产', '12', '11,092', '10,899']
['使用权资产', '13', '7,517', '-']
['无形资产', '14', '4,361', '4,771']
['商誉', '15', '7,568', '7,568']
['递延所得税资产', '16', '34,725', '29,468']
['其他资产', '17', '17,941', '13,778']

The bottom row of the table is missing, which is related to #265.
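A quick way to confirm whether a particular row made it into the extracted output is to search by its first-column label. The sketch below is only illustrative: find_row is a hypothetical helper, not part of pdfplumber, and the sample rows are copied from the output in this thread.

```python
def find_row(rows, label):
    """Return the first row whose first cell equals `label`, or None."""
    for row in rows:
        if row and row[0] == label:
            return row
    return None


# Last two rows of the extracted output, plus the totals row for comparison
rows_without_total = [
    ["递延所得税资产", "16", "34,725", "29,468"],
    ["其他资产", "17", "17,941", "13,778"],
]
rows_with_total = rows_without_total + [
    ["资产总计", "", "3,939,070", "3,418,592"],
]

print(find_row(rows_without_total, "资产总计"))  # None: the totals row is absent
print(find_row(rows_with_total, "资产总计"))
```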

As a workaround, you can crop off the bottom portion of the page (p = p.crop((0, 0, p.width, p.height-100))) and then rerun the extraction, which gives the following output:

['平安银行股份有限公司', '', '', '']
['资产负债表', '', '', '']
['2019年12月31日', '', '', '']
['(除特别注明外，金额单位均为人民币百万元)', '', '', '']
['', '附注三', '2019年12月31日', '2018年12月31日']
['资产', '', '', '']
['现金及存放中央银行款项', '1', '252,230', '278,528']
['存放同业款项', '2', '85,684', '85,098']
['贵金属', '', '51,191', '56,835']
['拆出资金', '3', '79,369', '72,934']
['衍生金融资产', '4', '18,500', '21,460']
['买入返售金融资产', '5', '62,216', '36,985']
['发放贷款和垫款', '6', '2,259,349', '1,949,757']
['金融投资：', '', '', '']
['交易性金融资产', '7', '206,682', '148,768']
['债权投资', '8', '656,290', '629,366']
['其他债权投资', '9', '182,264', '70,664']
['其他权益工具投资', '10', '1,844', '1,519']
['投资性房地产', '11', '247', '194']
['固定资产', '12', '11,092', '10,899']
['使用权资产', '13', '7,517', '-']
['无形资产', '14', '4,361', '4,771']
['商誉', '15', '7,568', '7,568']
['递延所得税资产', '16', '34,725', '29,468']
['其他资产', '17', '17,941', '13,778']
['资产总计', '', '3,939,070', '3,418,592']
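The crop workaround above could be wrapped in a small helper along these lines. This is a minimal sketch, not an official recipe: the function names, the default margin of 100 and the default page index are assumptions for illustration, and the right margin depends on the page layout.

```python
def bottom_cropped_bbox(width, height, margin=100):
    """Return (x0, top, x1, bottom) for the page minus a strip at the bottom."""
    if not 0 <= margin < height:
        raise ValueError("margin must be at least 0 and less than the page height")
    return (0, 0, width, height - margin)


def extract_tables_without_footer(path, page_index=0, margin=100, table_settings=None):
    """Crop the bottom `margin` units off one page, then extract its tables."""
    import pdfplumber  # imported here so the bbox helper can be used without it

    # Default to the text-based strategies used earlier in this thread
    settings = table_settings or {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
    }
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        bbox = bottom_cropped_bbox(page.width, page.height, margin)
        return page.crop(bbox).extract_tables(table_settings=settings)
```

Calling extract_tables_without_footer("file.pdf") would then mirror the steps above. If rows are still dropped or footer text leaks in, adjusting margin is the first thing to try.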

Does this resolve your issue?

wuliKingQin commented 4 years ago

Thank you very much!

jsvine / pdfplumber

Use the extract_table() method to parse out such a table #268
