jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Can not extract any tables from some pdf in Chinese #286

Closed playgithub closed 4 years ago

playgithub commented 4 years ago

pdfplumber version: 0.5.23

pdf: http://www.cninfo.com.cn/new/disclosure/detail?plate=sse&orgId=9900002221&stockCode=601318&announcementId=1207316204&announcementTime=2020-02-21 (click the download icon at the top right of the page to download the PDF)

code:

import pdfplumber

path = '中国平安:2019年年度报告.PDF'

# Context managers close both the PDF and the output file, even on error.
with pdfplumber.open(path) as pdf, open("result.txt", "wb") as txt_file:
    for page in pdf.pages:
        for table in page.extract_tables():
            for row in table:
                print(row)
                # Empty cells come back as None; replace them before joining.
                row = [(item or "") for item in row]
                row_str = "|".join(row) + "\n"
                txt_file.write(row_str.encode())
            print('\n------------------------------------------------------------\n')

result:

(cid:3668)(cid:18174)(cid:15546)(cid:17009)(cid:980)|(cid:3668)(cid:2152)(cid:10948)(cid:1426)(cid:5092)
(cid:18174)(cid:15546)(cid:12)(cid:12030)(cid:6061)|(cid:18174)(cid:15546)(cid:12)(cid:10828)(cid:5422)
(cid:14736)(cid:952)(cid:1095)|(cid:45)(cid:18)
(cid:19443)(cid:3087)(cid:6036)(cid:15737)(cid:3841)(cid:2437)(cid:1095)|
(cid:19443)(cid:3087)(cid:38)(cid:52)(cid:40)(cid:1995)(cid:1689)(cid:4305)|(cid:45)(cid:20)
(cid:38)(cid:52)(cid:40)(cid:6036)(cid:15737)(cid:4412)(cid:13297)(cid:9)(cid:36)(cid:52)(cid:51)(cid:16)(cid:42)(cid:51)(cid:16)(cid:49)(cid:51)(cid:16)(cid:19443)(cid:3087)(cid:13689)(cid:13866)(cid:858)(cid:5360)(cid:1040)(cid:15765)(cid:10)|
(cid:19443)(cid:3087)(cid:13689)(cid:13866)(cid:2178)(cid:1648)||(cid:45)(cid:21)
|(cid:5024)(cid:4278)(cid:19443)(cid:3087)(cid:14698)(cid:3165)(cid:4299)(cid:17186)(cid:11542)(cid:19298)|
(cid:6082)(cid:17009)(cid:13618)(cid:1696)(cid:12840)
(cid:2295)(cid:38)(cid:52)(cid:40)(cid:3841)(cid:2574)(cid:7216)|(cid:6082)(cid:17009)
(cid:12494)(cid:10547)(cid:3841)(cid:2437)(cid:1095)
(cid:12030)(cid:6061)
(cid:2302)(cid:4482)(cid:3841)(cid:2437)(cid:1095)|(cid:45)(cid:19)
(cid:5957)(cid:10898)(cid:827)(cid:6082)(cid:17009)(cid:1760)(cid:12419)(cid:3841)(cid:2437)(cid:1095)
(cid:4302)(cid:16590)(cid:827)(cid:19963)(cid:19350)(cid:12494)(cid:10547)(cid:3841)(cid:2437)(cid:1095)
(cid:6397)(cid:2362)(cid:3841)(cid:2437)(cid:1095)
(cid:15063)(cid:18073)(cid:3841)(cid:2437)(cid:1095)
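
Each `(cid:N)` token in the output above appears to be a raw character ID that was not mapped to Unicode text, so the cells are effectively unreadable. As a minimal sketch, a check like the following could flag such rows before they are written out. `CID_TOKEN` and `has_unmapped_glyphs` are hypothetical names, not part of pdfplumber.

```python
import re

# Matches placeholder tokens such as "(cid:3668)" seen in the output above.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def has_unmapped_glyphs(row):
    """Return True if any cell in the row contains a (cid:N) placeholder."""
    # Cells can be None for empty table cells, so fall back to "".
    return any(CID_TOKEN.search(cell or "") for cell in row)
```

Calling `has_unmapped_glyphs(row)` inside the row loop would make it easy to skip or log the affected rows.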

BTW, extracting plain text from the same file works fine:

import pdfplumber

path = '中国平安:2019年年度报告.PDF'

separator = '\n------------------------------------------------------------\n'

# Context managers close both the PDF and the output file, even on error.
with pdfplumber.open(path) as pdf, open("result.txt", "wb") as txt_file:
    for page in pdf.pages:
        # extract_text() may return None for a page with no text.
        txt = page.extract_text() or ""

        print(txt)
        print(separator)

        txt_file.write(txt.encode())
        txt_file.write(separator.encode())

samkit-jain commented 4 years ago

@playgithub Could you please specify the page number you want to extract the table from, and attach a screenshot highlighting the tabular region?

playgithub commented 4 years ago

On page 275:

image

samkit-jain commented 4 years ago

For this page, you can use the text-based strategy for table extraction. Pass the following table settings:

{
    "vertical_strategy": "text",
    "horizontal_strategy": "text"
}
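
A minimal sketch of how these settings might be passed to pdfplumber, assuming the table sits on the 275th page of the file (index 274) and that the settings dictionary goes in through the `table_settings` argument of `extract_tables`. The names `TEXT_TABLE_SETTINGS`, `extract_text_based_tables`, and `tables_on_page` are illustrative only.

```python
TEXT_TABLE_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_text_based_tables(page):
    """Extract tables from one page using text alignment instead of ruling lines."""
    return page.extract_tables(table_settings=TEXT_TABLE_SETTINGS)


def tables_on_page(path, page_number):
    """Open the PDF and extract tables from a 1-based page number."""
    # Imported here so the helper above works with any page-like object.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return extract_text_based_tables(pdf.pages[page_number - 1])
```

For the report in this issue the call would be `tables_on_page('中国平安:2019年年度报告.PDF', 275)`, assuming the page number shown in the viewer matches the page order in the file.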

playgithub commented 4 years ago

Thanks, it works, but there is an error:

...
['', '主提供信任额度;科技助力业务品质保持优良,实现营运利']
['2019年7月1日,董事长兼CEO马明哲先生携手全体平安人歌唱《我和我的祖', '']
['', '. . % . %']
['国》,献礼新中国成立70周年。', '润20952亿元,同比增长707 ,综合成本率964 。科技进']
['', '一步外溢赋能。金融壹账通凭借国际领先的区块链技术,产']
['', '']
['2019年,公司市值保持全球保险公司第一位,名列《财富》世', '品已覆盖中国所有大型银行、99%的城商行及52%的保险公']
['界500强第29位,《福布斯》全球上市公司2000强第7位,四度', '司,并于12月13日成功登陆美国纽约证券交易所,成为中国']
['蝉联BrandZ™全球第一保险品牌。公司实现归属于母公司股', '首家赴美上市的商业科技云服务平台企业。同时,我们积极']
Traceback (most recent call last):
  File "C:\Users\disc\Dev\Test\TestPdf\TestPdf\test_pdfplumber.py", line 25, in <module>
    txt_file.write(f"{str(row)}\n")
UnicodeEncodeError: 'gbk' codec can't encode character '\u2122' in position 10: illegal multibyte sequence
Press any key to continue . . .

If table_settings is not set, the error does not occur.

samkit-jain commented 4 years ago

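Writing str(row).encode("utf-8") should resolve the issue, provided the file is opened in binary mode ("wb"), since a text-mode file does not accept bytes. Alternatively, keep writing strings and open the file with an explicit encoding: txt_file = open("result.txt", "w", encoding="utf-8")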
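
The traceback shows the file was opened in text mode with the platform default codec (GBK here), which cannot represent `™` (U+2122). A minimal sketch of the explicit-encoding approach follows; `can_encode` and `write_rows` are hypothetical helpers added for illustration.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_rows(path, rows):
    """Write one row per line as UTF-8, regardless of the OS default encoding."""
    with open(path, "w", encoding="utf-8") as txt_file:
        for row in rows:
            txt_file.write(f"{row}\n")
```

Because the encoding is fixed in the `open` call, the same script should behave identically on Windows, Linux, and macOS.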