playgithub closed this issue 4 years ago
@playgithub Could you please specify the page number you want to extract the table from, and attach a screenshot highlighting the tabular region?
On page 275
For this page, you can use a text-based strategy for table extraction. Use the following table settings:
{
    "vertical_strategy": "text",
    "horizontal_strategy": "text"
}
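For reference, a minimal sketch of how these settings might be passed to pdfplumber. The helper name `extract_table_from_page` and its parameters are hypothetical, and it assumes "page 275" is the 1-based position of the page in the PDF rather than a printed page label.

```python
# Settings quoted above: infer both column and row boundaries from text alignment.
TABLE_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_table_from_page(pdf_path, page_number, settings=TABLE_SETTINGS):
    """Hypothetical helper: return the table on a 1-based page as a list of rows.

    The result may be None if pdfplumber finds no table on the page.
    """
    # Imported here so the settings dict can be reused without the library installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is a 0-based list
        return page.extract_table(settings)
```

A call would then look like `rows = extract_table_from_page("report.pdf", 275)`, where `report.pdf` is a placeholder for the downloaded file.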
Thanks, it works, but there is an error:
...
['', '主提供信任额度;科技助力业务品质保持优良,实现营运利']
['2019年7月1日,董事长兼CEO马明哲先生携手全体平安人歌唱《我和我的祖', '']
['', '. . % . %']
['国》,献礼新中国成立70周年。', '润20952亿元,同比增长707 ,综合成本率964 。科技进']
['', '一步外溢赋能。金融壹账通凭借国际领先的区块链技术,产']
['', '']
['2019年,公司市值保持全球保险公司第一位,名列《财富》世', '品已覆盖中国所有大型银行、99%的城商行及52%的保险公']
['界500强第29位,《福布斯》全球上市公司2000强第7位,四度', '司,并于12月13日成功登陆美国纽约证券交易所,成为中国']
['蝉联BrandZ™全球第一保险品牌。公司实现归属于母公司股', '首家赴美上市的商业科技云服务平台企业。同时,我们积极']
Traceback (most recent call last):
File "C:\Users\disc\Dev\Test\TestPdf\TestPdf\test_pdfplumber.py", line 25, in <module>
txt_file.write(f"{str(row)}\n")
UnicodeEncodeError: 'gbk' codec can't encode character '\u2122' in position 10: illegal multibyte sequence
Press any key to continue . . .
If table_settings is not set, there is no problem.
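Using str(row).encode("utf-8") should resolve the issue, but note that it returns bytes, so the file would have to be opened in binary mode ("wb"). Alternatively, keep writing text and open the file with an explicit encoding, like txt_file = open("result.txt", "w", encoding="utf-8")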
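The traceback shows the file falling back to the locale's 'gbk' codec, which cannot encode '\u2122' (™). A minimal sketch of the explicit-encoding approach follows; the helper name `write_rows` and the sample row are illustrative only, not taken from the original script.

```python
import os
import tempfile


def write_rows(rows, path):
    """Write each extracted row on its own line, always encoding as UTF-8."""
    # An explicit encoding avoids the platform default (gbk in the traceback above).
    with open(path, "w", encoding="utf-8") as txt_file:
        for row in rows:
            txt_file.write(f"{row}\n")


if __name__ == "__main__":
    # Sample row containing the trademark sign that triggered the error.
    sample_rows = [["蝉联BrandZ\u2122全球第一保险品牌", ""]]
    out_path = os.path.join(tempfile.mkdtemp(), "result.txt")
    write_rows(sample_rows, out_path)
    print(f"wrote {out_path}")
```

Any program that later reads the file needs to open it with `encoding="utf-8"` as well.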
pdfplumber version: 0.5.23
pdf: http://www.cninfo.com.cn/new/disclosure/detail?plate=sse&orgId=9900002221&stockCode=601318&announcementId=1207316204&announcementTime=2020-02-21 (click the button on the top right to download the pdf, which has a download icon)
code:
result:
BTW, extracting text works fine.