jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Can't extract a table without lines #288

Closed playgithub closed 4 years ago

playgithub commented 4 years ago

pdf

http://www.cninfo.com.cn/new/disclosure/detail?plate=sse&orgId=9900005970&stockCode=601668&announcementId=1207611180&announcementTime=2020-04-25 (click the button on the top right to download the pdf, which has a download icon)

page

277

image

code

import pdfplumber
import pandas as pd
import re
import matplotlib.pyplot as plt

def clean_table(table):
    # Strip embedded newlines in place; skip cells that come back as None.
    for row in table:
        for i, val in enumerate(row):
            if val is not None and '\n' in val:
                row[i] = val.replace('\n', '')

def keep_visible_lines(obj):
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] == 0
    return True

path = '中国建筑:2019年年度报告.PDF'
ts = {"vertical_strategy": "text",
      "horizontal_strategy": "text",
      "text_tolerance": 10}

def show_annotated_page(page):
    im = page.to_image(resolution=200)
    # Pass the same settings used for extraction so the overlay matches.
    f = im.reset().debug_tablefinder(ts)
    plt.figure()
    plt.imshow(f.annotated)
    plt.show()

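# pd.option_context only applies inside a `with` block, so set the options globally.
for option, value in {
    'display.max_rows': None,
    'display.max_columns': None,
    'display.width': None,
    'display.max_colwidth': None,
    'display.unicode.ambiguous_as_wide': True,
    'display.unicode.east_asian_width': True,
}.items():
    pd.set_option(option, value)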

with pdfplumber.open(path) as pdf:    

    print('------------------------------------------------------------')

    for i in range(276, 277):
        page = pdf.pages[i]
        page = page.filter(keep_visible_lines)
        #show_annotated_page(page)
        for table in page.extract_tables(table_settings=ts):
            clean_table(table)
            df = pd.DataFrame(table)
            print(df)
            print('------------------------------------------------------------')

result

          0                1                 2            3
0      中国建筑           股份有限公司
1      财务报表               附注
2    2019 年                度
3     (除特别注   明外,金额单位为人民币千元)
4     四  合并      财务报表项目附注(续)
5     67、现金          流量表补充资料
6   (a)  将净    利润调节为经营活动现金流量
7                                      2019 年度      2018 年度
8        净利                润        63,205,243   55,350,200
9        加:           资产减值损失          (73,370)   10,465,899
10                    信用减值损失         3,611,595            —
11                    固定资产折旧         6,543,253    6,530,713
12                  投资性房地产折旧         1,718,108    1,514,560
13                    无形资产摊销           447,821      440,444
14                  长期待摊费用摊销           338,210      189,875
15           处置固定资产、无形资产和其他长
16                    期资产的收益         (568,141)     (175,112
17                      财务费用        10,179,757   12,568,535
18                  公允价值变动损失           484,752      368,343
19                      投资收益       (4,212,538)   (5,646,311
20                递延所得税资产的增加       (3,083,170)   (2,511,075
21           递延所得税负债的增加/(减少)           107,125     (644,386
22                  存货的增加  (       70,420,830)  (96,125,732
23              受限资金的(增加)/减少       (1,387,681)      280,759
24            经营性应收项目的增加  (1  25,133,902)    (  108,365,569
25                经营性应付项目的增加        83,217,173  135,881,219
26                        其他           806,518      188,928
samkit-jain commented 4 years ago

Hi @playgithub, for page 277 I would recommend using explicit vertical lines. You may optionally crop the page like

page = page.crop((0, 0.26*float(page.height), page.width, 0.75*float(page.height)))  # Remove the top 26% and bottom 25%.

to get better results.

The table settings using explicit lines would look like

{
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [
        Decimal(page.width) * Decimal('0.12'),
        Decimal(page.width) * Decimal('0.55'),
        Decimal(page.width) * Decimal('0.75'),
        Decimal(page.width) * Decimal('0.95'),
    ],
    "intersection_x_tolerance": 20,
}

explicit_vertical_lines is a list of x-coordinates for the column separators; here the lines sit at 12%, 55%, 75% and 95% of the page width.
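
As a minimal sketch of how those fractions turn into a crop box and table settings, the helpers below compute both from a page's width and height. The names `crop_box` and `explicit_line_settings` are hypothetical and not part of pdfplumber. Plain floats are used, which assumes a pdfplumber version whose page coordinates are floats rather than Decimal.

```python
def crop_box(width, height, top_frac=0.26, bottom_frac=0.75):
    """Return (x0, top, x1, bottom) keeping the band between two height fractions."""
    return (0.0, top_frac * float(height), float(width), bottom_frac * float(height))


def explicit_line_settings(width, fractions=(0.12, 0.55, 0.75, 0.95)):
    """Build table settings with vertical lines at fractions of the page width."""
    return {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "text",
        "explicit_vertical_lines": [float(width) * f for f in fractions],
        "intersection_x_tolerance": 20,
    }


# Hypothetical usage with a pdfplumber page object:
# page = page.crop(crop_box(page.width, page.height))
# tables = page.extract_tables(table_settings=explicit_line_settings(page.width))
```

The fractions are specific to page 277; other pages would likely need different values.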

Closing this issue as well. Feel free to reopen it if you have any further queries.

playgithub commented 4 years ago

I'm sure it will work for this specific page, but there are many tables to extract, so hand-picked coordinates can't help.
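
One possible way to avoid hand-picked coordinates, sketched here under assumptions rather than taken from the thread, is to guess the vertical lines from horizontal gaps between words. The helper `column_boundaries` is hypothetical; it expects dicts with `x0` and `x1` keys, the shape returned by `page.extract_words()`, and it assumes the page has already been cropped to the table body, since a title spanning the full width would close every gap.

```python
def column_boundaries(words, min_gap=10.0):
    """Guess vertical line positions from horizontal gaps between words.

    `words` is a list of dicts with "x0" and "x1" keys.
    """
    spans = sorted((float(w["x0"]), float(w["x1"])) for w in words)
    if not spans:
        return []
    # Merge word extents into bands; a band ends where a gap of at
    # least `min_gap` separates it from the next word.
    merged = [list(spans[0])]
    for x0, x1 in spans[1:]:
        if x0 - merged[-1][1] < min_gap:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])
    # One line at each outer edge and one in the middle of every gap.
    lines = [merged[0][0]]
    for left, right in zip(merged, merged[1:]):
        lines.append((left[1] + right[0]) / 2)
    lines.append(merged[-1][1])
    return lines
```

The returned list could then be passed as `explicit_vertical_lines` with `"vertical_strategy": "explicit"`. The `min_gap` threshold is a guess and would need tuning per document.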