jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Can't extract tables with lines but without a bounding rect #291

Closed playgithub closed 4 years ago

playgithub commented 4 years ago

pdf

http://www.cninfo.com.cn/new/disclosure/detail?plate=sse&orgId=gssh0600352&stockCode=600352&announcementId=1207606607&announcementTime=2020-04-25 (click the button on the top right to download the pdf, which has a download icon)

page

133

image

code

import pdfplumber
import pandas as pd
import re
import matplotlib.pyplot as plt

def clean_table(table):
    # Strip embedded newlines from every cell in place.
    # pdfplumber returns None for empty cells, so skip those.
    for row in table:
        for i, val in enumerate(row):
            if val is not None:
                row[i] = val.replace('\n', '')

def keep_visible_lines(obj):
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] == 0
    return True

path = '浙江龙盛:2019年年度报告.PDF'

def show_annotated_page(page):
    im = page.to_image(resolution=200)
    f = im.reset().debug_tablefinder()
    plt.figure()
    plt.imshow(f.annotated)
    plt.show()

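# pd.option_context() only takes effect inside a `with` block,
# so set the display options globally with pd.set_option() instead.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)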

with pdfplumber.open(path) as pdf:    

    print('------------------------------------------------------------')

    page_index = 133
    for i in range(page_index - 1, page_index):
        page = pdf.pages[i]
        page = page.filter(keep_visible_lines)
        #show_annotated_page(page)
        for table in page.extract_tables():
            clean_table(table)
            df = pd.DataFrame(table)
            print(df)
            print('------------------------------------------------------------')

result

------------------------------------------------------------
               0
0  97,000,400.00
1  97,000,400.00
------------------------------------------------------------
                0
0             本期数
1  181,223,526.42
2  100,919,112.09
3    7,000,000.00
4      250,884.03
5        5,000.00
6
7
8  289,398,522.54
------------------------------------------------------------
                   0
0                本期数
1
2   5,306,258,923.50
3     203,414,953.22
4     661,219,043.72
5      39,681,655.41
6      15,417,234.53
7     -27,832,873.84
8      56,223,873.45
9      27,846,014.31
10    354,911,363.37
11   -572,739,357.93
12   -117,771,567.88
13    -18,724,217.52
14  1,346,231,191.81
15   -768,960,224.92
------------------------------------------------------------
jsvine commented 4 years ago

One possible strategy:

playgithub commented 4 years ago

Alright. Even after a fully lined table is found, the table may still be incomplete, so that case needs to be taken into account as well.
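
For readers hitting the same problem, here is a minimal sketch of one way to handle a table that has ruled lines but no enclosing border. It is an illustration only, not necessarily the strategy suggested above. The helper name `outer_edge_settings` and its `min_length` parameter are hypothetical; the sketch assumes the page's horizontal rules span the full table width, so their extremes can stand in for the missing left and right borders through pdfplumber's `explicit_vertical_lines` table setting.

```python
def outer_edge_settings(edges, min_length=1.0):
    """Build table_settings that add the missing left/right borders.

    `edges` is a list of dicts with "orientation", "x0" and "x1" keys,
    the shape pdfplumber uses for the entries of `page.edges`.
    `min_length` (hypothetical parameter) ignores very short strokes.
    """
    horizontal = [
        e for e in edges
        if e.get("orientation") == "h" and e["x1"] - e["x0"] >= min_length
    ]
    if not horizontal:
        # Nothing to anchor on: an empty dict keeps pdfplumber's defaults.
        return {}
    left = min(e["x0"] for e in horizontal)
    right = max(e["x1"] for e in horizontal)
    return {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        # Assumption: the widest horizontal rules mark the table's outer edges.
        "explicit_vertical_lines": [left, right],
    }


# Hypothetical usage with a pdfplumber page object:
#     settings = outer_edge_settings(page.edges)
#     tables = page.extract_tables(table_settings=settings)
```

If a page holds several tables of different widths, the same pair of borders is applied to all of them, so it is probably safer to crop the page to one table's region first and compute the settings from the cropped page's edges.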