jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Edged character is lost after extracting the table #262

Closed fdq09eca closed 3 years ago

fdq09eca commented 4 years ago

I have the following block of code, forgive me.

import pandas as pd
from io import BytesIO
import pdfplumber, requests, re
url, p = 'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0731/2020073101878.pdf', 40
with requests.get(url) as response:
    response.raise_for_status()
    byte_obj = BytesIO(response.content)
pdf = pdfplumber.open(byte_obj)
page = pdf.pages[p]

df = pd.DataFrame(page.chars)
df = df[~df.text.str.contains(r'[^\x00-\x7F]+')]
main_fontsizes = df['size'].mode()
df = df[~df['size'].isin(main_fontsizes)]
df = df.groupby(['top', 'bottom'])['text'].apply(''.join).reset_index()
target_top = df[df.text.str.contains(r'AUDITORS REMUNERATION', flags=re.IGNORECASE)]['top'].values[0]
condition = (df.top > target_top) & (df.text.str.contains(r'\w+'))
next_title = df[condition].head(1)
target_bottom = next_title.top.values[0]

x0, x1 = 0, float(page.width)
page = page.crop((x0, float(target_top), x1, float(target_bottom)), relative=True)

print(page.extract_text())

print('====')

setting = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

for li in page.extract_table(setting):
    if re.match(r'.*?\d+|^.{3,4}(\d{3})?$', li[-1]):
        print(''.join(li[:-1]), "|",li[-1])

It gives me this:

AUDITORS’ REMUNERATION  核數師酬金
During the Year, the remuneration in respect of audit and non-audit  於本年度內,與本公司核數師提供之核數
services provided by the Company’s auditors are set out as follows:  及非核數服務有關之酬金列示如下:
Services rendered Fee paid/payable 提供服務 已付╱應付費用
HK$’000 千港元
Audit services 1,250 核數服務 1,250
Non audit services 160 非核數服務 160
DIRECTOR’S SECURITIES TRANSACTIONS 董事之證券交易
====
HK$’000 | 千港元
Audit services1,250 | 核數服務 1,25
Non audit services160 | 非核數服務 16

As you can see, after extracting the table the trailing zero has disappeared (before → after: 1,250 → 1,25 and 160 → 16). My question is: how do I keep the trailing zero? I tried debug_tablefinder but couldn't work out the cause.

jsvine commented 4 years ago

Hi @fdq09eca, it appears that the character highlighted in blue in the screenshot below is causing the issue. It is positioned just a little bit to the left of those cut-off characters, enough to confuse the text-strategy algorithm.

[Screenshot: the offending character highlighted in blue]
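As a rough illustration of how a slightly misplaced edge can drop a trailing character, here is a toy sketch. It assumes a character belongs to a cell when its horizontal midpoint falls inside the cell's bounds. That is a simplification for illustration, not pdfplumber's actual code, and every coordinate below is hypothetical.

```python
def cell_text(chars, x0, x1):
    """Join the chars whose horizontal midpoint lies in [x0, x1)."""
    kept = [c for c in chars if x0 <= (c["x0"] + c["x1"]) / 2 < x1]
    return "".join(c["text"] for c in kept)


# Hypothetical positions for the digits of "1,250" (not taken from the PDF).
digits = [
    {"text": "1", "x0": 100.0, "x1": 105.0},
    {"text": ",", "x0": 105.0, "x1": 107.5},
    {"text": "2", "x0": 107.5, "x1": 112.5},
    {"text": "5", "x0": 112.5, "x1": 117.5},
    {"text": "0", "x0": 117.5, "x1": 122.5},
]

# Right edge at the true end of the number: every midpoint is inside the cell.
full = cell_text(digits, 95.0, 122.5)

# Right edge pulled slightly left (e.g. inferred from a stray character):
# the midpoint of the final "0" (120.0) now falls outside the cell.
clipped = cell_text(digits, 95.0, 119.0)

print(full, "|", clipped)
```

Under these assumptions, an edge that lands only a few points to the left of the last digit is enough to exclude it, which matches the symptom in the output above.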

The text-strategy algorithm could certainly use some improvement. In the meantime, one approach would be to use a crop that is more focused on the table itself, excluding the problematic character. This code, modified from what you provided, seems to work (although I don't know if it achieves the ultimate goal you are seeking):

df = pd.DataFrame(page.chars)

headings = (
    df
    .loc[lambda df: (
        df["fontname"].str.contains("HelveticaNeueLTStd-Md") &
        (df["size"] == 12)
    )]
    .groupby([ "top", "bottom" ])
    ['text']
    .apply(''.join)
    .reset_index()
)

subheadings = (
    df
    .loc[lambda df: (
        df["fontname"].str.contains("HelveticaNeueLTStd-Md") &
        (df["size"] == 9)
    )]
    .groupby([ "top", "bottom" ])
    ['text']
    .apply(''.join)
    .reset_index()
)

target_top = (
    subheadings
    .loc[lambda df: df["text"].str.contains(r'Services rendered')]
    ["top"]
    .values[0]
)

target_bottom = (
    headings
    .loc[lambda df: df["top"] > target_top]
    ["top"]
    .values[0]
    - 1
)

cropped = page.crop((0, target_top, page.width, target_bottom))

print(cropped.extract_text())

print('====')

setting = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

for li in cropped.extract_table(setting):
    if re.match(r'.*?\d+|^.{3,4}(\d{3})?$', li[-1]):
        print(''.join(li[:-1]), "|",li[-1])

Output:

Services rendered Fee paid/payable 提供服務 已付╱應付費用
HK$’000 千港元
Audit services 1,250 核數服務 1,250
Non audit services 160 非核數服務 160
====
HK$’000 | 千港元
Audit services1,250核數服務 | 1,250
Non audit services160非核數服務 | 160
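The heading and subheading lookups above share the same shape, so the bounds calculation can be sketched without pandas as a small helper working directly on a list of char dicts shaped like `page.chars`. This is only a sketch: the helper names `lines_of` and `table_bounds`, the `fake_line` fixture, and the "HelveticaNeueLTStd-Lt" font name are hypothetical, and the demo data is synthetic rather than taken from the PDF.

```python
from collections import defaultdict


def lines_of(chars, font, size):
    """Join chars of one font/size that share (top, bottom) into text lines."""
    rows = defaultdict(list)
    for c in chars:
        if font in c["fontname"] and c["size"] == size:
            rows[(c["top"], c["bottom"])].append(c["text"])
    # Sort by (top, bottom) so lines come back in reading order.
    return [{"top": top, "text": "".join(parts)}
            for (top, _), parts in sorted(rows.items())]


def table_bounds(chars, font, start_text, sub_size=9, head_size=12, margin=1):
    """Return (top, bottom): from the subheading containing start_text
    down to just above the next heading."""
    subheadings = lines_of(chars, font, sub_size)
    headings = lines_of(chars, font, head_size)
    top = next(l["top"] for l in subheadings if start_text in l["text"])
    bottom = next(l["top"] for l in headings if l["top"] > top) - margin
    return top, bottom


def fake_line(text, top, size, font="HelveticaNeueLTStd-Md"):
    """Build synthetic char dicts for one line (demo fixture only)."""
    return [{"text": ch, "top": top, "bottom": top + size,
             "size": size, "fontname": font} for ch in text]


chars = (
    fake_line("AUDITORS' REMUNERATION", 100, 12)
    + fake_line("Services rendered", 150, 9)
    + fake_line("Audit services 1,250", 170, 9, font="HelveticaNeueLTStd-Lt")
    + fake_line("DIRECTOR'S SECURITIES TRANSACTIONS", 210, 12)
)

bounds = table_bounds(chars, "HelveticaNeueLTStd-Md", "Services rendered")
print(bounds)
```

With a real page, the returned pair would be passed on as `page.crop((0, top, page.width, bottom))`, as in the code above.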

fdq09eca commented 4 years ago

@jsvine Thank you. Is it possible to extend the edge allowance to include that 0?

jsvine commented 4 years ago

I don't quite understand what you mean by "edge allowance." Can you provide more details about the question?

jsvine commented 3 years ago

Just checking back on this, @fdq09eca.