jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

chars x1 > page.width #283

Closed fdq09eca closed 4 years ago

fdq09eca commented 4 years ago

I have this Colab notebook. It produces the following result:

AUDITOR’S REMUNERATION 核數師之酬金
The remuneration in respect of services provided by Ernst & Young (“EY”)  就安永會計師事務所(「
is analysed as follows: 付的酬金分析如下:
Annual audit services 年度審核服務
Non-audit services* 非審核服務*

*  Such non-audit services include agreed-upon procedures on preliminary  *  該等非審核服務包括
announcement of annual results, transaction advisory, tax advisory and compliance  之協定程序、交易諮
services.

But I would like to have

AUDITOR’S REMUNERATION 核數師之酬金
The remuneration in respect of services provided by Ernst & Young (“EY”)  就安永會計師事務所(「安永」)提供之服務支
is analysed as follows: 付的酬金分析如下:
HK$’000
千港元
Annual audit services 年度審核服務 5,100
Non-audit services* 非審核服務* 2,587

7,687

*  Such non-audit services include agreed-upon procedures on preliminary  *  該等非審核服務包括有關全年業績之初步公告
announcement of annual results, transaction advisory, tax advisory and compliance  之協定程序、交易諮詢、稅務諮詢及合規服務。
services.

which is achievable if I change the definition of df_char from

class Page:
...
    @property
    def df_char(self) -> pd.DataFrame:
        df = pd.DataFrame(self.page.chars)
        df_langs = {
            'en': df[~df['text'].str.contains(r'[^\x00-\x7F]+')],
            'cn': df[df['text'].str.contains(r'[^\x00-\x7F]+')]
        }
        df = df_langs.get(self.df_lang, df)
        normal_bbx_coord = (df.x0 > 0) & (df.top > 0) & (df.x1 > 0) & (df.bottom > 0)
        normal_x1 = df['x1'] <= self.page.width
        within_bbx = normal_bbx_coord & normal_x1
        df_char_within_bbox = df[within_bbx]
        return df_char_within_bbox
...

to

class Page:
...
    @property
    def df_char(self) -> pd.DataFrame:
        df = pd.DataFrame(self.page.chars)
        df_langs = {
            'en': df[~df['text'].str.contains(r'[^\x00-\x7F]+')],
            'cn': df[df['text'].str.contains(r'[^\x00-\x7F]+')]
        }
        df = df_langs.get(self.df_lang, df)
        normal_bbx_coord = (df.x0 > 0) & (df.top > 0) & (df.x1 > 0) & (df.bottom > 0)
        within_bbx = normal_bbx_coord
        df_char_within_bbox = df[within_bbx]
        return df_char_within_bbox

As you can see, within_bbx changes from within_bbx = normal_bbx_coord & normal_x1 to within_bbx = normal_bbx_coord. The aim of the filter is to trim off the non-textual area.

I cannot understand why there are chars outside the page width and, peculiarly, whenever the remove_noise() method is used, page.page.width decreases. I think it is a bug that I introduced myself: when I only crop the page with pdfplumber, all text is still within page.width, but when I go through my class the bug appears. I have been struggling with it for a long time. Any suggestions will be appreciated.

=== UPDATE ===

from io import BytesIO

import pandas as pd
import pdfplumber
import requests

urls = [
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0827/2020082700690.pdf',
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0721/2020072100713.pdf',
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0721/2020072100653.pdf'
    ]

def job(url):
    # Download the PDF into memory and open it with pdfplumber.
    with requests.get(url) as response:
        response.raise_for_status()
        byte_obj = BytesIO(response.content)
        pdf = pdfplumber.open(byte_obj)
        problem_pages = []

    for page in pdf.pages:
        try:
            df_char = pd.DataFrame(page.chars)
            # Flag the page if any char's right edge lies beyond the page width.
            if not df_char[df_char.x1 > page.width].empty:
                problem_pages.append(page)
        except AttributeError:
            # A page without chars gives an empty DataFrame with no x1 column.
            continue

    print(f'{url}: {problem_pages}')

for url in urls:
    job(url)

It produces:

https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0827/2020082700690.pdf: [<Page:4>, <Page:6>, <Page:8>, <Page:10>, <Page:12>, <Page:14>, <Page:16>, <Page:18>, <Page:20>, <Page:22>, <Page:24>, <Page:26>, <Page:28>, <Page:30>, <Page:32>, <Page:34>, <Page:36>, <Page:38>, <Page:40>, <Page:42>, <Page:44>, <Page:46>, <Page:48>, <Page:50>, <Page:52>, <Page:54>, <Page:56>, <Page:58>, <Page:60>, <Page:62>, <Page:64>, <Page:66>, <Page:68>, <Page:70>, <Page:72>, <Page:74>, <Page:76>, <Page:78>, <Page:80>, <Page:82>, <Page:84>, <Page:86>, <Page:88>, <Page:90>, <Page:92>, <Page:94>, <Page:96>, <Page:98>, <Page:100>, <Page:102>, <Page:104>, <Page:106>, <Page:108>, <Page:110>, <Page:112>, <Page:114>, <Page:116>, <Page:118>, <Page:120>, <Page:122>, <Page:124>, <Page:126>, <Page:128>, <Page:130>, <Page:132>, <Page:134>, <Page:136>, <Page:138>, <Page:140>, <Page:142>, <Page:144>, <Page:146>, <Page:148>, <Page:150>, <Page:152>, <Page:154>, <Page:156>, <Page:158>, <Page:160>, <Page:162>, <Page:164>, <Page:166>, <Page:168>, <Page:170>, <Page:172>, <Page:174>, <Page:176>, <Page:178>, <Page:180>, <Page:182>, <Page:184>, <Page:186>, <Page:188>, <Page:190>]
https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0721/2020072100713.pdf: [<Page:58>]
https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0721/2020072100653.pdf: []

As you can see, some of the PDFs are normal, but some are not. @jsvine, is there something wrong with df_char? If so, why?

jsvine commented 4 years ago

Hi @fdq09eca, PDFs are allowed to place characters outside of their mediabox (or cropbox). If you want to automatically remove them, you could run page_inside = page.within_bbox(page.bbox).
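
As a rough illustration of what this kind of filter does, here is a plain-Python sketch that keeps only the chars lying entirely within a bounding box. It is not pdfplumber's actual implementation: the helper name chars_fully_inside, the page size, and the sample char dicts are all hypothetical, and only the x0/top/x1/bottom keys mirror the char fields used earlier in this thread.

```python
def chars_fully_inside(chars, bbox):
    """Keep only chars whose box lies entirely within bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return [
        c for c in chars
        if c["x0"] >= x0 and c["top"] >= top
        and c["x1"] <= x1 and c["bottom"] <= bottom
    ]


# Hypothetical chars on a page that is 595 wide and 842 high.
page_bbox = (0, 0, 595, 842)
chars = [
    {"text": "A", "x0": 100.0, "top": 50.0, "x1": 108.0, "bottom": 62.0},
    # Right edge spills past the page width.
    {"text": "B", "x0": 590.0, "top": 50.0, "x1": 601.0, "bottom": 62.0},
    # Left edge starts before the page origin.
    {"text": "C", "x0": -5.0, "top": 50.0, "x1": 3.0, "bottom": 62.0},
]

inside = chars_fully_inside(chars, page_bbox)
print([c["text"] for c in inside])
```

Only the first char survives here, because the other two cross the page boundary on one side.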

fdq09eca commented 4 years ago

@jsvine will the mediabox adjust accordingly when the page is cropped?

jsvine commented 4 years ago

Yes, the .crop(...) and .within_bbox(...) methods automatically adjust the bbox property: https://github.com/jsvine/pdfplumber/blob/3afd08620f345adbf60d5a21c1e201535745239f/pdfplumber/page.py#L315-L322
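
One consequence worth spelling out, sketched here as a toy model rather than pdfplumber's own code: if width is derived from the bbox, and chars on a cropped page keep their coordinates relative to the original page (both are assumptions in this sketch), then comparing x1 against width can flag chars that are in fact inside the crop. Comparing against the right edge of the bbox avoids that. ToyPage and the numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToyPage:
    """Toy stand-in for a page; bbox is (x0, top, x1, bottom)."""
    bbox: tuple

    @property
    def width(self):
        # Assumption: width is computed from the current bbox.
        return self.bbox[2] - self.bbox[0]

    def crop(self, bbox):
        # Cropping yields a new page whose bbox is the crop box.
        return ToyPage(bbox=bbox)


full = ToyPage(bbox=(0, 0, 600, 800))
cropped = full.crop((100, 0, 600, 800))  # trim 100 units off the left

# Right edge of a char, still expressed in original-page coordinates.
char_x1 = 550

print(cropped.width)              # the width shrinks after cropping
print(char_x1 > cropped.width)    # the char looks "outside" by this check
print(char_x1 > cropped.bbox[2])  # but it is not past the right edge
```

Under these assumptions, a check such as x1 <= page.bbox[2] would be the safer test on a cropped page than x1 <= page.width.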

fdq09eca commented 4 years ago

Yes, the .crop(...) and .within_bbox(...) methods automatically adjust the bbox property:

https://github.com/jsvine/pdfplumber/blob/3afd08620f345adbf60d5a21c1e201535745239f/pdfplumber/page.py#L315-L322

Excellent, thank you. That explains the cause of the bug.