jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars, 659 forks

page.extract_table() result is None? #508

Closed yts2020 closed 2 years ago

yts2020 commented 3 years ago

When I run page.extract_table(), the result is [[''], [''], [''], ['']].

samkit-jain commented 3 years ago

Hi @yts2020, thanks for your interest in the library. Could you please update the issue with the details requested in https://github.com/jsvine/pdfplumber/blob/develop/.github/ISSUE_TEMPLATE/bug-report.md?

kt10001 commented 2 years ago

Hi @samkit-jain,
My code works on macOS, but on CentOS 7 the result is None.

This is my PDF file: ali001.pdf

pdfplumber==0.5.28


My code:

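import re

import pdfplumber

path = '/tmp/ali001.pdf'

# The context manager closes the file even if extraction raises.
with pdfplumber.open(path) as pdf:
    for page in pdf.pages[:2]:
        print(page.extract_text())
        for pdf_table in page.extract_tables():
            print(pdf_table)
            table = []
            cells = []  # accumulates one logical row that is split across partial rows
            for row in pdf_table:
                if not any(row):
                    # Empty row: flush the pending merged row, if any.
                    if any(cells):
                        table.append(cells)
                        cells = []
                elif all(row):
                    # Complete row: flush the pending merged row, then keep this row as is.
                    if any(cells):
                        table.append(cells)
                        cells = []
                    table.append(row)
                elif not cells:
                    # First partial row: copy it so the extracted row is not mutated.
                    cells = list(row)
                else:
                    # Later partial row: append its values to the pending cells.
                    for i, value in enumerate(row):
                        if value is not None:
                            cells[i] = value if cells[i] is None else cells[i] + value
            # Flush a merged row that is still pending at the end of the table.
            if any(cells):
                table.append(cells)
            for row in table:
                # Raw string avoids the invalid-escape warning for '\s'.
                print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])
            print('---------- ----------')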


kt10001 commented 2 years ago

Is this caused by the PDF version or by the font?

How can I solve it?

samkit-jain commented 2 years ago

@feikongl I am also unable to extract the text on my Ubuntu 18.04 machine. The PDF uses the font STSONG, and I think this is a duplicate of https://github.com/jsvine/pdfplumber/issues/332. Since it works for you on macOS, it could be that macOS has the font and can map the characters correctly, while CentOS 7 may not have the font and so cannot map them. I haven't tried them myself, but I found the following links, which might help you:

  1. https://programmerall.com/article/6175933567/
  2. https://stackoverflow.com/questions/17435434/missing-chinese-characters-on-linux-running-birt-report

If you find any other solution that helps you install the font and resolve the issue, please share it here.
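
To help confirm whether a font such as STSONG is involved, here is a rough diagnostic sketch. It groups the entries of page.chars by their "fontname" and counts how many have text of the form "(cid:N)", which is how pdfminer.six (the library pdfplumber builds on) renders a glyph it could not map to Unicode. The helper names summarize_chars and diagnose are hypothetical, not part of pdfplumber, and I have not run this against ali001.pdf. If page.chars comes back empty, the problem lies earlier than the character mapping and this check will not show anything.

```python
import re
from collections import Counter

# Assumption: pdfminer.six renders a glyph it cannot map to Unicode
# as the literal text "(cid:N)".
CID_PATTERN = re.compile(r"^\(cid:\d+\)$")


def summarize_chars(chars):
    """Return {fontname: (total_chars, unmapped_chars)}.

    `chars` is an iterable of dicts with "fontname" and "text" keys,
    the same shape as the entries in pdfplumber's `page.chars`.
    """
    total = Counter()
    unmapped = Counter()
    for char in chars:
        font = char.get("fontname", "<unknown>")
        total[font] += 1
        if CID_PATTERN.match(char.get("text") or ""):
            unmapped[font] += 1
    return {font: (total[font], unmapped[font]) for font in total}


def diagnose(path, max_pages=2):
    """Print a per-font summary for the first pages of a PDF (hypothetical helper)."""
    import pdfplumber  # imported here so summarize_chars can be used without it

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages[:max_pages]:
            print(page.page_number, summarize_chars(page.chars))
```

Calling diagnose('/tmp/ali001.pdf') on both macOS and CentOS 7 and comparing the output should show whether the same fonts are seen on both machines and whether one of them fails to map the characters.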

kt10001 commented 2 years ago

@samkit-jain Thank you, I will try it.