py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Text Extraction from PDF Results in Garbled Characters #2807

Closed hrhktkbzyy closed 2 months ago

hrhktkbzyy commented 2 months ago

Issue

When extracting text from the attached PDF, the output contains garbled characters. Additionally, when I tried copying & pasting the content from other PDF viewers, similar issues occurred.

I'm unsure if this is related to encoding settings or if there's a way to correct the extraction process. Any guidance or fixes would be appreciated.

The PDFs are attached:

king arthur.pdf, The Phantom of the Opera.pdf

pypdf version

pypdf==4.3.1

Code Sample:

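from pathlib import Path

from pypdf import PdfReader


def get_text_from_pdf_by_pypdf(file_path):
    """Return (text, number_of_pages), or (None, None) if reading fails."""
    try:
        # Wrapping in Path() lets callers pass either a str or a Path.
        reader = PdfReader(Path(file_path).absolute())

        pages = reader.pages
        number_of_pages = len(pages)
        # extract_text() can return an empty string for pages without text.
        text = ''.join(page_obj.extract_text() or '' for page_obj in pages)

        return text, number_of_pages

    except Exception as e:
        # Broad catch kept from the original report: print the error and signal failure.
        print(e)
        return None, None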

Output

The extracted content is garbled, for example:


7KHGDQFHUV

4XLFN4XLFN&ORVHWKHGRRU,W
VKLP
$QQLH6RUHOOLUDQLQW RWKH
GUHVVLQJURRPKHUIDFHZKLWH
2QHRIWKHJLUOVUDQDQGFORVHGWKHGRRUDQGWKHQWKH\DOOWXU QHGWR
$QQLH6RUHOOL
pubpub-zz commented 2 months ago

Your documents have been protected against copying: pypdf reports the same results as other viewers. There is nothing to be done other than printing or scanning the pages and running OCR. This has been reported many times.
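As a side note on the sample output quoted in the issue: the garbled characters look like the original text with every code point lowered by 29 (for example `7` is `T` and `K` is `h`). The sketch below reverses that shift. The offset of 29 and the helper name `shift_decode` are assumptions inferred only from the excerpt above, not a pypdf feature, and other protected PDFs may use a different or non-constant mapping. Word spacing is not recovered, so OCR remains the general answer.

```python
# Assumption: the offset of 29 is inferred from the sample output in this issue.
# It is not part of pypdf, and other PDFs may need a different mapping.
ASSUMED_OFFSET = 29


def shift_decode(garbled: str, offset: int = ASSUMED_OFFSET) -> str:
    """Shift every non-whitespace character up by `offset` code points.

    Whitespace is left untouched because the spaces and newlines in the
    extracted text come from layout, not from the shifted character codes.
    """
    return ''.join(
        ch if ch.isspace() else chr(ord(ch) + offset)
        for ch in garbled
    )


if __name__ == "__main__":
    print(shift_decode("7KHGDQFHUV"))    # Thedancers
    print(shift_decode("$QQLH6RUHOOL"))  # AnnieSorelli
```

The decoded words run together because the original space characters are missing from the extracted text, so the result would still need manual or dictionary-based word splitting.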