Text Extraction from PDF Results in Garbled Characters

Issue

When I extract text from the attached PDFs with pypdf, the output contains garbled characters. Copying and pasting the content from other PDF viewers produces similarly garbled text.

I'm unsure whether this is related to encoding settings or whether the extraction process can be corrected. Any guidance or fixes would be appreciated.

The PDFs are attached:

king arthur.pdf The Phantom of the Opera.pdf

pypdf version

pypdf==4.3.1

Code Sample:

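from pathlib import Path

from pypdf import PdfReader

def get_text_from_pdf_by_pypdf(file_path):
    """Return (text, number_of_pages), or (None, None) if extraction fails."""
    try:
        # Accept both str and Path inputs
        reader = PdfReader(Path(file_path).absolute())
        pages = reader.pages
        # Treat pages that yield no text as empty strings
        text = ''.join(page_obj.extract_text() or '' for page_obj in pages)
        return text, len(pages)

    except Exception as e:
        # Report the failure and signal it to the caller
        print(e)
        return None, None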

Output

The extracted content is garbled and looks like raw character codes (CID values) rather than readable text, for example:


7KHGDQFHUV

4XLFN4XLFN&ORVHWKHGRRU,W
VKLP
$QQLH6RUHOOLUDQLQW RWKH
GUHVVLQJURRPKHUIDFHZKLWH
2QHRIWKHJLUOVUDQDQGFORVHGWKHGRRUDQGWKHQWKH\DOOWXU QHGWR
$QQLH6RUHOOL
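
As a rough diagnostic, the sample output above looks like readable text with every character moved down by a fixed number of code points: shifting each non-space character up by 29 turns "7KHGDQFHUV" into "Thedancers". The sketch below only illustrates that observation. The offset of 29 is inferred from this sample alone, and `shift_text` is a hypothetical helper, not part of pypdf. It may not hold for other fonts or other PDFs, so it is a way to confirm the symptom rather than a fix for the extraction.

```python
# Assumption: the offset of 29 is inferred from the sample output only;
# other fonts or documents may need a different mapping or none at all.
OFFSET = 29


def shift_text(garbled, offset=OFFSET):
    """Shift every non-whitespace character up by `offset` code points.

    Whitespace is left untouched because the spaces and line breaks in the
    sample output do not appear to be shifted.
    """
    return ''.join(
        ch if ch.isspace() else chr(ord(ch) + offset)
        for ch in garbled
    )


if __name__ == "__main__":
    for line in ["7KHGDQFHUV", "$QQLH6RUHOOL"]:
        print(line, "->", shift_text(line))
```

If the shifted text is readable, that suggests the font in the PDF lacks a usable mapping from character codes to Unicode, which would also explain why copy and paste from other viewers gives the same result.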

py-pdf / pypdf

Text Extraction from PDF Results in Garbled Characters #2807
