Text Extraction from PDF Results in Garbled Characters

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

GNU Affero General Public License v3.0


Description of the bug

When extracting text from the attached PDFs, the output contains garbled characters. Copying and pasting the content from other PDF viewers produces similarly garbled text.

I'm unsure if this is related to encoding settings or if there is a way to correct this behavior. Any guidance or potential fixes would be appreciated.

How to reproduce the bug

The PDFs are attached:

king arthur.pdf
The Phantom of the Opera.pdf

Code Sample:

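import fitz  # PyMuPDF

def get_text_from_pdf_by_pymupdf(file_path):
    """Return (text, number_of_pages) for the PDF at file_path (a pathlib.Path)."""
    try:
        # The context manager closes the document even if extraction fails.
        with fitz.open(str(file_path.absolute())) as doc:
            number_of_pages = doc.page_count
            # get_text() returns the plain text of one page ('' if it has none).
            text = ''.join(page.get_text() for page in doc)
        return text, number_of_pages

    except Exception as e:
        # Report the error and signal failure to the caller.
        print(e)
        return None, None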

Output:

The extracted content is unreadable. It looks like raw character codes (possibly cid values) rather than the intended text, for example:

7KHGDQFHUV

4XLFN 4XLFN &ORVH WKH GRRU ,W
V KLP
 $QQLH 6RUHOOL UDQ LQWR WKH
GUHVVLQJURRPKHUIDFHZKLWH
2QH RI WKH JLUOV UDQ DQG FORVHG WKH GRRU DQG WKHQ WKH\ DOO WXUQHG WR
$QQLH6RUHOOL
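
One observation that may help narrow this down: in the sample above, every visible character appears to sit exactly 29 code points below the letter it should be ('7' is 'T', 'K' is 'h', 'H' is 'e'). The sketch below only illustrates that pattern on the pasted output. The helper name shift_decode and the offset of 29 are assumptions derived from this one sample; this is not a PyMuPDF feature, and other fonts or PDFs may need a different mapping or none at all.

```python
def shift_decode(garbled: str, offset: int = 29) -> str:
    """Shift every non-whitespace character up by `offset` code points.

    Hypothetical helper: the offset of 29 is inferred from the sample
    output in this report, not taken from the PDF's font tables.
    """
    return "".join(
        ch if ch.isspace() else chr(ord(ch) + offset)
        for ch in garbled
    )


# A line copied from the garbled output above.
sample = "2QH RI WKH JLUOV UDQ DQG FORVHG WKH GRRU"
print(shift_decode(sample))
```

Whitespace is left untouched because the spaces in the extracted text already look correct. Characters that originally sat fewer than 29 code points above zero (for example '!' or an apostrophe) cannot be recovered this way, which may explain the missing punctuation and the stray line break after ",W".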

PyMuPDF version

1.24.9

Operating system

macOS

Python version

3.12
