pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Text Extraction from PDF Results in Garbled Characters #3799

Closed hrhktkbzyy closed 2 months ago

hrhktkbzyy commented 2 months ago

Description of the bug

When extracting text from the attached PDF, the output contains garbled characters. Additionally, when I tried copying & pasting the content from other PDF viewers, similar issues occurred.

I'm unsure if this is related to encoding settings or if there is a way to correct this behavior. Any guidance or potential fixes would be appreciated.

How to reproduce the bug

The PDFs are attached:

king arthur.pdf, The Phantom of the Opera.pdf

Code Sample:

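import fitz  # PyMuPDF

def get_text_from_pdf_by_pymupdf(file_path):
    """Return (text, page_count) for a PDF given as a pathlib.Path, or (None, None) on failure."""
    try:
        text = ''
        pages = fitz.open(file_path.absolute())
        number_of_pages = len(pages)
        for page_obj in pages:
            # Default extraction flags; a page without text yields an empty string.
            text_add = page_obj.get_text()
            if text_add:
                text += text_add
        return text, number_of_pages

    except Exception as e:
        print(e)
        return None, None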

Output:

The extracted content includes cid values such as:

7KHGDQFHUV

4XLFN 4XLFN &ORVH WKH GRRU ,W
V KLP
 $QQLH 6RUHOOL UDQ LQWR WKH
GUHVVLQJURRPKHUIDFHZKLWH
2QH RI WKH JLUOV UDQ DQG FORVHG WKH GRRU DQG WKHQ WKH\ DOO WXUQHG WR
$QQLH6RUHOOL
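
The sample above looks like ordinary text with every character code lowered by a constant: adding 29 to each code point turns 7KHGDQFHUV into Thedancers. A minimal sketch of that observation follows. The offset 29 and the helper name shift_decode are inferred from this sample alone and are not part of PyMuPDF, so other fonts or files may need a different mapping, or may not be recoverable this way at all.

```python
def shift_decode(text, offset=29):
    """Undo a constant code-point shift, leaving whitespace untouched.

    The offset of 29 is an assumption read off the sample output in this
    report; it is not a PyMuPDF setting and may not hold for other fonts.
    """
    return "".join(
        ch if ch.isspace() else chr(ord(ch) + offset)
        for ch in text
    )


if __name__ == "__main__":
    print(shift_decode("7KHGDQFHUV"))    # Thedancers
    print(shift_decode("$QQLH6RUHOOL"))  # AnnieSorelli
```

Whitespace is kept as-is because the extractor may have inserted it itself. Punctuation can still be lost: a shifted character that lands on a control code (for example a line feed) is indistinguishable from real whitespace in the extracted text.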

PyMuPDF version

1.24.9

Operating system

macOS

Python version

3.12

JorjMcKie commented 2 months ago

This is not a bug; it goes back to properties (deficiencies) of the fonts used. If a glyph contains no back-reference to the Unicode character it was created from, there is no way to determine that Unicode value. This is what is happening in every case where a � appears.

In addition, with PyMuPDF's default extraction flags, the glyph number is used instead of the Unicode when the Unicode value is 0xFFFD (which produces that �). So you can try the extraction using flags=0 and see what happens instead.

But as you report, other extractors also deliver garbage, so we are simply out of luck!
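
The flags=0 suggestion can be tried with a sketch like the one below, which extracts the text once with the default flags and once with flags=0, then counts U+FFFD characters in the second result. The helper names extract_text and compare_flags are made up for illustration; only fitz.open and Page.get_text with its flags argument come from PyMuPDF, and the file path is whatever local copy of the attached PDF you have.

```python
import sys

REPLACEMENT = "\ufffd"  # U+FFFD, the character rendered as the replacement glyph


def extract_text(doc, flags=None):
    """Concatenate the text of all pages; flags=None keeps PyMuPDF's defaults."""
    parts = []
    for page in doc:
        if flags is None:
            parts.append(page.get_text("text"))
        else:
            parts.append(page.get_text("text", flags=flags))
    return "".join(parts)


def compare_flags(path):
    """Return (default_text, flags0_text, number of U+FFFD in flags0_text)."""
    import fitz  # PyMuPDF; imported here so extract_text has no hard dependency

    with fitz.open(path) as doc:
        default_text = extract_text(doc)
        plain_text = extract_text(doc, flags=0)
    return default_text, plain_text, plain_text.count(REPLACEMENT)


if __name__ == "__main__" and len(sys.argv) > 1:
    default_text, plain_text, unknown = compare_flags(sys.argv[1])
    print(default_text[:200])
    print(plain_text[:200])
    print("U+FFFD characters with flags=0:", unknown)
```

Note that flags=0 switches off every optional extraction flag, not only the glyph-number fallback, so whitespace and ligature handling may also differ between the two outputs.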