pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

PDF file to image, data conversion error #3729

Closed Agoin-max closed 2 months ago

Agoin-max commented 2 months ago

Description of the bug

PDF file to image, data conversion error

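import shutil
import tempfile
from pathlib import Path
from typing import List, Union

import fitz  # PyMuPDF
from PIL import Image


def pdf2png_with_pymupdf(pdf_data: Union[bytes, str], matrix: int = 2) -> List[Image.Image]:
    """Render every page of a PDF (raw bytes or a file path) to a PIL image."""
    images: List[Image.Image] = []
    path = tempfile.mkdtemp()
    path_ = Path(path)

    try:
        if isinstance(pdf_data, bytes):
            # Write in-memory data to a temporary file so it can be opened by path.
            pdf_path = str(path_.joinpath("mypdf.pdf"))
            with open(pdf_path, "wb") as fs:
                fs.write(pdf_data)
        else:
            pdf_path = pdf_data

        doc = fitz.open(pdf_path)
        for page_index in range(len(doc)):
            page = doc.load_page(page_index)
            # Scale both axes by `matrix` (2 doubles the default resolution).
            pix = page.get_pixmap(matrix=fitz.Matrix(matrix, matrix))  # type: ignore
            # get_pixmap() defaults to RGB without alpha, which matches mode "RGB".
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)  # type: ignore
            images.append(img)
        doc.close()
    finally:
        # delete_temp_directory() is not defined in this snippet;
        # shutil.rmtree stands in for it here.
        shutil.rmtree(path, ignore_errors=True)
    return images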

354195 COMPLETE 2.pdf

How to reproduce the bug

The error appears on the first page of the PDF file. Original page (screenshot): 20240726-115954

After conversion to an image (screenshot): 20240726-120142

PyMuPDF version

1.24.6

Operating system

Windows

Python version

3.9

JorjMcKie commented 2 months ago

This file contains errors that prevent successful page rendering. MuPDF message: warning: non-embedded font using identity encoding: CourierStd-Bold (mapping via TrueType-UCS2). Other tools have the same problem.
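For anyone who wants to check a file for this condition themselves, the following is a minimal sketch, not an official diagnostic. It assumes the warning text has the wording quoted above, which may differ between MuPDF versions. The helper names `non_embedded_fonts` and `first_page_font_warnings` are hypothetical. `fitz.TOOLS.mupdf_warnings()` and `fitz.TOOLS.reset_mupdf_warnings()` are the PyMuPDF calls used to read and clear the warning buffer.

```python
import re
from typing import List

# Matches the warning quoted in this thread. The exact wording is an
# assumption and may vary between MuPDF versions.
_NON_EMBEDDED = re.compile(r"non-embedded font using identity encoding: (\S+)")


def non_embedded_fonts(warnings: str) -> List[str]:
    """Return font names from 'non-embedded font' warnings, without duplicates."""
    names: List[str] = []
    for name in _NON_EMBEDDED.findall(warnings):
        if name not in names:
            names.append(name)
    return names


def first_page_font_warnings(pdf_path: str) -> List[str]:
    """Render page 0 and report fonts that MuPDF warned about (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so non_embedded_fonts() works without it

    fitz.TOOLS.reset_mupdf_warnings()  # clear warnings left over from earlier calls
    with fitz.open(pdf_path) as doc:
        doc.load_page(0).get_pixmap()  # rendering triggers the font warnings
    return non_embedded_fonts(fitz.TOOLS.mupdf_warnings())
```

A non-empty result from `first_page_font_warnings("some.pdf")` suggests that the rendered text may not match the original, because the font has to be substituted.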

Agoin-max commented 1 month ago

ok. Thanks