pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

PDF-to-image rendering ignores optional content layers that are switched off #3806

Open PasaOpasen opened 3 weeks ago

PasaOpasen commented 3 weeks ago

Description of the bug

For some documents whose optional content layers are already disabled, the rendered pages still contain the content of those layers.

example link: https://dropmefiles.com/zTbp4

How to reproduce the bug

import fitz  # PyMuPDF
from PIL import Image

f = 'path/oc-for-ocr.pdf'
dpi = 150

doc = fitz.open(f)
print(doc.layer_ui_configs())  # shows that almost all layers except "Text" are off

# Render the first page at the requested resolution (PDF user space is 72 dpi).
pix = doc[0].get_pixmap(matrix=fitz.Matrix(dpi / 72, dpi / 72))
img = Image.frombytes('RGB', (pix.width, pix.height), pix.samples)

img.show()  # displays an image that contains the content of all layers
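
To make the layer state printed in the reproduction easier to read, here is a minimal sketch that splits the entries into visible and hidden layer names. It assumes each dict returned by layer_ui_configs() carries the "text" and "on" keys described in the PyMuPDF documentation. The helper name split_layers and the sample data are hypothetical and only illustrate the shape of the input; they are not taken from the confidential file.

```python
def split_layers(ui_configs):
    """Split layer UI entries into (visible, hidden) lists of layer names.

    ui_configs is expected to look like the list returned by
    Document.layer_ui_configs(): dicts with at least "text" and "on".
    """
    visible, hidden = [], []
    for item in ui_configs:
        name = item.get("text", "")
        # A missing or falsy "on" value is treated as hidden.
        (visible if item.get("on") else hidden).append(name)
    return visible, hidden


# Hypothetical sample mimicking a document where only "Text" is on.
sample = [
    {"number": 0, "text": "Text", "type": "checkbox", "on": True},
    {"number": 1, "text": "Images", "type": "checkbox", "on": False},
    {"number": 2, "text": "Background", "type": "checkbox", "on": False},
]

visible, hidden = split_layers(sample)
print("visible:", visible)
print("hidden:", hidden)
```

With a real document, the call would be split_layers(doc.layer_ui_configs()). If the hidden list is non-empty but the rendered image still shows that content, the mismatch described in this report is confirmed.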

PyMuPDF version

1.24.9

Operating system

Windows

Python version

3.8

JorjMcKie commented 3 weeks ago

Cannot download from the supplied link. Please provide a working one.

PasaOpasen commented 3 weeks ago

@JorjMcKie I sent it to your email.

JorjMcKie commented 3 weeks ago

This is an upstream error. Opening a MuPDF report. Is this a confidential file, or can I simply attach it here?

PasaOpasen commented 3 weeks ago

@JorjMcKie Sorry, it is confidential.

Is there any way to hotfix the problem, such as removing the content of the hidden layers?

PasaOpasen commented 3 weeks ago

pypdfium2 encounters the same problem with this document, but pdf2image and python-poppler render it correctly.

JorjMcKie commented 3 weeks ago

Browsers give similarly inconsistent results: some show only the Text layer, while others also show other content.