pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Images missing from TextPage dictionary #3446

Closed EluuuArcanum closed 1 month ago

EluuuArcanum commented 1 month ago

Description of the bug

Since version 1.21.1, images are not present in the TextPage dictionary, so the example in section 2 of the docs does not work: https://pymupdf.readthedocs.io/en/latest/recipes-images.html#how-to-extract-images-non-pdf-documents

How to reproduce the bug

This code snippet does not work: no image blocks are returned for any PDF file.

d = page.get_text("dict")
blocks = d["blocks"]  # the list of block dictionaries
imgblocks = [b for b in blocks if b["type"] == 1]
pprint(imgblocks[0])

PyMuPDF version

1.24.2

Operating system

Windows

Python version

3.8

JorjMcKie commented 1 month ago

This is not a bug! The default text extraction flag bits have changed since that version. If you need image meta info in this text extraction variant, use the composite bit value present in e.g. TEXTFLAGS_DICT, and things will work as before.
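A minimal sketch of that suggestion, not a definitive fix: it passes the composite flag value explicitly to `get_text("dict")` and then filters for image blocks as in the original snippet. The helper name `image_blocks` and the file name in the usage note are hypothetical. The import fallback to `fitz` is an assumption for releases that do not yet provide the `pymupdf` import name.

```python
def image_blocks(page, flags=None):
    """Return the image blocks (type == 1) from a page's "dict" extraction.

    `image_blocks` is a hypothetical helper name used for illustration.
    """
    if flags is None:
        # Import lazily; fall back to the older "fitz" import name
        # (assumption: one of the two is installed).
        try:
            import pymupdf
        except ImportError:
            import fitz as pymupdf
        # Composite flag value mentioned in the reply above.
        flags = pymupdf.TEXTFLAGS_DICT

    # Pass the flags explicitly instead of relying on the defaults.
    blocks = page.get_text("dict", flags=flags)["blocks"]
    return [b for b in blocks if b["type"] == 1]
```

Usage would look roughly like `doc = pymupdf.open("input.pdf")` followed by `image_blocks(doc[0])`, where `input.pdf` is a placeholder. If the list is still empty, the page may simply contain no images.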