pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.62k stars, 524 forks

Can't extract images for this PDF #3936

Closed: bbfrog closed this issue 1 month ago

bbfrog commented 1 month ago

Description of the bug

Monaleesa_full.pdf

PyMuPDF can't extract the images on page 2 and page 4 of this PDF.

How to reproduce the bug

import pymupdf

doc = pymupdf.open('Monaleesa_full.pdf')

page_num = 0
for page in doc:
    page_num += 1
    images = page.get_images(full=True)
    print(f'page {page_num}: {len(images)} images')

PyMuPDF version

1.24.11

Operating system

MacOS

Python version

3.12

JorjMcKie commented 1 month ago

Except for page 7 (0-based), none of the pages contains an image. What you see are vector graphics - no images.
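To check which kind of content a page actually holds, one can compare the number of embedded images with the number of vector drawing paths. The following is only a minimal sketch: the helper names count_page_content and report are made up for illustration, and the file name is the one from the report above.

```python
def count_page_content(page):
    """Return (embedded image count, vector drawing path count) for one page."""
    images = page.get_images(full=True)   # embedded raster images only
    drawings = page.get_drawings()        # vector paths (lines, curves, rectangles)
    return len(images), len(drawings)


def report(path):
    """Print the image and vector path counts for every page (hypothetical helper)."""
    import pymupdf  # imported here so count_page_content stays dependency-free

    doc = pymupdf.open(path)
    for page in doc:
        n_images, n_drawings = count_page_content(page)
        # page.number is 0-based
        print(f"page {page.number}: {n_images} images, {n_drawings} vector paths")


# Example call:
# report("Monaleesa_full.pdf")
```

A page that reports zero images but many vector paths is drawn with vector graphics, which is why get_images() finds nothing on it.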

JorjMcKie commented 1 month ago

Vector graphics cannot be extracted. All you can do is make a "photo" of the respective page area ...

bbfrog commented 1 month ago

The Acrobat API can extract the vector graphics and save them as PNG or SVG. How does it do this? Is it hard to support in PyMuPDF? Thanks!

JorjMcKie commented 1 month ago

You can try this script. Or do this:

import pymupdf

doc = pymupdf.open("input.pdf")
for page in doc:
    for i, bbox in enumerate(page.cluster_drawings()):
        pix = page.get_pixmap(clip=bbox, dpi=150)
        pix.save(f"{doc.name}-{page.number}-{i}.png")

bbfrog commented 1 month ago

Thanks very much, @JorjMcKie. It works and extracts the image I want. But it also extracts the tables in this PDF as drawings. Is there any field that can differentiate tables from other drawings? Thanks!
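
One possible approach, offered only as a sketch and not as an answer from the maintainers: cluster_drawings() returns plain rectangles with no type information, so the clusters could be compared against the table rectangles that page.find_tables() detects, skipping those that mostly overlap a table. The helper names and the 0.5 overlap threshold below are assumptions chosen for illustration, and table detection is heuristic, so it may miss tables or report false ones.

```python
def overlap_ratio(box, other):
    """Fraction of the area of `box` covered by `other`; boxes are (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    ox0, oy0, ox1, oy1 = other
    width = min(x1, ox1) - max(x0, ox0)
    height = min(y1, oy1) - max(y0, oy0)
    area = (x1 - x0) * (y1 - y0)
    if width <= 0 or height <= 0 or area <= 0:
        return 0.0
    return (width * height) / area


def split_clusters(clusters, table_boxes, threshold=0.5):
    """Split drawing clusters into (figures, tables) by their overlap with table boxes."""
    figures, tables = [], []
    for box in clusters:
        box = tuple(box)  # accepts plain tuples as well as pymupdf.Rect objects
        if any(overlap_ratio(box, t) >= threshold for t in table_boxes):
            tables.append(box)
        else:
            figures.append(box)
    return figures, tables


def save_non_table_drawings(path, dpi=150):
    """Render every drawing cluster that does not look like a table to a PNG file."""
    import pymupdf  # imported here so the helpers above stay dependency-free

    doc = pymupdf.open(path)
    for page in doc:
        table_boxes = [tuple(t.bbox) for t in page.find_tables().tables]
        figures, _ = split_clusters(page.cluster_drawings(), table_boxes)
        for i, box in enumerate(figures):
            pix = page.get_pixmap(clip=pymupdf.Rect(box), dpi=dpi)
            pix.save(f"{doc.name}-{page.number}-{i}.png")


# Example call:
# save_non_table_drawings("input.pdf")
```

The threshold is the main knob: a lower value treats more clusters as tables, while a higher value keeps clusters that only partly touch a table.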