pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

get_images() function doubt #3843

Closed (Thharshita closed this 2 months ago)

Thharshita commented 2 months ago

Description of the bug

[image: screenshot of the PDF page]

For the PDF page above, why does it give me so many image references when there are only 3 images in total?

[image] [image]

I am trying to extract the text and the images from a PDF. To extract the images I used page.get_images(), but it returned too many image tuples. Only the DCTDecode tuple referred to the right image (Fig. 3); all the others gave vague images, so I specifically considered only DCTDecode. But is there any way I can extract vector images, i.e. flowchart or flow-diagram type images?

Also, is there any way I can retrieve the label of the image, i.e. {Fig. 3. An example of using intelligent transportation systems for road traffic management through information on variable speed limit signs}? Thank you.

How to reproduce the bug

pip install pymupdf

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.11

JorjMcKie commented 2 months ago

Please be nice and attach the PDF as it is mandatory for bug reports.

As a note: just because you believe you are seeing only 3 images does not mean that this is the truth! I am willing to bet that these 2 flow charts are not images at all (but vector graphics - a completely different animal).

And that remaining "single" street scene image may have been composed of multiple tiny images, superposing each other or stitched together.
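One way to check this on a given page is to list every image entry together with its size, its stream filter and the rectangles where it is drawn. The following is only a minimal sketch: the helper names `summarize_image_filters` and `report_page_images` and the `pdf_path` / `page_number` parameters are made up for illustration, and it assumes a PyMuPDF version that can be imported as `pymupdf`.

```python
from collections import Counter


def summarize_image_filters(image_entries):
    """Count image entries per stream filter (e.g. 'DCTDecode', 'FlateDecode').

    With full=True, each entry of Page.get_images() is the tuple
    (xref, smask, width, height, bpc, colorspace, alt_colorspace,
     name, filter, referencer), so the filter is at index 8.
    """
    return Counter(entry[8] for entry in image_entries)


def report_page_images(pdf_path, page_number=0):
    """Print one line per image entry on the page (hypothetical helper)."""
    import pymupdf  # imported here so the helper above works without PyMuPDF

    doc = pymupdf.open(pdf_path)
    page = doc[page_number]
    entries = page.get_images(full=True)
    for entry in entries:
        xref, width, height, img_filter = entry[0], entry[2], entry[3], entry[8]
        # Rectangles on the page where this image is displayed.
        rects = page.get_image_rects(entry)
        print(f"xref={xref} size={width}x{height} filter={img_filter} rects={rects}")
    print(dict(summarize_image_filters(entries)))
    doc.close()
```

If many small entries have rectangles that tile or overlap the same area of the page, the "single" picture is probably assembled from several images.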

JorjMcKie commented 2 months ago

Vector graphics are usually composed of a great many drawing instructions. To make an image of them, we must first find a way to cluster the drawing instructions, then create an image (Pixmap) of the page area covered by each identified cluster. There is no way to extract vector graphics directly. Find the cluster rectangles of vector graphics via Page.cluster_drawings().
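A rough sketch of that approach: take the rectangles returned by Page.cluster_drawings(), discard the very small ones, and render each remaining area with Page.get_pixmap(clip=...). The helper names, the `min_side` threshold, the `dpi` value and the output file names are assumptions chosen for illustration, not recommended settings.

```python
def keep_large_rects(rects, min_side=50.0):
    """Drop cluster rectangles narrower or shorter than min_side points."""
    kept = []
    for rect in rects:
        x0, y0, x1, y1 = rect  # works for pymupdf.Rect and for plain 4-tuples
        if (x1 - x0) >= min_side and (y1 - y0) >= min_side:
            kept.append(rect)
    return kept


def render_vector_clusters(pdf_path, page_number=0, dpi=150, min_side=50.0):
    """Save each large drawing cluster of one page as a PNG (hypothetical helper)."""
    import pymupdf  # imported here so the helper above works without PyMuPDF

    doc = pymupdf.open(pdf_path)
    page = doc[page_number]
    saved = []
    clusters = keep_large_rects(page.cluster_drawings(), min_side)
    for i, rect in enumerate(clusters):
        # Rasterize only the clustered area of the page.
        pix = page.get_pixmap(clip=rect, dpi=dpi)
        name = f"page{page_number}-drawing{i}.png"
        pix.save(name)
        saved.append(name)
    doc.close()
    return saved
```

The size filter is there because thin lines, table borders and similar decorations also count as drawings; the threshold will likely need tuning per document.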

JorjMcKie commented 2 months ago

I am going to transfer this "bug" report to a Discussions item, as you obviously are not reporting a PyMuPDF problem.