jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Recognize workflow images created by MS visio as text #1055

Closed a4073631 closed 9 months ago

a4073631 commented 10 months ago

Thank you for providing a good PDF parser.

If a figure is not saved as an image in a PDF (for example, a workflow diagram created with MS Visio), its contents are extracted as text.

Is there a way to extract figures like these from the PDF as images as well?

The problematic PDF is attached.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open('hi_test.pdf')
page = pdf.pages[0]
print(page.images)

Output

page.images = []
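An empty `page.images` is consistent with the figure being drawn as vector shapes plus text rather than embedded as a raster image. One possible workaround, offered as a minimal sketch and not as a confirmed answer from the maintainers, is to compute the bounding box of the page's vector objects and render that region to a PNG. The helper names `union_bbox` and `render_vector_figure` and the output file name are hypothetical. The sketch assumes every line, rect, and curve on the first page belongs to the figure, and that the rendering backend required by `to_image` is installed.

```python
def union_bbox(objects, padding=0.0):
    """Return (x0, top, x1, bottom) enclosing all objects, or None if empty.

    Each object is a dict with "x0", "top", "x1", "bottom" keys, which is
    the shape of the entries in page.lines, page.rects, and page.curves.
    """
    objects = list(objects)
    if not objects:
        return None
    x0 = min(o["x0"] for o in objects) - padding
    top = min(o["top"] for o in objects) - padding
    x1 = max(o["x1"] for o in objects) + padding
    bottom = max(o["bottom"] for o in objects) + padding
    return (x0, top, x1, bottom)


def render_vector_figure(pdf_path, out_path, resolution=150):
    """Hypothetical helper: rasterize the area covered by vector shapes."""
    import pdfplumber  # imported lazily so union_bbox works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        # Assumption: all vector shapes on this page are part of the figure.
        shapes = page.lines + page.rects + page.curves
        bbox = union_bbox(shapes)
        if bbox is None:
            return None
        # Clamp to the page bounds, since crop() rejects boxes outside them.
        px0, ptop, px1, pbottom = page.bbox
        bbox = (max(bbox[0], px0), max(bbox[1], ptop),
                min(bbox[2], px1), min(bbox[3], pbottom))
        if bbox[2] <= bbox[0] or bbox[3] <= bbox[1]:
            return None  # nothing with a drawable area
        page.crop(bbox).to_image(resolution=resolution).save(out_path)
        return bbox
```

Calling `render_vector_figure('hi_test.pdf', 'figure.png')` would write the cropped region to `figure.png` if any shapes are found. If the page mixes several figures with other line art such as table borders, the shapes would need to be filtered or clustered before taking the union.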

PDF file

hi_test.pdf

Environment