jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

obtaining pictures in PDF #888

Closed by willzgr 1 year ago

willzgr commented 1 year ago

May I ask whether the next version will support obtaining pictures in a PDF and manipulating them, as well as substituting text by obtaining the coordinates of pictures?

jsvine commented 1 year ago

obtaining pictures in PDF

I would like to add features for more easily accessing/saving images embedded in PDFs. I cannot, however, guarantee that such features will be available in the next version of pdfplumber.

and manipulating pictures

pdfplumber is focused solely on extracting information from PDFs, and has no support (current or planned) for manipulating PDFs (or the images inside them).

as well as the function of substituting text by obtaining the coordinates of pictures?

I don't think I fully understand this part of the question. But you can currently access the coordinates of images embedded in PDFs via the attributes of the objects in page.images.
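
As a rough illustration of reading those coordinates, here is a minimal sketch. It assumes each entry in page.images is a dict with "x0", "top", "x1" and "bottom" keys; the helper names and the PDF path passed in are hypothetical.

```python
def image_bboxes(images):
    """Return (x0, top, x1, bottom) for each image dict in `images`."""
    return [(img["x0"], img["top"], img["x1"], img["bottom"]) for img in images]


def page_image_bboxes(pdf_path, page_index=0):
    """Open a PDF (path supplied by the caller) and list image boxes on one page."""
    # Imported lazily so image_bboxes() can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return image_bboxes(pdf.pages[page_index].images)
```

Calling page_image_bboxes("example.pdf") would return one tuple per embedded image on the first page, with "top" and "bottom" measured from the top of the page.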

willzgr commented 1 year ago

Thank you very much for your reply. Let me explain my question in more detail. I need to convert PDFs that contain both pictures and text into text-only documents. My idea is to use pdfplumber to obtain the location of each picture, convert the pictures into text with OCR or some other means, put that text back in the original position, and then use another PDF library to produce the text-only document. However, pdfplumber does not seem to support finding out whether the object at a given coordinate is a picture, text, or something else, nor can it replace content. Maybe I just haven't found the right call yet.

jsvine commented 1 year ago

Hi @willzgr, you are correct: pdfplumber cannot replace content; it is focused only on reading PDFs and extracting information from them, rather than creating or editing them.

As for coordinates: The page.images property returns all images on that page, although this only works for original digital PDFs, rather than rasterized or scanned PDFs that have been OCR-ed.
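
On the question of telling what kind of object sits at a given coordinate, one possible approach is a simple bounding-box lookup. The sketch below assumes the input mirrors pdfplumber's page.objects, that is, a dict mapping an object type such as "char" or "image" to a list of dicts with "x0", "top", "x1" and "bottom" keys. The function name is hypothetical and not part of pdfplumber.

```python
def object_types_at(objects_by_type, x, y):
    """Return the sorted object types whose bounding box contains point (x, y)."""
    hits = set()
    for object_type, objects in objects_by_type.items():
        for obj in objects:
            inside_x = obj["x0"] <= x <= obj["x1"]
            inside_y = obj["top"] <= y <= obj["bottom"]
            if inside_x and inside_y:
                hits.add(object_type)
                break  # one match is enough for this type
    return sorted(hits)
```

With a page opened in pdfplumber, object_types_at(page.objects, 72, 100) would list every object type covering that point. This only identifies objects; it does not replace them.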

XiYuan68 commented 1 year ago

Hi @jsvine, I wonder whether extracting images is available in pdfplumber now? I am trying to get a Pillow image with code like this, but without any luck:

Image.frombytes('RGB', image['srcsize'], image['stream'].rawdata)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/user/miniconda3/envs/diamondforce/lib/python3.10/site-packages/PIL/Image.py", line 2969, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/home/user/miniconda3/envs/diamondforce/lib/python3.10/site-packages/PIL/Image.py", line 830, in frombytes
    raise ValueError(msg)
ValueError: not enough image data

Is it possible to convert pdfplumber's image objects into Pillow images now? Or should I look into some other Python libraries?
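
A likely cause of the ValueError, offered as an assumption rather than a confirmed diagnosis: Image.frombytes expects uncompressed pixel data (width × height × 3 bytes for 'RGB'), whereas the stream's rawdata is typically still encoded, for example as JPEG or Flate-compressed bytes. The sketch below only classifies a byte string; the function names are hypothetical.

```python
# Bytes per pixel for a few 8-bit Pillow modes.
BYTES_PER_PIXEL = {"L": 1, "RGB": 3, "RGBA": 4}


def expected_raw_size(mode, size):
    """Number of bytes Image.frombytes needs for uncompressed data."""
    width, height = size
    return width * height * BYTES_PER_PIXEL[mode]


def looks_like_jpeg(data):
    """JPEG files begin with the two marker bytes FF D8."""
    return data[:2] == b"\xff\xd8"


def describe_stream(mode, size, data):
    """Classify bytes as 'jpeg', 'raw', or 'encoded-or-other'."""
    if looks_like_jpeg(data):
        return "jpeg"
    if len(data) == expected_raw_size(mode, size):
        return "raw"
    return "encoded-or-other"
```

If the result is "jpeg", Image.open(io.BytesIO(data)) is the usual way to load the bytes; if it is "raw", Image.frombytes should accept them. For other cases, the stream's get_data() method may return decoded bytes, though the colorspace and bit depth still have to match the chosen mode.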

XiYuan68 commented 1 year ago

My problem is solved by PyMuPDF: https://pymupdf.readthedocs.io/en/latest/document.html#Document.extract_image
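
A minimal sketch of that approach, assuming PyMuPDF is installed and importable as fitz. The function names, file naming scheme and output directory are hypothetical choices for illustration.

```python
from pathlib import Path


def image_filename(page_number, xref, ext):
    """Build an output file name such as 'page1-xref7.png'."""
    return f"page{page_number}-xref{xref}.{ext}"


def extract_images(pdf_path, out_dir):
    """Save every embedded image in the PDF and return the written paths."""
    # Imported lazily so image_filename() works without PyMuPDF installed.
    import fitz  # PyMuPDF

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for item in page.get_images(full=True):
                xref = item[0]  # first field is the image's cross-reference number
                info = doc.extract_image(xref)  # dict with "ext" and "image" bytes
                target = out / image_filename(page_number, xref, info["ext"])
                target.write_bytes(info["image"])
                saved.append(target)
    return saved
```

The saved files can then be opened with Pillow via Image.open. An image referenced on several pages is written once per page by this sketch.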

jsvine commented 1 year ago

Is it possible to convert image objects of pdfplumber into pillow images now? Or should I look into some other python libraries?

pdfplumber does not yet have strong support for this, so I think another library is likely best.

My problem is solved by PyMuPDF: https://pymupdf.readthedocs.io/en/latest/document.html#Document.extract_image

Good to know, and thanks for sharing!