Closed willzgr closed 1 year ago
obtaining pictures in PDF
I would like to add features that make it easier to access and save images embedded in PDFs. I cannot, however, guarantee that such features will be available in the next version of pdfplumber.
and manipulating pictures
pdfplumber is focused solely on extracting information from PDFs, and has no support (current or planned) for manipulating PDFs (or the images inside them).
as well as the function of substituting text by obtaining the coordinates of pictures?
I don't think I fully understand this part of the question. But you can currently access the coordinates of images embedded in PDFs, via the attributes provided in the page.images objects.
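As a rough sketch of what reading those coordinates can look like: each entry in page.images is a dictionary, and keys such as x0, top, x1, and bottom describe its bounding box, with top and bottom measured from the top of the page. The helper names image_bboxes and print_image_bboxes are made up for illustration and are not part of pdfplumber.

```python
def image_bboxes(images):
    """Return (x0, top, x1, bottom) for each image dictionary.

    `images` is expected to look like pdfplumber's page.images:
    a list of dictionaries carrying the keys used below.
    """
    return [(im["x0"], im["top"], im["x1"], im["bottom"]) for im in images]


def print_image_bboxes(pdf_path):
    """Print the bounding box of every embedded image, page by page."""
    import pdfplumber  # imported here so image_bboxes stays dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for bbox in image_bboxes(page.images):
                print(page.page_number, bbox)
```

Calling print_image_bboxes("example.pdf") (a placeholder path) would list one bounding box per embedded image.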
Thank you very much for your reply. Let me explain my question in more detail. I need to convert PDFs that contain both pictures and text into text-only documents. My idea is to use pdfplumber to obtain the locations of the pictures in the PDF, convert the pictures into text with OCR or some other means, put that text back in the original positions, and then use another PDF library to produce the text-only document. However, it seems that pdfplumber cannot tell me whether the object at a given coordinate is a picture, text, or something else, nor can it replace content. Maybe I just haven't found the right call yet.
Hi @willzgr, you are correct: pdfplumber cannot replace content; it is focused only on reading PDFs and extracting information from them, rather than creating or editing them.
As for coordinates: The page.images property will return all images on a given page, although this will only work for original digital PDFs, rather than rasterized or scanned PDFs that have been OCR-ed.
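For the "what is at this coordinate?" part of the question, one possible approach is to test the point against the bounding box of every object pdfplumber reports. This sketch assumes a mapping shaped like page.objects, that is, object type (such as "char" or "image") mapped to a list of dictionaries with x0, x1, top, and bottom keys. kinds_at is a hypothetical helper, not a pdfplumber function.

```python
def kinds_at(objects_by_type, x, y):
    """Return the sorted object types whose bounding box contains (x, y).

    `objects_by_type` is assumed to look like pdfplumber's page.objects.
    Coordinates use pdfplumber's convention: y is measured from the page top.
    """
    kinds = set()
    for kind, objs in objects_by_type.items():
        for obj in objs:
            inside_x = obj["x0"] <= x <= obj["x1"]
            inside_y = obj["top"] <= y <= obj["bottom"]
            if inside_x and inside_y:
                kinds.add(kind)
                break  # one hit is enough for this object type
    return sorted(kinds)
```

With a real page, the call would be kinds_at(page.objects, x, y). A result of ["image"] means only an image covers that point, and an empty list means nothing does.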
Hi @jsvine, I wonder whether extracting images is available in pdfplumber now? I am trying to get a Pillow image with code like this, but am not having any luck:
Image.frombytes('RGB', image['srcsize'], image['stream'].rawdata)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/user/miniconda3/envs/diamondforce/lib/python3.10/site-packages/PIL/Image.py", line 2969, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/home/user/miniconda3/envs/diamondforce/lib/python3.10/site-packages/PIL/Image.py", line 830, in frombytes
    raise ValueError(msg)
ValueError: not enough image data
Is it possible to convert pdfplumber image objects into Pillow images now? Or should I look into other Python libraries?
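A likely cause of the "not enough image data" error is that the stream's raw bytes are still in the PDF's compressed encoding (for example JPEG or Flate), so there are fewer bytes than the width * height * 3 that Image.frombytes expects for an RGB image. The sketch below is one way to check for that. It assumes 8 bits per channel, and it assumes the bytes come from the stream's get_data() method (from pdfminer.six, which pdfplumber builds on) rather than from rawdata. All helper names are hypothetical.

```python
import io


def raw_size_needed(size, mode="RGB"):
    """Bytes needed for an uncompressed image at 8 bits per channel."""
    channels = {"L": 1, "RGB": 3, "CMYK": 4}[mode]
    width, height = size
    return width * height * channels


def is_jpeg(data):
    """True if the bytes start with the JPEG start-of-image marker."""
    return data[:3] == b"\xff\xd8\xff"


def stream_bytes_to_pillow(data, size, mode="RGB"):
    """Best-effort conversion of image stream bytes to a Pillow image."""
    from PIL import Image  # imported here so the helpers above need no Pillow

    if is_jpeg(data):
        # JPEG data is a complete image file; let Pillow parse it.
        return Image.open(io.BytesIO(data))
    if len(data) != raw_size_needed(size, mode):
        raise ValueError("bytes are compressed or not in the assumed mode")
    return Image.frombytes(mode, size, data)


# Assumed usage with a pdfplumber image dictionary:
# img = stream_bytes_to_pillow(image["stream"].get_data(), image["srcsize"])
```

This does not cover every encoding a PDF can use (indexed color, 1-bit masks, JBIG2, and so on), which is part of why a dedicated extraction library can be the easier route.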
My problem is solved by PyMuPDF: https://pymupdf.readthedocs.io/en/latest/document.html#Document.extract_image
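For readers landing here, a minimal sketch of the PyMuPDF route linked above. It relies on Page.get_images and Document.extract_image, whose result includes the image bytes under "image" and a file extension under "ext". The function names, the output naming scheme, and the paths are assumptions made for illustration.

```python
import os


def image_filename(page_number, xref, ext):
    """Build an output file name (naming scheme is arbitrary)."""
    return f"page{page_number}-xref{xref}.{ext}"


def save_embedded_images(pdf_path, out_dir="."):
    """Write every embedded image to out_dir and return the saved paths."""
    import fitz  # PyMuPDF; imported here so image_filename has no dependencies

    saved = []
    with fitz.open(pdf_path) as doc:
        for page_index in range(len(doc)):
            for img in doc[page_index].get_images(full=True):
                xref = img[0]  # first tuple item is the image's xref number
                info = doc.extract_image(xref)
                name = image_filename(page_index + 1, xref, info["ext"])
                path = os.path.join(out_dir, name)
                with open(path, "wb") as fh:
                    fh.write(info["image"])
                saved.append(path)
    return saved
```

An image referenced from several pages is written once per reference with this naming scheme. Deduplicating by xref would avoid that.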
Is it possible to convert pdfplumber image objects into Pillow images now? Or should I look into other Python libraries?
pdfplumber does not yet have strong support for this, so I think another library is likely best.
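If staying within pdfplumber is important, one workaround is to render the region of the page that the image occupies, rather than extract the embedded bytes. The sketch below crops each page to an image's bounding box and rasterizes it with to_image, whose result exposes a Pillow image as .original. This gives a re-rendered picture at the chosen resolution, not the original image data. The helper names and the default resolution are assumptions.

```python
def clamp_bbox(bbox, page_bbox):
    """Shrink bbox so it lies within page_bbox; both are (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))


def render_image_regions(pdf_path, resolution=150):
    """Return one rendered Pillow image per embedded image in the PDF."""
    import pdfplumber  # imported here so clamp_bbox stays dependency-free

    rendered = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for im in page.images:
                bbox = (im["x0"], im["top"], im["x1"], im["bottom"])
                # Images can extend past the page edge; cropping expects
                # a box inside the page, so clamp first.
                region = page.crop(clamp_bbox(bbox, page.bbox))
                rendered.append(region.to_image(resolution=resolution).original)
    return rendered
```

Anything drawn on top of the image (text, lines) is rendered too, and an image lying entirely off the page would still need to be skipped before cropping.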
My problem is solved by PyMuPDF: https://pymupdf.readthedocs.io/en/latest/document.html#Document.extract_image
Good to know, and thanks for sharing!
May I ask if the next version supports the function of obtaining pictures in PDF and manipulating pictures, as well as the function of substituting text by obtaining the coordinates of pictures?