Not reading the pdf file

drnko commented 1 year ago

Bug report

Whenever I convert an image to PDF and try to extract the text from the converted PDF, the result from pdfplumber is blank.

What am I doing wrong?

Step 1:

Convert an image (jpeg/jpg/png) to PDF using PIL.
Save the converted PDF file.

Step 2:

Open the converted/saved PDF using pdfplumber.open().
Extract text from the opened PDF file.

===============================================================

Below is the code:

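from PIL import Image
import pdfplumber

# Step 1: convert the image to RGB and save it as a PDF
image_1 = Image.open(r'D:\ocr\images\barrel.jpg')
im_1 = image_1.convert('RGB')
im_1.save(r'test.pdf')

# Step 2: open the saved PDF and extract text from the first page
inv_pdf = pdfplumber.open('test.pdf')
print('Result:', inv_pdf.pages[0].extract_text())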

===============================================================

Terminal:

PS D:\ocr> & "C:/Program Files/Python310/python.exe" d:/ocr/testing.py
Result:

PS D:\GitOCR\ocr>

===============================================================

Below are the PDF files converted from image files:

test.pdf

test1.pdf

test2.pdf

pettzilla1 commented 1 year ago

It’s probably worth reporting this on pdfplumber’s GitHub; this is pdfminer.six.

drnko commented 1 year ago

Oh! Sorry

jcallaha commented 1 year ago

pdfplumber doesn’t do OCR (optical character recognition). What you have done is just create a PDF that contains an image and no text layer, so there is nothing for extract_text() to return. If you are starting with an image and want the “text” shown in the image, you should look at Tesseract or services like AWS Textract. Good luck!
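For anyone landing here with the same symptom, here is a rough sketch of how one might detect an image-only page and fall back to OCR. It assumes the pytesseract wrapper and a local Tesseract install, neither of which appears in the original report; the helper names looks_image_only and text_or_ocr are made up for illustration and are not part of pdfplumber or pdfminer.six.

```python
from typing import Optional


def looks_image_only(text: Optional[str], image_count: int) -> bool:
    """True when a page has embedded images but no extractable text layer."""
    return image_count > 0 and not (text or "").strip()


def text_or_ocr(pdf_path: str, image_path: str) -> str:
    """Return the first page's text layer, or OCR the source image if there is none."""
    # Third-party imports are kept local so looks_image_only() works without them.
    # Assumed setup: pip install pdfplumber pytesseract pillow, plus the Tesseract binary.
    import pdfplumber
    import pytesseract
    from PIL import Image

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        text = page.extract_text()
        if not looks_image_only(text, len(page.images)):
            return text or ""

    # No text layer was found: run Tesseract on the original image instead.
    return pytesseract.image_to_string(Image.open(image_path))


# Example call, using the paths from the report above:
# print('Result:', text_or_ocr('test.pdf', r'D:\ocr\images\barrel.jpg'))
```

OCR quality depends heavily on the image, so the returned text may need cleanup; this only illustrates where the text has to come from when the PDF itself holds none.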

pdfminer / pdfminer.six

Not reading the pdf file #851