pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Not reading the pdf file #851


drnko commented 1 year ago

Bug report

Whenever I convert an image to a PDF and then try to extract text from the converted PDF, the result from pdfplumber is blank.

What am I doing wrong?

Step 1:

Step 2:

===============================================================

Below is the code:

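from PIL import Image
import pdfplumber

# Convert the image to a single-page PDF
image_1 = Image.open(r'D:\ocr\images\barrel.jpg')
im_1 = image_1.convert('RGB')
im_1.save(r'test.pdf')

# Try to extract text from the converted PDF
inv_pdf = pdfplumber.open('test.pdf')
print('Result:', inv_pdf.pages[0].extract_text())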

===============================================================

Terminal:

PS D:\ocr> & "C:/Program Files/Python310/python.exe" d:/ocr/testing.py
Result:

PS D:\GitOCR\ocr>

===============================================================

Below are the PDF files converted from the image files:

test.pdf

test1.pdf

test2.pdf

pettzilla1 commented 1 year ago

It’s probably worth reporting this on pdfplumber’s GitHub; this is pdfminer.six.

drnko commented 1 year ago

Oh! Sorry

jcallaha commented 1 year ago

pdfplumber doesn’t do OCR (optical character recognition) - what you have done is just create a PDF with an image and no text. If you are starting with an image and want the “text” on the image you should look at Tesseract or services like AWS Textract. Good luck!
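To illustrate the point above, here is a minimal sketch of falling back to OCR when a PDF page has no text layer. It assumes the pytesseract package and the Tesseract binary are installed, which the thread does not confirm. The helper names (extract_text_or_ocr, ocr_image, demo) are hypothetical, and the file paths are the ones from the original report.

```python
from typing import Callable, Optional


def extract_text_or_ocr(
    extract_text: Callable[[], Optional[str]],
    run_ocr: Callable[[], str],
) -> str:
    """Return the PDF text layer if there is one, otherwise the OCR result."""
    # A PDF that only wraps an image yields an empty string (or None in
    # some pdfplumber versions), so treat both as "no text layer".
    text = (extract_text() or "").strip()
    return text if text else run_ocr().strip()


def ocr_image(image_path: str) -> str:
    """Run Tesseract on the image itself (hypothetical helper)."""
    # Imported lazily: these are third-party packages and pytesseract also
    # needs the Tesseract executable on PATH (assumption, not from the thread).
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def demo() -> None:
    """Example wiring using the paths from the original report."""
    import pdfplumber

    with pdfplumber.open("test.pdf") as pdf:
        page = pdf.pages[0]
        result = extract_text_or_ocr(
            page.extract_text,
            lambda: ocr_image(r"D:\ocr\images\barrel.jpg"),
        )
    print("Result:", result)
```

Calling demo() on the image-only test.pdf should take the OCR branch, since page.extract_text() finds no text there. OCR quality depends on the image, so the output may still need cleanup.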