tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

I tried to OCR a PDF file with ver 4 on Windows 10 but returned:... #1476

Closed. abdulbadii closed this issue 6 years ago

abdulbadii commented 6 years ago

I tried to OCR the file "Kamus_Arab-Indonesia.pdf" (in English: "Arabic-Indonesian Dictionary"), so I ran the following from the Tesseract install directory:

tesseract.exe D:\DOC\ARABIC\Kamus_Arab-Indonesia.pdf  z:\t\Kamus.pdf  -l ara+ind --psm 1

Tesseract Open Source OCR Engine v4.0.0-alpha.20180109 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

How can I solve this error? Many thanks in advance.

stweil commented 6 years ago

Tesseract does not support reading PDF files.
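
Since the input has to be an image, one possible workaround is to rasterize the PDF pages first and then run Tesseract on each page image. Below is a minimal sketch of that idea, not an official recipe. It assumes Poppler's `pdftoppm` tool is installed and on PATH (that tool is not mentioned in this thread), and the function names and the 300 DPI default are hypothetical choices for illustration. The `-l ara+ind --psm 1` options are taken from the command in the original report.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path, prefix, dpi=300):
    # Assumption: Poppler's pdftoppm is available; it writes
    # <prefix>-<page>.png for every page of the PDF.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_cmd(image_path, out_base, langs="ara+ind", psm=1):
    # The trailing "pdf" config asks tesseract to write <out_base>.pdf
    # (a searchable PDF) instead of plain text.
    return ["tesseract", str(image_path), str(out_base),
            "-l", langs, "--psm", str(psm), "pdf"]


def ocr_pdf_pages(pdf_path, work_dir):
    # Hypothetical helper: rasterize, then OCR each page image in order.
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf_path, work_dir / "page"), check=True)
    for image in sorted(work_dir.glob("page-*.png")):
        # page-01.png -> page-01.pdf, one output PDF per page
        subprocess.run(ocr_cmd(image, image.with_suffix("")), check=True)
```

Note that this produces one searchable PDF per page; merging them back into a single file would need a separate tool.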

You can try other software, for example OCRmyPDF.
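
For reference, a rough sketch of what calling OCRmyPDF on the file from the report might look like. This assumes the `ocrmypdf` command-line tool is installed and on PATH, and that the `ara` and `ind` language data are available to the Tesseract installation it uses. The helper names are hypothetical, and the file names in the usage line are placeholders.

```python
import shutil
import subprocess


def ocrmypdf_cmd(input_pdf, output_pdf, langs="ara+ind"):
    # ocrmypdf takes the input PDF and output PDF as positional
    # arguments; -l selects the OCR language(s), joined with "+".
    return ["ocrmypdf", "-l", langs, str(input_pdf), str(output_pdf)]


def run_ocrmypdf(input_pdf, output_pdf, langs="ara+ind"):
    # Hypothetical wrapper: fail early if the tool is not installed.
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(ocrmypdf_cmd(input_pdf, output_pdf, langs), check=True)
```

Usage would be along the lines of `run_ocrmypdf("Kamus_Arab-Indonesia.pdf", "Kamus.pdf")`, which writes a copy of the PDF with a searchable text layer added.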

ylluminarious commented 6 years ago

Apparently OCRmyPDF uses Tesseract under the hood, so I think that's important to note.