Trying to OCR a jpeg but getting [Error 3221225477]?

Helyux commented 6 years ago

Hello, any idea what the following error means? I didn't find anything except this, which didn't help me narrow it down.

Code Snippet:

import io

import pyocr
import pyocr.builders
from PIL import Image as PI
from wand.image import Image

# Assumed setup, not part of the original snippet; filepath and lang are defined elsewhere
tool = pyocr.get_available_tools()[0]
req_image = []
final_text = []

# Read in the PDF and convert it to JPEG
image_pdf = Image(filename=filepath, resolution=350)

# Only get the first page from the PDF
extractedfirstsite = image_pdf.sequence[0]
firstimage = Image(image=extractedfirstsite)
image_jpeg = firstimage.convert('jpeg')

# Append image blobs to the list
for img in image_jpeg.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('jpeg'))

# OCR every image blob and append the found text to the list
for img in req_image:
    txt = tool.image_to_string(
        PI.open(io.BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    final_text.append(txt)

The corresponding error:

  File "C:\Program Files (x86)\Python36-32\lib\site-packages\pyocr\tesseract.py", line 367, in image_to_string
    raise TesseractError(status, errors)
pyocr.error.TesseractError: (3221225477, b'')

I tested the general functionality of tesseract and it works as expected.

C:\>tesseract test.jpg out
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

I'm running:

Windows 10 (64 Bit)
Python 3.6.5 (32 Bit)
Tesseract (unofficial installer for windows for Tesseract 4.00-dev)
ImageMagick 6.9.9-40 Q8 (32 Bit)
Wand and PIL (both 32 Bit)

Any help would be appreciated.

jflesch commented 6 years ago

Can you try the following please:

C:\>tesseract test.jpg out
(...)
C:\>echo %ERRORLEVEL%

?
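
For reference, the same check can be scripted from Python. This is only a minimal sketch using the standard `subprocess` module; the helper name `run_and_report` is made up for illustration and is not part of pyocr.

```python
import subprocess
from typing import Sequence


def run_and_report(command: Sequence[str]) -> int:
    """Run a command, echo what it printed, and return its exit status."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    print("stdout:", result.stdout)
    print("stderr:", result.stderr)
    print("exit status:", result.returncode)
    return result.returncode


# Example call (assumes tesseract is on PATH and test.jpg exists):
# run_and_report(["tesseract", "test.jpg", "out"])
```

An exit status of 0 here corresponds to `echo %ERRORLEVEL%` printing 0.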

Helyux commented 6 years ago

Sure:

C:\>tesseract test.jpg out
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

C:\>echo %ERRORLEVEL%
0

jflesch commented 6 years ago

And there is no other message above the Python error message?

Helyux commented 6 years ago

Well, the full traceback would be:

Traceback (most recent call last):
  File "C:\Users\dummy\Desktop\_Core.py", line 247, in ocrpdf
    params = OcrPDF.ocr(qpath)
  File "C:\Users\dummy\Desktop\OcrPDF.py", line 68, in ocr
    builder=pyocr.builders.TextBuilder()
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\pyocr\tesseract.py", line 367, in image_to_string
    raise TesseractError(status, errors)
pyocr.error.TesseractError: (3221225477, b'')

Which shouldn't be relevant.

jflesch commented 6 years ago

Hmm. The only thing clear here is that Tesseract returned an error (exit code != 0) with no output at all on stdout/stderr. The exit code returned is 3221225477, i.e. 0xC0000005: ACCESS_VIOLATION. In other words, Tesseract has crashed.
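
The decimal-to-hex conversion can be reproduced with a few lines of Python. This is just a sketch: `describe_status` is a hypothetical helper, and the lookup table is deliberately limited to the one Windows status code relevant here.

```python
def describe_status(status: int) -> str:
    """Render a process exit status as hex and name it if recognised."""
    # Deliberately incomplete table of Windows NTSTATUS values.
    known = {
        0xC0000005: "STATUS_ACCESS_VIOLATION",
    }
    # Mask to 32 bits so a negative (signed) status maps to the same code.
    code = status & 0xFFFFFFFF
    return f"0x{code:08X} ({known.get(code, 'unknown')})"


print(describe_status(3221225477))  # prints: 0xC0000005 (STATUS_ACCESS_VIOLATION)
```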

I cannot figure out anything more at this point.

Anyway, AFAIK Tesseract 4.00 is still alpha. Have you tried with Tesseract 3.05.xx?

Helyux commented 6 years ago

I found a mistake on my side, I'm deeply sorry!

For anyone wondering: I didn't notice that I had installed Tesseract (unofficial installer for Windows for Tesseract 4.00-dev) in the (new) 64 Bit version.
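
In case it helps someone compare their own setup, here is a minimal sketch for checking whether the running Python interpreter is a 32 Bit or 64 Bit build. It uses only the standard library; `python_bitness` is a made-up helper name. It only reports the Python side, so the Tesseract build still has to be checked separately (for example, by looking at which installer was used).

```python
import platform
import struct
import sys


def python_bitness() -> int:
    """Return 32 or 64 depending on the interpreter's pointer size."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


print(f"Python {platform.python_version()} is a {python_bitness()} Bit build")
# Cross-check: sys.maxsize only exceeds 2**32 on 64 Bit builds.
print(f"sys.maxsize > 2**32: {sys.maxsize > 2**32}")
```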

jflesch commented 6 years ago

No problem

openpaperwork / pyocr

Trying to OCR a jpeg but getting [Error 3221225477]? #97