openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Error opening chinese data file #70

Closed: bclyc closed this issue 7 years ago

bclyc commented 7 years ago

I got this error:

TesseractError: (1, 'Error opening data file /usr/workspace/tesseract/chi-sim.traineddata\nPlease make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.\nFailed loading language \'chi-sim\'\nTesseract couldn\'t load any languages!\nCould not initialize tesseract.\n')

Then I tried the eng and fra traineddata files, and both worked.

It took me a long time to find out that it was a naming problem. After I renamed the file from "chi-sim.traineddata" to "chi.traineddata" and updated the language name in my programs, everything worked. I guess pyocr has a problem reading data files with "-" in their names. However, the official tesseract doesn't have this issue.

Please fix this, thank you!

jflesch commented 7 years ago

from pprint import pprint
from PIL import Image
import pyocr

t = pyocr.get_available_tools()[0]  # pyocr.tesseract
img = Image.open("test.png")  # placeholder path; any image Pillow can open

pprint(t.get_available_languages())
# ['chi_sim',
# (...)
# 'chi_tra',
# (...)
# 'deu-frak',
# (...)
# 'afr']

t.image_to_string(img, lang="chi_sim")  # works for me
t.image_to_string(img, lang="deu-frak")  # works for me too

t.image_to_string(img, lang="chi-sim")  # fails
# TesseractError: (1, b'Tesseract Open Source OCR Engine v3.03 with Leptonica\nError opening data file
#  /usr/share/tesseract-ocr/tessdata/chi-sim.traineddata\nPlease make sure the TESSDATA_PREFIX
# environment variable is set to the parent directory of your "tessdata" directory.\n
# Failed loading language \'chi-sim\'\nTesseract couldn\'t load any languages!\n
# Could not initialize tesseract.\n')
t.image_to_string(img, lang="deu_frak")  # fails

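For me, it is not a bug in PyOCR. It is the expected behavior: the lang value has to match one of the names returned by get_available_languages() exactly, so "chi_sim" and "deu-frak" work while "chi-sim" and "deu_frak" do not.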
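
If you would rather not depend on remembering which names use "-" and which use "_", one option is to resolve the requested name against the list of available languages before calling image_to_string(). The helper below is only a sketch: resolve_language() is a hypothetical name, not part of PyOCR, and the sample list is made up to look like the output shown above.

```python
def resolve_language(requested, available):
    """Return the entry of `available` matching `requested`, or None.

    An exact match wins. Otherwise "-" and "_" are treated as
    interchangeable and case is ignored.
    Hypothetical helper, not part of PyOCR.
    """
    if requested in available:
        return requested

    def normalize(name):
        return name.replace("-", "_").lower()

    wanted = normalize(requested)
    matches = [name for name in available if normalize(name) == wanted]
    # Accept only an unambiguous match; otherwise let the caller decide.
    return matches[0] if len(matches) == 1 else None


# Made-up list, shaped like the result of get_available_languages()
available = ["chi_sim", "chi_tra", "deu-frak", "afr"]

print(resolve_language("chi-sim", available))
print(resolve_language("deu_frak", available))
print(resolve_language("xyz", available))
```

With PyOCR, you would pass t.get_available_languages() as the second argument and use the returned name as the lang value, handling the None case yourself (for example by raising an error that lists the available languages).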