sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Reading config file using tesserocr #233

Closed. sanitha-studio closed this issue 4 years ago

sanitha-studio commented 4 years ago

The config file has no effect with tesserocr:

I am using Tesseract 4.1.0, and the whitelist (I tried the blacklist too) works for me with pytesseract:

custom_config = r'-c tessedit_char_whitelist=abcd'
print(pytesseract.image_to_string(img, config=custom_config))

I also tried it with a config file:

print(pytesseract.image_to_string(img, config='letters'))

My test config file is as simple as this:

tessedit_char_blacklist abcd

But it is NOT working with tesserocr:

with PyTessBaseAPI(psm=PSM.SINGLE_COLUMN, oem=OEM.LSTM_ONLY) as api:
    api.SetImage(img)
    api.SetVariable("tessedit_char_whitelist", "abcd")
    api.Recognize()
    print(api.GetUTF8Text())

The code below does not work either (I tried the blacklist too):

with PyTessBaseAPI(psm=PSM.SINGLE_COLUMN, oem=OEM.LSTM_ONLY) as api:
    api.ReadConfigFile("letters")
    api.SetImage(img)
    api.Recognize()
    print(api.GetUTF8Text())

Is there any other way to load a config file with tesserocr? The whitelist itself does not seem to be the issue; I believe the problem is in how tesserocr reads the config file. Please help!

sirfz commented 4 years ago

Does #174 answer your question?

sanitha-studio commented 4 years ago

Thanks for the response. However, the whitelist and blacklist features are said to work from v4.1 onward, and I got them working with pytesseract (see my comment above). My problem is with tesserocr: I tried both SetVariable and a config file, and neither works (see my code). Is this the right way to call ReadConfigFile? How can I test whether the config file itself is taking effect with tesserocr? I have been using the config file to switch off the default dictionary, and now I doubt whether that was working either, if the config file has no effect when loaded through tesserocr (api.ReadConfigFile("letters")).

sanitha-studio commented 4 years ago

I noticed one thing. The below code gives me Tesseract v4.0.0

import tesserocr
print(tesserocr.tesseract_version()) 

But I have actually uninstalled Tesseract v4.0 and now have Tesseract v4.1.0.

Is tesserocr using some built-in Tesseract version? Maybe this is why the blacklist and whitelist are not working. Please help me with a solution. FYI: I am working on Windows.

sirfz commented 4 years ago

> I noticed one thing. The below code gives me Tesseract v4.0.0

There's your problem: tesserocr is compiled against tesseract v4.0.0. You have to re-install with the proper tesseract version.
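
A minimal sketch of that check, for anyone hitting the same symptom. It parses the string returned by tesserocr.tesseract_version() and compares it against 4.1.0, the version this thread reports as the point where the whitelist and blacklist work. The helper names parse_tesseract_version and linked_version_ok are hypothetical, not part of tesserocr, and the threshold is taken from the thread rather than from the Tesseract changelog.

```python
import re

# Threshold taken from this thread: whitelist/blacklist are reported to work
# from Tesseract 4.1 onward. This is an assumption, not an official constant.
MIN_WHITELIST_VERSION = (4, 1, 0)


def parse_tesseract_version(text):
    """Return (major, minor, patch) from a string such as
    'tesseract v4.1.0-elag2019' or 'tesseract 4.0.0', or None if absent."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def linked_version_ok(text, minimum=MIN_WHITELIST_VERSION):
    """True if the first version number found in `text` is at least `minimum`."""
    version = parse_tesseract_version(text)
    return version is not None and version >= minimum


if __name__ == "__main__":
    try:
        import tesserocr
    except ImportError:
        print("tesserocr is not importable in this environment")
    else:
        # This is the version tesserocr was built against, which may differ
        # from the tesseract executable installed on the system.
        reported = tesserocr.tesseract_version()
        print(reported)
        print("new enough for whitelist:", linked_version_ok(reported))
```

If this reports a version below 4.1.0 while the tesseract executable on the system is newer, the two installations are out of sync, which matches the situation described above.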

sanitha-studio commented 4 years ago

I tried uninstalling tesserocr with pip uninstall tesserocr:

  1. When I then ran import tesserocr, it did NOT raise any error, even after uninstalling.
  2. My system (Windows) now has Tesseract v4.1, and when I ran pip install tesserocr, it failed with the error below:
DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
      _LOGGER.warn('Failed to extract tesseract version number from: {}'.format(version))
    Failed to extract tesseract version number from: tesseract v4.1.0-elag2019
     leptonica-1.78.0
.
.
.
 tesserocr.cpp(634): fatal error C1083: Cannot open include file: 'leptonica/allheaders.h': No such file or directory
  3. Is tesserocr compatible with Tesseract v4.1 and later?

sirfz commented 4 years ago

On Windows, the recommended installation method is via Conda. I'm not sure if it's already built against tesseract 4.1 (but possibly yes).

sanitha-studio commented 4 years ago

I installed via conda; the Tesseract version is now v4.1.1, the latest tesserocr is installed, and everything works fine. Thank you.
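
On the earlier question of how to test whether a config file is taking effect: one possible approach is to read the variables back after loading the file and compare them with what the file should have set. The sketch below assumes PyTessBaseAPI.GetVariableAsString returns the current value as a string (or None if the variable is unknown). The helper names unapplied_settings and check_config are hypothetical, and how the name "letters" resolves to a file on disk is not confirmed by this thread.

```python
def unapplied_settings(api, expected):
    """Return {name: actual_value} for each variable whose current value
    differs from the one in `expected`.

    `api` is any object exposing GetVariableAsString(name), such as a
    tesserocr PyTessBaseAPI instance.
    """
    mismatches = {}
    for name, wanted in expected.items():
        actual = api.GetVariableAsString(name)
        if actual != wanted:
            mismatches[name] = actual
    return mismatches


def check_config(config_name="letters"):
    """Load `config_name` and report which expected settings did not apply.

    Requires tesserocr and its language data; imported here so that the
    helper above stays usable without them.
    """
    from tesserocr import PyTessBaseAPI, PSM, OEM

    with PyTessBaseAPI(psm=PSM.SINGLE_COLUMN, oem=OEM.LSTM_ONLY) as api:
        api.ReadConfigFile(config_name)
        # Expected value taken from the test config file shown in this thread.
        return unapplied_settings(api, {"tessedit_char_blacklist": "abcd"})
```

Calling check_config() and getting back an empty dict would suggest the file was read and the variable was set. It does not show that the engine honours the variable during recognition, which in this thread turned out to depend on the Tesseract version tesserocr was built against.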