manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

"No tesseract languages are available for use" #549

Closed: mariomadproductions closed this issue 2 years ago

mariomadproductions commented 2 years ago

I get this error when starting gImageReader, whether using the latest build from the Ubuntu PPA or building the latest commit (cdffc47) from source (GTK build option). I have these packages installed from the Ubuntu repos: tesseract-ocr, libtesseract4, libtesseract5, tesseract-ocr-eng. I'm on Linux Mint 20.4 (it's based on Ubuntu 20.04).

will7007 commented 2 years ago

Have you tried launching gImageReader from the command line? I was having a similar issue on Ubuntu 20.04 and got the following logs:

Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!

I have the same packages installed as you, but I noticed that eng.traineddata was only placed in /usr/share/tesseract-ocr/5/tessdata. I didn't have any success with export TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata (likely due to #405), so I just copied eng.traineddata over to 4.00/tessdata/, and it seems to be recognizing text fine.

Maybe it would be more proper to install a lower version of tesseract-ocr-eng to see if it goes into the right place.
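
For anyone checking the same thing, a minimal sketch of how to see which tessdata directory actually received the language file. The helper name find_traineddata is made up for illustration, and the default root /usr/share/tesseract-ocr is only an assumption taken from the paths in the log above; other distributions may use a different location.

```shell
# Hypothetical helper (not part of tesseract or gImageReader):
# print every <lang>.traineddata file found below a root directory.
# The default root is an assumption based on the paths in this thread.
find_traineddata() {
    lang="$1"
    root="${2:-/usr/share/tesseract-ocr}"
    # Errors (e.g. a missing root) are silenced; output is sorted for stable reading.
    find "$root" -type f -name "${lang}.traineddata" 2>/dev/null | sort
}

# Example:
#   find_traineddata eng
```

If the only hit is under a versioned directory other than the one named in the error message, the language package and the tesseract build in use probably disagree about the tessdata location.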

mariomadproductions commented 2 years ago

I get exactly the same error. Thanks to your advice, I made a symlink (instead of copying) from /usr/share/tesseract-ocr/5/tessdata/eng.traineddata to /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata, and it works now. It's good to have a workaround even if this probably isn't the proper solution.
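
The symlink workaround can be sketched roughly as below. The function name link_traineddata is hypothetical, and the directories in the example are simply the ones mentioned in this thread; this is a workaround sketch, not a recommended fix.

```shell
# Hypothetical helper: symlink <lang>.traineddata from one tessdata
# directory into another, refusing to act if the source file is missing.
link_traineddata() {
    lang="$1"
    src_dir="$2"
    dst_dir="$3"
    src="$src_dir/$lang.traineddata"
    if [ ! -f "$src" ]; then
        echo "missing: $src" >&2
        return 1
    fi
    # Create the destination directory if needed, then (re)create the link.
    mkdir -p "$dst_dir" && ln -sf "$src" "$dst_dir/$lang.traineddata"
}

# Equivalent one-off command for the system paths in this thread (needs root):
#   sudo ln -s /usr/share/tesseract-ocr/5/tessdata/eng.traineddata \
#              /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
```

A symlink avoids keeping two copies of the file, but it may break if the package that owns the source file is later removed or moved.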

P.S. I think you made some typos in your post.

will7007 commented 2 years ago

sudo apt install tesseract-ocr-eng=1:4.00~git30-7274cfa-1 placed eng.traineddata in the correct folder while using the latest release (or at least the version that I downloaded a few weeks ago when I posted my reply). It looks like the tesseract-ocr-eng version must match the version of Tesseract that the installed version of gImageReader was built with (4.00).

Newer versions of gImageReader seem to have switched over to Tesseract 5.00 (at least, this is the version used when I compile from source with the Qt build option).

manisandro commented 2 years ago

Looks like tesseract was built with an incorrect tessdata dir configuration. gImageReader does not have any tessdata dir detection logic; it relies on tesseract itself for listing the available tessdata.
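
Since the language list comes from tesseract itself, one way to check what it sees is the tesseract command-line tool's --list-langs option. The wrapper below is only a sketch: the name tesseract_has_lang is hypothetical, and it assumes the tesseract binary on the PATH corresponds to the library gImageReader is linked against, which may not hold when several versions are installed.

```shell
# Hypothetical helper: succeed if `tesseract --list-langs` reports the
# given language code on a line of its own.
tesseract_has_lang() {
    # stderr is merged in because some versions print the list there.
    tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Examples:
#   tesseract_has_lang eng && echo "eng is visible to tesseract"
#   TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata tesseract --list-langs
```

If the language is missing here as well, the problem is likely in the tesseract installation or its tessdata path rather than in gImageReader.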