tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

"Failed loading language" error when using multiple languages for recognition #4284

Open captain-yoshi opened 1 month ago

captain-yoshi commented 1 month ago

Current Behavior

I get an error when trying to read text from this image:

50uL

$ tesseract 50uL.png - -l eng+ell

Error opening data file /usr/share/tesseract-ocr/5/tessdata/grc.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'grc'
Volume: 50 pl

$ ls /usr/share/tesseract-ocr/5/tessdata/
configs  ell.traineddata  eng.traineddata  pdf.ttf  tessconfigs

I am using the traineddata files from tessdata_best.

Expected Behavior

I would expect to be able to use multiple languages, as stated in the Tesseract documentation.

Suggested Fix

Is there a better way to recognize the Greek letter μ when it is used in English text? Maybe I have to train a new dataset...

tesseract -v

tesseract 5.4.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3

Operating System

Ubuntu 24.04 Noble

Other Operating System

No response

uname -a

Linux captain-yoshi 5.15.0-113-generic #123~20.04.1-Ubuntu SMP Wed Jun 12 17:33:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Compiler

No response

CPU

Intel(R) Core(TM) i7-7700HQ

Virtualization / Containers

No response

Other Information

No response

captain-yoshi commented 1 month ago

OK, found my problem. The ell language has a dependency on grc. All is good now :)
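
For anyone hitting the same error, here is a minimal sketch of a pre-flight check. The function name `missing_langs` is hypothetical (it is not part of Tesseract), and it only looks at the languages you list explicitly, so a dependency such as grc has to be added to the list by hand:

```shell
# Hypothetical helper, not a Tesseract command.
# Prints every language in a '+'-separated spec that has no
# <lang>.traineddata file in the given tessdata directory.
# Usage: missing_langs LANG_SPEC TESSDATA_DIR
missing_langs() (
  # The function body runs in a subshell, so changing IFS
  # does not leak into the calling shell.
  IFS='+'
  for lang in $1; do
    [ -f "$2/$lang.traineddata" ] || printf '%s\n' "$lang"
  done
)
```

With the directory listing from my first comment, `missing_langs eng+ell+grc /usr/share/tesseract-ocr/5/tessdata` should print `grc`. Running `tesseract --list-langs` is another way to see which languages are installed.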

micro-test

Using

$ tesseract micro-test.png - -l eng+grc

Is there a better way to recognize the μ greek letter when used in English texts ? Maybe | have to train a
new dataset...