tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Failed Loading Language/Cannot find LSTM-specific dictionaries #155

Closed TheSYNcoder closed 4 years ago

TheSYNcoder commented 4 years ago

I have been training a sample model TESS using tesstrain, and the training went fine. However, after training, when I move /data/TESS/TESS.traineddata to /usr/local/share and run tesseract image.tif out -l TESS, I get the following error:

Error: Tesseract (legacy) engine requested, but components are not present in /usr/local/share/tessdata/TESS.traineddata!!
Failed loading language 'TESS'

On the other hand, when I move /data/TESS.traineddata instead, running the same command gives the following error:

Failed to load any lstm-specific dictionaries for lang TESS!!

Am I doing something wrong after the training? Can anyone please help? In case it helps, here's my tesseract version:

tesseract 5.0.0-alpha-648-gcdebe
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found OpenMP 201511
 Found libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
Shreeshrii commented 4 years ago

data/TESS/TESS.traineddata is the starter traineddata created with the unicharset from training text. It's size should be small. It can't be used for recognition.

data/TESS.traineddata is the traineddata after training. If you didn't have wordlist, you will get a warning about missing dictionary.

Check the timestamps and file sizes. The larger and later file will be your traineddata file.
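
A quick way to do that from the tesstrain directory (a minimal sketch, assuming MODEL_NAME=TESS):

# The starter file should be the small, older one; the final file
# (data/TESS.traineddata) should be larger and more recent.
ls -l data/TESS/TESS.traineddata data/TESS.traineddata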

livezingy commented 4 years ago

@TheSYNcoder

  1. /data/TESS.traineddata should be the right one.
  2. For the cause of the error Failed to load any lstm-specific dictionaries for lang TESS!!, please refer to: Failed to load any lstm-specific dictionaries for lang xxx

Although WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE are optional in the Makefile, the traineddata can also contain information on punctuation, word lists, etc. from training. If these files are missing, the resulting traineddata will give this error when used.

  3. You could try the following steps to solve it:

3.1 Find WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE in the Makefile and change them to:

WORDLIST_FILE := data/$(MODEL_NAME).wordlist
NUMBERS_FILE := data/$(MODEL_NAME).numbers 
PUNC_FILE := data/$(MODEL_NAME).punc

3.2 Suppose your base traineddata is eng.traineddata, i.e. your language is English. Download the .wordlist/.numbers/.punc files from tesseract-ocr/langdata_lstm/eng, rename them to TESS.wordlist, TESS.numbers and TESS.punc, and place them in data/ (see the command sketch after these steps).

3.3 Run make training again.
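
Putting 3.2 and 3.3 together, the commands could look roughly like this (a sketch, not from the thread: the raw.githubusercontent.com URLs, the main branch name, and START_MODEL=eng are assumptions to adjust for your setup):

# Fetch the English dictionary files and rename them for the TESS model
# (assumes eng.wordlist/eng.numbers/eng.punc exist in langdata_lstm/eng
# and that the default branch is "main").
for ext in wordlist numbers punc; do
  wget -O data/TESS.$ext \
    https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/main/eng/eng.$ext
done

# Re-run training so the dictionaries are packed into data/TESS.traineddata.
make training MODEL_NAME=TESS START_MODEL=eng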

@Shreeshrii I think there may be a bug with WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE in the Makefile.

In tesstrain, the default path for the above WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE is $(OUTPUT_DIR) = data/$(MODEL_NAME), and all files in this path are generated automatically during training.

If the variable START_MODEL is not assigned, the Makefile will not generate any related files under this path.

If the variable START_MODEL has been assigned, foo.lstm-number-dawg, foo.lstm-punc-dawg, foo.lstm-word-dawg and so on will be produced in data/$(MODEL_NAME). But these are not the files the traineddata needs; it needs the .wordlist/.numbers/.punc files. So there may be a bug in the tesstrain Makefile.
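
To check which of the two sets of files is actually present, something like this may help (a sketch, assuming MODEL_NAME=TESS and the file names described above):

# Dawg files generated under data/TESS when START_MODEL is set:
ls -l data/TESS/TESS.lstm-*-dawg
# Plain-text dictionary files the default Makefile variables point to:
ls -l data/TESS/TESS.wordlist data/TESS/TESS.numbers data/TESS/TESS.punc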

Am I right?

wrznr commented 4 years ago

@TheSYNcoder Please move TESS.traineddata to /usr/local/share/tessdata/ (as indicated by the error message).
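
For example (a minimal sketch; sudo is only needed if your user cannot write to /usr/local/share/tessdata):

sudo mv data/TESS.traineddata /usr/local/share/tessdata/
tesseract image.tif out -l TESS

# Alternative: leave the file where it is and point tesseract at the
# directory containing TESS.traineddata instead of moving it.
tesseract --tessdata-dir data image.tif out -l TESS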

It is safe to ignore the message Failed to load any lstm-specific dictionaries for lang TESS!!; dictionaries are an optional addition to tesseract models. Personally, I never use them when training my own models. I do not see any benefits.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.