tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0
626 stars 180 forks

Failed to load any lstm-specific dictionaries for lang xxx #28

Closed courao closed 5 years ago

courao commented 6 years ago

I met this problem after training a model with OCRD. In the terminal I ran: tesseract 5.2.tif output --psm 7 -l xxx and got this message:

Failed to load any lstm-specific dictionaries for lang tes!!
Tesseract Open Source OCR Engine v4.0.0-beta.4-138-g2093 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.

Can anyone help?
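As a side note, the DPI warning (which is separate from the dictionary message) can be avoided by giving tesseract an explicit resolution via the `--dpi` flag. A dry-run sketch, using the file and model names from the thread; the command is only printed here since the image and model are not available:

```shell
# Build the invocation from the thread with an explicit --dpi so the
# "Invalid resolution 0 dpi" warning does not appear for images whose
# metadata lacks a resolution. Printed as a dry run; run the command
# directly once the image and model are in place.
IMG=5.2.tif
MODEL=tes
CMD="tesseract $IMG output --psm 7 --dpi 300 -l $MODEL"
echo "$CMD"
```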

kba commented 6 years ago

Is xxx a placeholder? Because the error message implies you're using a model named tes.

Have you placed your model data in the TESSDATA directory for tesseract to find?

courao commented 6 years ago

xxx means tes here; it's a model I trained with ocrd. It actually produces results for the input image in the output.txt file, but I still want to know the reason for this problem and how to avoid it.

courao commented 6 years ago

In addition, I compared generated .traineddata with the original one:

coura@coura-pc:~/tess_test/ocrd-train$ combine_tessdata -d data/eng.traineddata  
Version string:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
coura@coura-pc:~/tess_test/ocrd-train$ combine_tessdata -d data/tes.traineddata
Version string:4.0.0-beta.4-138-g2093
17:lstm:size=4063123, offset=192
21:lstm-unicharset:size=1034, offset=4063315
22:lstm-recoder:size=148, offset=4064349
23:version:size=22, offset=4064497

I find the generated one seems to be missing some dawg files. How do I add them?
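A quick way to spot which dictionary components are absent is to filter the `combine_tessdata -d` listing. In this sketch the listing from the thread is inlined as a string so the check runs even without tesseract installed; in practice you would pipe `combine_tessdata -d data/tes.traineddata` instead:

```shell
# Detect which lstm dictionary components a traineddata file lacks.
# The listing below is the output quoted in the thread, inlined for
# demonstration; replace it with the real tool output in practice.
listing='17:lstm:size=4063123, offset=192
21:lstm-unicharset:size=1034, offset=4063315
22:lstm-recoder:size=148, offset=4064349
23:version:size=22, offset=4064497'
missing=""
for comp in lstm-punc-dawg lstm-word-dawg lstm-number-dawg; do
  printf '%s\n' "$listing" | grep -q "$comp:" || missing="$missing $comp"
done
echo "missing:$missing"
```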

kba commented 6 years ago

In addition to the charset, a traineddata file can also contain information on punctuation, word lists etc. (https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#40000alpha-lstm-only-format). We don't currently support those. Tesseract tries to load the dictionaries, fails, but still continues the recognition (https://digi.bib.uni-mannheim.de/tesseract/doc/tesseract-ocr.github.io/4.00.00dev/a01046_source.html#l00131).

@wrznr wontfix or helpwanted?

wrznr commented 6 years ago

Neither! I guess we can fix this (i.e. support dictionaries). https://github.com/paalberti/tesseract-dan-fraktur/blob/master/deu_frak/buildscript.sh is a good starting point for augmenting the makefile.

marisancans commented 5 years ago

Does this affect the recognition in any way if I don't have these dawg files and the other files?

wrznr commented 5 years ago

Of course, the recognition is heavily influenced by the existence (or non-existence) of dictionaries. Is this influence necessarily positive? I do not think so. Using dictionaries as hypotheses in text recognition bears the risk of introducing false positives (e.g. the German city Rust might be returned as the noun Rost if it is not in the dictionary). However, dictionaries were very important for the old, character-focused recognizer (tesseract version < 4) since they provided the necessary context for the single characters. With the line-focused (LSTM) approach, context information is implicitly provided by the model. By the way, I am not aware of any systematic evaluation of dictionary usage in OCR.

Shreeshrii commented 5 years ago

While finetuning, I usually rebuild the starter traineddata file and include the wordlists at that time so that they can be used for building the dawg files. I have not tested the efficacy of OCR with vs without the dawg files.

Here is the section of bash script from a recent run for Arabic.

if [ $MergeData = "yes" ]; then

    echo "#### combine_tessdata to extract lstm model from 'tessdata_best' for $BaseLang ####"
    combine_tessdata -u $bestdata_dir/$BaseLang.traineddata $bestdata_dir/$BaseLang.
    combine_tessdata -u $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.

    echo "#### build version string ####"
    Version_Str="$Lang:ara`date +%Y%m%d`:from:"
    sed -e "s/^/$Version_Str/" $bestdata_dir/$BaseLang.version > $layer_output_dir/$Lang.new.version

    echo "#### This cleans out all previous checkpoints for training ####"
    rm -rf $layered_output_dir
    mkdir -p $layered_output_dir

    echo "#### merge unicharsets to ensure all existing chars are included ####"
    merge_unicharsets \
        $bestdata_dir/$BaseLang.lstm-unicharset \
        $langdata_dir/$Lang/$Lang.zwnj.unicharset \
        $layer_output_dir/$Lang/$Lang.unicharset \
        $layered_output_dir/$Lang.continue.unicharset

    echo "#### rebuild starter traineddata using the merged unicharset ####"
    combine_lang_model \
        --input_unicharset $layered_output_dir/$Lang.continue.unicharset \
        --script_dir $langdata_dir \
        --words $langdata_dir/$Lang/$Lang.wordlist \
        --numbers $langdata_dir/$Lang/$Lang.numbers \
        --puncs $langdata_dir/$Lang/$Lang.punc \
        --output_dir $layered_output_dir \
        --pass_through_recoder \
        --lang_is_rtl \
        --lang $Lang \
        --version_str `cat $layer_output_dir/$Lang.new.version`

fi

trinitybest commented 5 years ago

Hi everyone, I have the same issue:

Failed to load any lstm-specific dictionaries for lang modelb!!
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 575

Any update on this, please?

kba commented 5 years ago

If you need word/number/punctuation lists and have them available, you could adapt https://github.com/OCR-D/ocrd-train/blob/master/Makefile#L123-L127. See @Shreeshrii's sample code above.
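Adapting those Makefile lines amounts to extending the existing combine_lang_model call with the dictionary flags. A dry-run sketch, where the data/ layout and the model name tes are assumptions:

```shell
# Assemble the flags the Makefile passes, plus the dictionary inputs.
# The data/ paths and the model name "tes" are assumptions; drop the
# echo to actually execute once the wordlist files exist.
MODEL_NAME=tes
set -- --input_unicharset data/unicharset \
       --script_dir data/ \
       --words "data/$MODEL_NAME.wordlist" \
       --numbers "data/$MODEL_NAME.numbers" \
       --puncs "data/$MODEL_NAME.punc" \
       --output_dir data/ \
       --lang "$MODEL_NAME"
echo combine_lang_model "$@"   # dry run
```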

cooleel commented 5 years ago

> While finetuning, I usually rebuild the starter traineddata file and include the wordlists at that time so that they can be used for building the dawg files. […]

Thank you very much for your fantastic work on OCRD-train; I really appreciate it. I have been doing Tesseract OCR training recently and OCRD-train has helped a lot. I also met the same issue: Failed to load any lstm-specific dictionaries for lang MYMODEL(placeholder)!! This sample code is a bit confusing to me. I assume $BaseLang is the model we started training from (eng.traineddata in my case) and $Lang is our model. Why would we have $Lang.traineddata in $bestdata_dir even before training? Or should this piece of code run after training, with the sole purpose of merging the data so that our model gets the numbers/puncs/wordlist files? If so, should we move our trained model to $bestdata_dir to follow the code? I am also not quite sure about $layer_output_dir vs. $layered_output_dir. I know this sample code may come from another scenario, but could you please point out the equivalent directories in the Makefile?

As for the starter traineddata (e.g. eng.traineddata in my case), I already have all these files (lstm-punc-dawg/lstm-word-dawg/lstm-number-dawg); it seems the Makefile has already done this for us automatically.

Any suggestions or hints would be greatly appreciated.

Shreeshrii commented 5 years ago

As mentioned in https://github.com/OCR-D/ocrd-train/issues/28#issuecomment-459250817, you need to modify the combine_lang_model command.

    combine_lang_model \
      --input_unicharset data/unicharset \
      --script_dir data/ \
      --output_dir data/ \
      --lang $(MODEL_NAME)

You need to add the following to the above, with paths to wherever your wordlist, numbers, and punc files are:

--words $langdata_dir/$Lang/$Lang.wordlist \
--numbers $langdata_dir/$Lang/$Lang.numbers \
--puncs $langdata_dir/$Lang/$Lang.punc \

The reason I had different BaseLang and Lang was: BaseLang=Arabic, Lang=ara.

Arabic.traineddata had better recognition for Arabic punctuation, so I wanted to use the lstm file from it. It covers both Arabic and English and I did not want to use the wordlists with both languages, hence I used the ones for ara.

combine_lang_model --help should display the syntax.
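The --words/--numbers/--puncs inputs are plain text files with one entry per line. A minimal, hypothetical sketch of deriving a wordlist from tesstrain-style ground-truth files; the demo input, paths, and the tes name are assumptions, and curated lists from tesseract-ocr/langdata are usually a better source:

```shell
# Create a demo ground-truth file, then derive a de-duplicated
# wordlist (one word per line) from all *.gt.txt files.
# Paths and names are assumptions for illustration only.
mkdir -p data
printf 'Hello world 42\nworld peace\n' > data/sample.gt.txt  # demo input
cat data/*.gt.txt | tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u \
  > data/tes.wordlist
```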

cooleel commented 5 years ago

> As mentioned in #28 (comment), you need to modify the combine_lang_model command. […]

It's working now. Thank you very much!