courao closed this issue 5 years ago
Is xxx a placeholder? The error message implies you're using a model named tes. Have you placed your model data in the TESSDATA directory for tesseract to find?
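A quick way to check this is a small shell helper. This is only an illustrative sketch: `check_model` is not a Tesseract tool, the default path is an assumption, and it assumes Tesseract 4.x semantics, where TESSDATA_PREFIX points at the tessdata directory itself.

```shell
# Illustrative helper: report whether <model>.traineddata exists in a
# given tessdata directory.
check_model() {
    local dir="$1" model="$2"
    if [ -f "$dir/$model.traineddata" ]; then
        echo "found"
    else
        echo "missing"
    fi
}

# Typical use (the fallback path is an assumption; adjust to your install):
# check_model "${TESSDATA_PREFIX:-/usr/share/tessdata}" tes
```

If the file is present there, `tesseract --list-langs` should also list the model.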
xxx stands for tes here; it's a model I trained with ocrd-train. It does actually produce a result for the input image in output.txt, but I still want to know the reason for this problem and how to avoid it.
In addition, I compared the generated .traineddata with the original one:

```
coura@coura-pc:~/tess_test/ocrd-train$ combine_tessdata -d data/eng.traineddata
Version string:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
coura@coura-pc:~/tess_test/ocrd-train$ combine_tessdata -d data/tes.traineddata
Version string:4.0.0-beta.4-138-g2093
17:lstm:size=4063123, offset=192
21:lstm-unicharset:size=1034, offset=4063315
22:lstm-recoder:size=148, offset=4064349
23:version:size=22, offset=4064497
```
The generated one seems to be missing some dawg files. How do I add them?
In addition to the charset, traineddata files can also contain information on punctuation, word lists, etc. (https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#40000alpha-lstm-only-format). We don't currently support those. Tesseract tries to load the dictionaries, fails, but still continues the recognition (https://digi.bib.uni-mannheim.de/tesseract/doc/tesseract-ocr.github.io/4.00.00dev/a01046_source.html#l00131).
@wrznr wontfix or helpwanted?
Neither! I guess we can fix this (i.e. support dictionaries). https://github.com/paalberti/tesseract-dan-fraktur/blob/master/deu_frak/buildscript.sh is a good starting point for augmenting the makefile.
Does this affect the recognition in any way if I have all the other files but not these dawg files?
Of course, the recognition is heavily influenced by the existence (or non-existence) of dictionaries. Is this influence necessarily positive? I do not think so. Using dictionaries as hypotheses in text recognition bears the risk of introducing false positives (e.g. the German city Rust might be returned as the noun Rost if it is not in the dictionary). However, dictionaries were very important for the old, character-focused recognizer (Tesseract version < 4), since they provided the necessary context for the single characters. With the line-focused (LSTM) approach, context information is implicitly provided by the model. By the way, I am not aware of any systematic evaluation of dictionary usage in OCR.
While finetuning, I usually rebuild the starter traineddata file and include the wordlists at that time so that they can be used for building the dawg files. I have not tested the efficacy of OCR with vs without the dawg files.
Here is the section of bash script from a recent run for Arabic.
```shell
if [ $MergeData = "yes" ]; then
    echo "#### combine_tessdata to extract lstm model from 'tessdata_best' for $BaseLang ####"
    combine_tessdata -u $bestdata_dir/$BaseLang.traineddata $bestdata_dir/$BaseLang.
    combine_tessdata -u $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.

    echo "#### build version string ####"
    Version_Str="$Lang:ara`date +%Y%m%d`:from:"
    sed -e "s/^/$Version_Str/" $bestdata_dir/$BaseLang.version > $layer_output_dir/$Lang.new.version

    echo "#### This cleans out all previous checkpoints for training ####"
    rm -rf $layered_output_dir
    mkdir -p $layered_output_dir

    echo "#### merge unicharsets to ensure all existing chars are included ####"
    merge_unicharsets \
        $bestdata_dir/$BaseLang.lstm-unicharset \
        $langdata_dir/$Lang/$Lang.zwnj.unicharset \
        $layer_output_dir/$Lang/$Lang.unicharset \
        $layered_output_dir/$Lang.continue.unicharset

    echo "#### rebuild starter traineddata using the merged unicharset ####"
    combine_lang_model \
        --input_unicharset $layered_output_dir/$Lang.continue.unicharset \
        --script_dir $langdata_dir \
        --words $langdata_dir/$Lang/$Lang.wordlist \
        --numbers $langdata_dir/$Lang/$Lang.numbers \
        --puncs $langdata_dir/$Lang/$Lang.punc \
        --output_dir $layered_output_dir \
        --pass_through_recoder \
        --lang_is_rtl \
        --lang $Lang \
        --version_str `cat $layer_output_dir/$Lang.new.version`
fi
```
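After rebuilding the starter traineddata this way, one can confirm that the dictionary components actually made it in by inspecting the listing that `combine_tessdata -d` prints. A minimal sketch; the `missing_dawgs` helper is illustrative and not part of the Tesseract tools:

```shell
# Illustrative helper: read a `combine_tessdata -d` listing on stdin and
# print any of the lstm dictionary components absent from it.
missing_dawgs() {
    local listing comp
    listing=$(cat)
    for comp in lstm-punc-dawg lstm-word-dawg lstm-number-dawg; do
        case "$listing" in
            *"$comp"*) ;;        # component appears in the listing
            *) echo "$comp" ;;   # component is not listed: missing
        esac
    done
}

# Typical use (requires the tesseract training tools to be installed):
# combine_tessdata -d "$layered_output_dir/$Lang.traineddata" | missing_dawgs
```

Run against the tes.traineddata listing shown earlier in this thread, this would report all three dawg components as missing.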
Hi everyone, I have the same issue:

```
Failed to load any lstm-specific dictionaries for lang modelb!!
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 575
```

Any update on this, please?
If you need word/number/punctuation lists and have them available, you could adapt https://github.com/OCR-D/ocrd-train/blob/master/Makefile#L123-L127. See @Shreeshrii's sample code above.
Thank you very much for your fantastic work on ocrd-train, I really appreciate it. I have recently been doing Tesseract OCR training, and ocrd-train has helped a lot.
I also ran into the same issue:

```
Failed to load any lstm-specific dictionaries for lang MYMODEL!!
```

(MYMODEL is a placeholder.) The sample code above is a bit confusing to me, though. I assume $BaseLang is the model we started training from (eng.traineddata in my case) and $Lang is our own model. Why would $Lang.traineddata already exist in $bestdata_dir before training? Or is this piece of code meant to run after training, its only purpose being to merge the data so that our model gets the numbers/puncs/wordlist components? If so, should we move our trained model into $bestdata_dir to follow the code?
I am also not quite sure about $layer_output_dir versus $layered_output_dir. I know this sample code may come from a different scenario, but could you please point out the equivalent directories in the Makefile?
As for the starter traineddata (eng.traineddata in my case), it already contains the lstm-punc-dawg, lstm-word-dawg and lstm-number-dawg components, so it seems the Makefile has already done this for us automatically.
Any suggestions or hints would be greatly appreciated.
As mentioned in https://github.com/OCR-D/ocrd-train/issues/28#issuecomment-459250817, you need to modify the combine_lang_model command.
```shell
combine_lang_model \
    --input_unicharset data/unicharset \
    --script_dir data/ \
    --output_dir data/ \
    --lang $(MODEL_NAME)
```
You need to add the following flags to the above, with paths to where your wordlist, punc and numbers files are:
```shell
--words $langdata_dir/$Lang/$Lang.wordlist \
--numbers $langdata_dir/$Lang/$Lang.numbers \
--puncs $langdata_dir/$Lang/$Lang.punc \
```
The reason I had a different BaseLang and Lang was: BaseLang=Arabic, Lang=ara. Arabic.traineddata had better recognition for Arabic punctuation, so I wanted to use the lstm file from it. It covers both Arabic and English, and I did not want to use the wordlists for both languages, hence I used the ones for ara.
`combine_lang_model --help` should display the syntax.
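Putting the base command and the extra flags together, the adapted call might look like the following sketch. The `data/$(MODEL_NAME).wordlist`, `.numbers` and `.punc` paths are assumptions for illustration; put your lists wherever your setup keeps them.

```shell
combine_lang_model \
    --input_unicharset data/unicharset \
    --script_dir data/ \
    --words data/$(MODEL_NAME).wordlist \
    --numbers data/$(MODEL_NAME).numbers \
    --puncs data/$(MODEL_NAME).punc \
    --output_dir data/ \
    --lang $(MODEL_NAME)
```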
It's working now. Thank you very much!
I met this problem after training a model with ocrd-train. In the terminal I ran:

```
tesseract 5.2.tif output --psm 7 -l xxx
```

and I got this error message. Can anyone help?