Your output has extra spaces after each letter. Take a look at https://github.com/tesseract-ocr/tesseract/issues/1009 and see if that fixes the issue.
tesstrain.sh expects a standard name for the word list (look at tesstrain_utils.sh), so if you want to change the word list, save the old one under a different name and update the list in the langdata folder.
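For example, a minimal sketch of that workaround (the custom wordlist path is a placeholder; adjust to your setup):

```sh
# Back up the shipped wordlist and install the custom list under the standard
# name that tesstrain.sh expects (chi_sim.wordlist in langdata/chi_sim).
cd ../langdata/chi_sim
mv chi_sim.wordlist chi_sim.wordlist.orig
cp /path/to/my_wordlist.txt chi_sim.wordlist
```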
Thanks, @Shreeshrii! But let me explain the difference:
> Your output has extra spaces after each letter. Take a look at #1009 and see if that fixes the issue.

It is not a problem of spaces; it is a problem of segmenting characters into words. In English, 'T h i s i s a s a m p l e' is combined into 'This is a sample' very well with the BEST eng model. Similarly in Chinese (and other languages), '这 是 一 个 例 子' should be combined into '这 是 一个 例子' (with '一个' and '例子' grouped as words), and the BEST chi_sim model does this well. But following the training guide, I trained several models with the fine-tune and cut-off-top-layer methods, and all of them output single recognized characters that are never combined into words. I checked the training code and found that it assigns a default wordlist file, so the output should contain some words, but I cannot find a single word in any of the output. Looking forward to your further advice, thanks!
> tesstrain.sh expects a standard name for the word list (look at tesstrain_utils.sh), so if you want to change the word list, save the old one under a different name and update the list in the langdata folder.

I will try again.
@Shreeshrii About the wordlist: I replaced the original one with my updated version, but it seems to make no difference whatever the wordlist is; the output is still single characters, not combined into meaningful words. My question is: the lstmf files are generated before the step that uses the assigned wordlist file (and I can add more lstmf files in training_files.txt), so what is the purpose of the step below, which brings the wordlist into training?

```
=== Constructing LSTM training data ===
[2018年 01月 11日 星期四 06:45:41 UTC] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.VZHJNc6Cd7/chi_sim/chi_sim.unicharset --script_dir ../langdata ## ### --words ../langdata/chi_sim/chi_sim.wordlist --numbers ../langdata/chi_sim/chi_sim.numbers --puncs ../langdata/chi_sim/chi_sim.punc --output_dir ../tesstutorial/chi_sim_train_ccb --lang chi_sim
Loaded unicharset of size 5012 from file /tmp/tmp.VZHJNc6Cd7/chi_sim/chi_sim.unicharset
```
The lstmf files keep the training images and corresponding text, based on the Unicode fonts and training text used.
A wordlist is NOT a mandatory part of the language model, but it can be helpful in improving recognition.
Even with your own wordlist, there is no feature, as far as I know, to restrict the OCR output to match ONLY words in a wordlist.
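For completeness, a sketch of how the dictionary's influence can be nudged (not restricted) at recognition time; the parameter values here are purely illustrative, and the exact defaults and effect can vary by Tesseract version:

```sh
# Hedged example: these parameters penalize candidate words that are not in the
# loaded dictionary (the dawg built from the wordlist). Raising them biases the
# output toward dictionary words but never forbids non-dictionary output.
tesseract image.png out -l chi_sim \
  -c language_model_penalty_non_dict_word=0.5 \
  -c language_model_penalty_non_freq_dict_word=0.3
```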
Regarding the output of single characters, I already pointed out the config variable you can use so that words are demarcated rather than characters. It is up to you to try it.
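If the variable in question is preserve_interword_spaces (an assumption on my part; #1009 has the details), a minimal usage sketch would be:

```sh
# Assumed variable: preserve_interword_spaces=1 keeps the spacing decided by
# layout analysis instead of emitting a space between every recognized glyph,
# which matters for CJK text.
tesseract image.png out -l chi_sim -c preserve_interword_spaces=1
```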
Environment
```
root@59bd0bc3c863:/home/workspace/tesseract# tesseract --version
tesseract 4.00.00alpha
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
```

Ubuntu 16.04, x86_64
Current Behavior:
I can train a model with:
```sh
mkdir -p ../tesstutorial/chi_sim_train_ccb
training/tesstrain.sh \
  --fonts_dir /root/workspace/fonts_lib/ \
  --lang chi_sim \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ../langdata \
  --tessdata_dir /usr/local/share/tessdata \
  --output_dir ../tesstutorial/chi_sim_train_ccb \
  --training_text ../langdata/chi_sim/chi_sim_ccb.training_text \
  --fontlist "SimSun"
```
With the model I trained, all recognized text (Chinese text in my case) looks like:

```
==> T h i s i s a s a m p l e 这 是 一 个 例 子
```

But with the model I downloaded from best (tessdata_best), the result looks like:

```
==> This is a sample 这 是 一个 例子
```

I guess the dictionary may not be used, and I want to use a customized dictionary when building the training dataset. According to https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh, another wordlist can be specified:
> [lang]/[lang].wordlist (alternatively this can be specified on the command line with --wordlist /path/to/wordlist)
So I used this command to create the dataset with my own wordlist (model_list.txt):
```sh
mkdir -p ../tesstutorial/chi_sim_train_ccb
training/tesstrain.sh \
  --fonts_dir /root/workspace/fonts_lib/ \
  --lang chi_sim \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ../langdata \
  --tessdata_dir /usr/local/share/tessdata \
  --output_dir ../tesstutorial/chi_sim_train_ccb \
  --training_text ../langdata/chi_sim/chi_sim_ccb.training_text \
  --fontlist "SimSun" \
  --wordlist ../langdata/chi_sim/model_list.txt   # <== seems not to work
```
The script ran successfully, but I found the following lines in the output:
```
=== Constructing LSTM training data ===
[2018年 01月 11日 星期四 06:45:41 UTC] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.VZHJNc6Cd7/chi_sim/chi_sim.unicharset --script_dir ../langdata ## ### --words ../langdata/chi_sim/chi_sim.wordlist --numbers ../langdata/chi_sim/chi_sim.numbers --puncs ../langdata/chi_sim/chi_sim.punc --output_dir ../tesstutorial/chi_sim_train_ccb --lang chi_sim
Loaded unicharset of size 5012 from file /tmp/tmp.VZHJNc6Cd7/chi_sim/chi_sim.unicharset
```

(See the `--words ../langdata/chi_sim/chi_sim.wordlist` argument.)
It seems the default chi_sim.wordlist file is still being used.
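One way to check what actually ended up in the generated starter traineddata is to list its components; a sketch, assuming the output path produced by the command above (it may differ on your setup):

```sh
# combine_tessdata -d lists the components packed into a traineddata file; the
# *-dawg entries come from the wordlist/numbers/punc files that
# combine_lang_model was given, so this shows whether a dictionary was built.
combine_tessdata -d ../tesstutorial/chi_sim_train_ccb/chi_sim/chi_sim.traineddata
```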
Expected Behavior:
Use my own wordlist for training.
Suggested Fix:
Print which wordlist is actually being used.
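For illustration, the kind of diagnostic the training scripts could print; the variable name below is an assumption, and the real one in tesstrain_utils.sh may differ:

```sh
# Hypothetical sketch: a one-line diagnostic emitted just before
# combine_lang_model is invoked (WORDLIST_FILE is an assumed variable name).
echo "Using wordlist: ${WORDLIST_FILE}"
```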