Your output has extra spaces after each letter. Take a look at https://github.com/tesseract-ocr/tesseract/issues/1009 and see if that fixes the issue.
tesstrain.sh expects a standard name for the word list (look at tesstrain_utils.sh), so if you want to change the word list, save the old one under a different name and update the list in the langdata folder.
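For example, a minimal sketch of that workaround (the custom wordlist path is a placeholder; adjust to your setup):

```sh
# Back up the shipped wordlist and install the custom list under the standard
# name that tesstrain.sh expects (chi_sim.wordlist in langdata/chi_sim).
cd ../langdata/chi_sim
mv chi_sim.wordlist chi_sim.wordlist.orig
cp /path/to/my_wordlist.txt chi_sim.wordlist
```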
Thanks, @Shreeshrii! But let me explain the difference:
> Your output has extra spaces after each letter. Take a look at #1009 and see if that fixes the issue.

It is not a problem of spaces; it is a problem of segmenting characters into words. In English, 'T h i s i s a s a m p l e' is combined into 'This is a sample' very well with the BEST eng model. Similarly in Chinese (and other languages), '这 是 一 个 例 子' should be combined into '这 是 一个 例子' (with '一个' and '例子' grouped as words), and the BEST chi_sim model does this well. But following the training guide, I trained several models with the fine-tune and cut-off-top-layer methods, and all of them output single recognized characters that are never combined into words. I checked the training code and found that it assigns a default wordlist file, so the output should contain some words, but I cannot find a single word in any of the output. Looking forward to your further advice, thanks!
> tesstrain.sh expects a standard name for the word list (look at tesstrain_utils.sh), so if you want to change the word list, save the old one under a different name and update the list in the langdata folder.

I will try again.
@Shreeshrii About the wordlist: I replaced the original one with my updated version, but it seems to make no difference whatever the wordlist is; the output is still single characters, not combined into meaningful words. My question is: the lstmf files are generated before the step that uses the assigned wordlist file (and I can add more lstmf files in training_files.txt), so what is the purpose of the step below, which brings the wordlist into training?

```
=== Constructing LSTM training data ===
[2018年 01月 11日 星期四 06:45:41 UTC] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.VZHJNc6Cd7/chi_sim/chi_sim.unicharset --script_dir ../langdata ## ### --words ../langdata/chi_sim/chi_sim.wordlist --numbers ../langdata/chi_sim/chi_sim.numbers --puncs ../langdata/chi_sim/chi_sim.punc --output_dir ../tesstutorial/chi_sim_train_ccb --lang chi_sim
Loaded unicharset of size 5012 from file /tmp/tmp.VZHJNc6Cd7/chi_sim/chi_sim.unicharset
```
The lstmf files keep the training images and corresponding text, based on the Unicode fonts and training text used.
A wordlist is NOT a mandatory part of the language model, but it can be helpful in improving recognition.
Even with your own wordlist, there is no feature, as far as I know, to restrict the OCR output to match ONLY words in a wordlist.
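For completeness, a sketch of how the dictionary's influence can be nudged (not restricted) at recognition time; the parameter values here are purely illustrative, and the exact defaults and effect can vary by Tesseract version:

```sh
# Hedged example: these parameters penalize candidate words that are not in the
# loaded dictionary (the dawg built from the wordlist). Raising them biases the
# output toward dictionary words but never forbids non-dictionary output.
tesseract image.png out -l chi_sim \
  -c language_model_penalty_non_dict_word=0.5 \
  -c language_model_penalty_non_freq_dict_word=0.3
```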
Regarding the output of single characters, I already pointed out the config variable you can use so that words are demarcated rather than characters. It is up to you to try it.
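If the variable in question is preserve_interword_spaces (an assumption on my part; #1009 has the details), a minimal usage sketch would be:

```sh
# Assumed variable: preserve_interword_spaces=1 keeps the spacing decided by
# layout analysis instead of emitting a space between every recognized glyph,
# which matters for CJK text.
tesseract image.png out -l chi_sim -c preserve_interword_spaces=1
```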
Environment
```
root@59bd0bc3c863:/home/workspace/tesseract# tesseract --version
tesseract 4.00.00alpha
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
```

Ubuntu 16.04, x86_64
Current Behavior:
I can train a model with:
```sh
mkdir -p ../tesstutorial/chi_sim_train_ccb
training/tesstrain.sh \
  --fonts_dir /root/workspace/fonts_lib/ \
  --lang chi_sim \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ../langdata \
  --tessdata_dir /usr/local/share/tessdata \
  --output_dir ../tesstutorial/chi_sim_train_ccb \
  --training_text ../langdata/chi_sim/chi_sim_ccb.training_text \
  --fontlist "SimSun"
```
With the model I trained, all recognized text (Chinese text in my case) looks like:

```
==> T h i s i s a s a m p l e 这 是 一 个 例 子
```

But with the model I downloaded from best (tessdata_best), the result looks like:

```
==> This is a sample 这 是 一个 例子
```

I guess the dictionary may not be used, and I want to use a customized dictionary when building the training dataset. According to https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh, another wordlist can be specified:
> [lang]/[lang].wordlist (alternatively this can be specified on the command line with --wordlist /path/to/wordlist)
So I used this command to create the dataset with my own wordlist (model_list.txt):
```sh
mkdir -p ../tesstutorial/chi_sim_train_ccb
training/tesstrain.sh \
  --fonts_dir /root/workspace/fonts_lib/ \
  --lang chi_sim \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ../langdata \
  --tessdata_dir /usr/local/share/tessdata \
  --output_dir ../tesstutorial/chi_sim_train_ccb \
  --training_text ../langdata/chi_sim/chi_sim_ccb.training_text \
  --fontlist "SimSun" \
  --wordlist ../langdata/chi_sim/model_list.txt   # <== seems not to work
```
The script ran successfully, but I found the following lines in the output:
```
=== Constructing LSTM training data ===
[2018年 01月 11日 星期四 06:45:41 UTC] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.VZHJNc6Cd7/chi_sim/chi_sim.unicharset --script_dir ../langdata ## ### --words ../langdata/chi_sim/chi_sim.wordlist --numbers ../langdata/chi_sim/chi_sim.numbers --puncs ../langdata/chi_sim/chi_sim.punc --output_dir ../tesstutorial/chi_sim_train_ccb --lang chi_sim
Loaded unicharset of size 5012 from file /tmp/tmp.VZHJNc6Cd7/chi_sim/chi_sim.unicharset
```

(See the `--words ../langdata/chi_sim/chi_sim.wordlist` argument.)
It seems the default chi_sim.wordlist file is still being used.
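One way to check what actually ended up in the generated starter traineddata is to list its components; a sketch, assuming the output path produced by the command above (it may differ on your setup):

```sh
# combine_tessdata -d lists the components packed into a traineddata file; the
# *-dawg entries come from the wordlist/numbers/punc files that
# combine_lang_model was given, so this shows whether a dictionary was built.
combine_tessdata -d ../tesstutorial/chi_sim_train_ccb/chi_sim/chi_sim.traineddata
```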
Expected Behavior:
Use my own wordlist for training.
Suggested Fix:
Print which wordlist is actually being used.
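For illustration, the kind of diagnostic the training scripts could print; the variable name below is an assumption, and the real one in tesstrain_utils.sh may differ:

```sh
# Hypothetical sketch: a one-line diagnostic emitted just before
# combine_lang_model is invoked (WORDLIST_FILE is an assumed variable name).
echo "Using wordlist: ${WORDLIST_FILE}"
```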