Closed Shreeshrii closed 5 years ago
How can I reproduce this locally?
You can try:
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata_lstm --maxpages 5 \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
Also see my PR at https://github.com/tesseract-ocr/tesseract/pull/2511, which is a proposed solution for this.
Please use chi_sim instead of eng in the above command
text2image fails because several fonts are missing in my installation: AR PL UMing Patched Light, Arial Unicode MS, Arial Unicode MS Bold and WenQuanYi Zen Hei Medium. Do you know where I can get those fonts or can I replace them by different ones (Arial Unicode MS seems to be commercial)?
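A quick way to see which of the required fonts are missing is to compare them against a saved font listing. The sketch below is only an illustration: `missing_fonts` is a made-up helper name, and it assumes the listing has one font per line, optionally prefixed with an index like `  0: ` (which is how I understand `text2image --list_available_fonts` prints it).

```shell
#!/bin/bash
# Print every requested font name that does not appear in a font listing.
# $1: file with one font name per line, optionally prefixed by "N: "
#     (assumed format of the text2image font listing).
# remaining args: exact font names to look for.
missing_fonts() {
  local listing="$1"; shift
  local names font
  # Strip the assumed "N: " index prefix and any trailing whitespace.
  names="$(sed -E 's/^[[:space:]]*[0-9]+:[[:space:]]*//; s/[[:space:]]+$//' "$listing")"
  for font in "$@"; do
    # -x: whole-line match, -F: fixed string, so "Arial Unicode MS"
    # does not match the "Arial Unicode MS Bold" line.
    if ! grep -qxF -- "$font" <<<"$names"; then
      printf '%s\n' "$font"
    fi
  done
}
```

Possible usage, assuming the listing was saved with something like `text2image --list_available_fonts --fonts_dir /usr/share/fonts > fonts.txt 2>&1`: `missing_fonts fonts.txt "Arial Unicode MS" "WenQuanYi Zen Hei Medium"`.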
I did the test run with only a few fonts by adding --fontlist 'Font 1' 'Font 2' to the tesstrain.sh command.
Even with just one font that covers both English and Han characters, you should see the errors related to both unicharsets.
OK. I am at my PC now; here are the exact commands used and the log...
#!/bin/bash
# arrow training text uses head -400, tail -500 from langdata_lstm training_text
# plus text overloaded with up and down arrows which are for adding to unicharset
# using only two fonts matching with okfonts
rm -rf ~/tesstutorial/chi_sim_arrow
~/tesseract/src/training/tesstrain.sh \
--fonts_dir ~/.fonts \
--training_text ./chi_sim.arrow.training_text \
--langdata_dir ~/langdata_lstm \
--tessdata_dir ~/tessdata_best \
--lang chi_sim --linedata_only \
--noextract_font_properties \
--exposures "0" \
--maxpages 0 \
--workspace_dir ~/tmp \
--save_box_tiff \
--fontlist \
"Arial Unicode MS" \
"WenQuanYi Zen Hei Medium" \
--output_dir ~/tesstutorial/chi_sim_arrow
Relevant portion from console log:
=== Phase UP: Generating unicharset and unichar properties files ===
[Thu Jun 20 04:45:26 UTC 2019] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset --norm_mode 1 /tmp/chi_sim-2019-06-20.GWh/chi_sim.Arial_Unicode_MS.exp0.box /tmp/chi_sim-2019-06-20.GWh/chi_sim.WenQuanYi_Zen_Hei_Medium.exp0.box
Extracting unicharset from box file /tmp/chi_sim-2019-06-20.GWh/chi_sim.Arial_Unicode_MS.exp0.box
Extracting unicharset from box file /tmp/chi_sim-2019-06-20.GWh/chi_sim.WenQuanYi_Zen_Hei_Medium.exp0.box
Wrote unicharset file /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset
[Thu Jun 20 04:45:31 UTC 2019] /usr/local/bin/set_unicharset_properties -U /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset -O /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset -X /tmp/chi_sim-2019-06-20.GWh/chi_sim.xheights --script_dir=/home/ubuntu/langdata_lstm
Loaded unicharset of size 4021 from file /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Latin.unicharset
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Han.unicharset
Warning: properties incomplete for index 3 = “
Warning: properties incomplete for index 4 = 僧
Warning: properties incomplete for index 5 = 多
Warning: properties incomplete for index 6 = 粥
Warning: properties incomplete for index 7 = 少
Warning: properties incomplete for index 8 = -
Warning: properties incomplete for index 9 = 设
Warning: properties incomplete for index 10 = 备
Warning: properties incomplete for index 11 = 看
Warning: properties incomplete for index 12 = 对
Warning: properties incomplete for index 13 = 策
Warning: properties incomplete for index 14 = 各
Warning: properties incomplete for index 15 = 维
Warning: properties incomplete for index 16 = 权
Warning: properties incomplete for index 17 = 。
Warning: properties incomplete for index 18 = 脚
Warning: properties incomplete for index 19 = 钉
Warning: properties incomplete for index 20 = ↑
Warning: properties incomplete for index 21 = 切
Warning: properties incomplete for index 22 = 角
Warning: properties incomplete for index 23 = ↓
...
Warning: properties incomplete for index 4014 = 咀
Warning: properties incomplete for index 4015 = 闺
Warning: properties incomplete for index 4016 = 嘻
Warning: properties incomplete for index 4017 = 蝴
Warning: properties incomplete for index 4018 = 瑛
Warning: properties incomplete for index 4019 = 驿
Warning: properties incomplete for index 4020 = 硼
Writing unicharset to file /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset
and
=== Constructing LSTM training data ===
[Thu Jun 20 04:51:53 UTC 2019] /usr/local/bin/combine_lang_model --input_unicharset /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset --script_dir /home/ubuntu/langdata_lstm --words /home/ubuntu/langdata_lstm/chi_sim/chi_sim.wordlist --numbers /home/ubuntu/langdata_lstm/chi_sim/chi_sim.numbers --puncs /home/ubuntu/langdata_lstm/chi_sim/chi_sim.punc --output_dir /home/ubuntu/tesstutorial/chi_sim_arrow --lang chi_sim
Loaded unicharset of size 4021 from file /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Latin.unicharset
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Han.unicharset
Warning: properties incomplete for index 3 = “
Warning: properties incomplete for index 4 = 僧
Warning: properties incomplete for index 5 = 多
Warning: properties incomplete for index 6 = 粥
Warning: properties incomplete for index 7 = 少
Warning: properties incomplete for index 8 = -
Warning: properties incomplete for index 9 = 设
Warning: properties incomplete for index 10 = 备
...
Warning: properties incomplete for index 4012 = 觅
Warning: properties incomplete for index 4013 = 弄
Warning: properties incomplete for index 4014 = 咀
Warning: properties incomplete for index 4015 = 闺
Warning: properties incomplete for index 4016 = 嘻
Warning: properties incomplete for index 4017 = 蝴
Warning: properties incomplete for index 4018 = 瑛
Warning: properties incomplete for index 4019 = 驿
Warning: properties incomplete for index 4020 = 硼
Config file is optional, continuing...
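To compare runs before and after a fix, the two message types in the logs above can be tallied from a saved console log. This is a rough sketch, not part of the training scripts: `summarize_unicharset_log` is a hypothetical helper, and it assumes the console output was saved to a file and that the messages are worded exactly as shown above.

```shell
#!/bin/bash
# Summarize unicharset-related messages in a saved training console log.
# $1: path to the saved log file.
summarize_unicharset_log() {
  local log="$1"
  # Count the "properties incomplete" warnings (grep -c prints 0 if none).
  printf 'incomplete properties: %s\n' \
    "$(grep -c 'properties incomplete for index' "$log")"
  # List each script unicharset path that failed to load, once.
  grep -o 'Failed to load script unicharset from:.*' "$log" \
    | sed 's/^.*from://' \
    | sort -u
}
```

For example, after saving the output of a run to `train.log` (file name chosen here for illustration), `summarize_unicharset_log train.log` prints the warning count followed by the paths that could not be loaded.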
Fonts source
https://packages.ubuntu.com/bionic/fonts/fonts-wqy-zenhei
http://wenq.org/wqy2/index.cgi?action=browse&id=Home&lang=en
Arial Unicode MS is available on Windows.
I am afraid that I moved too many files into the script subdirectory. Pull request #23 should fix that.
Thanks!
https://github.com/tesseract-ocr/langdata_lstm/commit/02cc8f028532367dd44ba5fb3cbb6ac0bf73d6ad moved all script-related data to the script subfolder. This leads to errors and warnings during training, e.g.
I do not know how important these properties are for LSTM and Legacy tesseract training.
@stweil What do you suggest to do in this case?
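For anyone checking their own checkout, here is a minimal sketch that reports whether the script unicharsets named in the log (Latin, Han) sit at the top level of the langdata directory or in the script subfolder. The function name is made up for illustration, and it only looks at file locations, not at what the training tools actually search.

```shell
#!/bin/bash
# Report where each script unicharset lives relative to a langdata directory.
# $1: langdata directory (the one passed to tesstrain.sh as --langdata_dir).
# remaining args: script names, e.g. Latin Han.
locate_script_unicharsets() {
  local dir="$1"; shift
  local script
  for script in "$@"; do
    if [ -f "$dir/$script.unicharset" ]; then
      echo "$script: top level"
    elif [ -f "$dir/script/$script.unicharset" ]; then
      echo "$script: script/ subfolder"
    else
      echo "$script: not found"
    fi
  done
}
```

Example call: `locate_script_unicharsets ~/langdata_lstm Latin Han`. If both report the script/ subfolder, that matches the "Failed to load script unicharset" paths in the log, which point at the top level.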