tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

error related to script data during training #22

Closed: Shreeshrii closed this issue 5 years ago

Shreeshrii commented 5 years ago

Commit https://github.com/tesseract-ocr/langdata_lstm/commit/02cc8f028532367dd44ba5fb3cbb6ac0bf73d6ad moved all script-related data to the script subfolder.

This leads to errors and warnings during training, e.g.:

Wrote unicharset file /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset
[Wed Jun 19 18:46:20 UTC 2019] /usr/local/bin/set_unicharset_properties -U /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset -O /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset -X /tmp/chi_sim-2019-06-19.fKD/chi_sim.xheights --script_dir=/home/ubuntu/langdata_lstm
Loaded unicharset of size 5090 from file /tmp/chi_sim-2019-06-19.fKD/chi_sim.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Latin.unicharset
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Han.unicharset
Warning: properties incomplete for index 3 = “
Warning: properties incomplete for index 4 = 《
Warning: properties incomplete for index 5 = 副

I do not know how important these properties are for LSTM and Legacy tesseract training.
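The two "Failed to load script unicharset" lines in the log show the tool looking for Latin.unicharset and Han.unicharset directly under the --script_dir path. The following is a minimal shell sketch for checking where those files actually sit in a checkout. The function name find_script_unicharset is hypothetical, and the "script" subfolder name is taken from the commit description above, so the layout is an assumption about the local langdata_lstm copy.

```shell
#!/bin/bash
# Hypothetical helper: report where SCRIPT.unicharset lives in a langdata checkout.
# Usage: find_script_unicharset LANGDATA_DIR SCRIPT_NAME
find_script_unicharset() {
  local dir="$1" script="$2"
  if [ -f "$dir/$script.unicharset" ]; then
    # Location the failing command in the log expects.
    echo "$dir/$script.unicharset"
  elif [ -f "$dir/script/$script.unicharset" ]; then
    # Assumed location after the files were moved to the script subfolder.
    echo "$dir/script/$script.unicharset"
  else
    echo "missing: $script.unicharset" >&2
    return 1
  fi
}

# Example for the two scripts named in the log (the path is a placeholder):
# for s in Latin Han; do find_script_unicharset ~/langdata_lstm "$s"; done
```

If the printed path contains /script/, the file exists but not where the command with --script_dir pointing at the langdata root looks for it.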

@stweil What do you suggest doing in this case?

stweil commented 5 years ago

How can I reproduce this locally?

Shreeshrii commented 5 years ago

You can try:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata_lstm --maxpages 5 \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

Also see my PR at https://github.com/tesseract-ocr/tesseract/pull/2511, which proposes a solution for this.

Shreeshrii commented 5 years ago

Please use chi_sim instead of eng in the above command.

stweil commented 5 years ago

text2image fails because several fonts are missing from my installation: AR PL UMing Patched Light, Arial Unicode MS, Arial Unicode MS Bold and WenQuanYi Zen Hei Medium. Do you know where I can get those fonts, or can I replace them with different ones? (Arial Unicode MS seems to be commercial.)

Shreeshrii commented 5 years ago

I did the test run with only a few fonts by adding --fontlist 'Font 1' 'Font 2' to the tesstrain.sh command.

Even with just one font that covers both English and Han characters, you should see the errors related to both unicharsets.
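Before narrowing --fontlist, it can help to see which of the requested fonts are actually installed. The sketch below compares font names against a saved font listing. The function name missing_fonts and the file path are hypothetical, and the example assumes that text2image --list_available_fonts prints one font name per line for the given --fonts_dir, which should be verified against the local Tesseract build.

```shell
#!/bin/bash
# Hypothetical helper: print each required font name that does not appear
# in a text file listing the installed fonts (one entry per line).
# Usage: missing_fonts FONT_LIST_FILE "Font 1" "Font 2" ...
missing_fonts() {
  local list="$1"; shift
  local font
  for font in "$@"; do
    # -F: fixed-string match, -q: quiet. This is a substring match, so
    # "Arial Unicode MS" also matches a line "Arial Unicode MS Bold".
    grep -qF -- "$font" "$list" || echo "$font"
  done
}

# Example (assumed listing command; some builds may print to stderr):
# text2image --list_available_fonts --fonts_dir ~/.fonts > /tmp/fonts.txt 2>&1
# missing_fonts /tmp/fonts.txt "Arial Unicode MS" "WenQuanYi Zen Hei Medium"
```

Any name it prints is a candidate to drop from --fontlist or to install first.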

Shreeshrii commented 5 years ago

OK. I am at my PC now; here are the exact commands used and the log:

#!/bin/bash

# arrow training text uses head -400, tail -500 from langdata_lstm training_text
# plus text overloaded with up and down arrows which are for adding to unicharset
# using only two fonts matching with okfonts

rm -rf ~/tesstutorial/chi_sim_arrow

~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --training_text ./chi_sim.arrow.training_text \
  --langdata_dir ~/langdata_lstm \
  --tessdata_dir ~/tessdata_best \
  --lang chi_sim --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --maxpages 0 \
  --workspace_dir ~/tmp \
  --save_box_tiff \
  --fontlist \
    "Arial Unicode MS" \
    "WenQuanYi Zen Hei Medium" \
  --output_dir ~/tesstutorial/chi_sim_arrow

Relevant portion from console log:

=== Phase UP: Generating unicharset and unichar properties files ===
[Thu Jun 20 04:45:26 UTC 2019] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset --norm_mode 1 /tmp/chi_sim-2019-06-20.GWh/chi_sim.Arial_Unicode_MS.exp0.box /tmp/chi_sim-2019-06-20.GWh/chi_sim.WenQuanYi_Zen_Hei_Medium.exp0.box
Extracting unicharset from box file /tmp/chi_sim-2019-06-20.GWh/chi_sim.Arial_Unicode_MS.exp0.box
Extracting unicharset from box file /tmp/chi_sim-2019-06-20.GWh/chi_sim.WenQuanYi_Zen_Hei_Medium.exp0.box
Wrote unicharset file /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset
[Thu Jun 20 04:45:31 UTC 2019] /usr/local/bin/set_unicharset_properties -U /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset -O /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset -X /tmp/chi_sim-2019-06-20.GWh/chi_sim.xheights --script_dir=/home/ubuntu/langdata_lstm
Loaded unicharset of size 4021 from file /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Latin.unicharset
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Han.unicharset
Warning: properties incomplete for index 3 = “
Warning: properties incomplete for index 4 = 僧
Warning: properties incomplete for index 5 = 多
Warning: properties incomplete for index 6 = 粥
Warning: properties incomplete for index 7 = 少
Warning: properties incomplete for index 8 = -
Warning: properties incomplete for index 9 = 设
Warning: properties incomplete for index 10 = 备
Warning: properties incomplete for index 11 = 看
Warning: properties incomplete for index 12 = 对
Warning: properties incomplete for index 13 = 策
Warning: properties incomplete for index 14 = 各
Warning: properties incomplete for index 15 = 维
Warning: properties incomplete for index 16 = 权
Warning: properties incomplete for index 17 = 。
Warning: properties incomplete for index 18 = 脚
Warning: properties incomplete for index 19 = 钉
Warning: properties incomplete for index 20 = ↑
Warning: properties incomplete for index 21 = 切
Warning: properties incomplete for index 22 = 角
Warning: properties incomplete for index 23 = ↓
...
Warning: properties incomplete for index 4014 = 咀
Warning: properties incomplete for index 4015 = 闺
Warning: properties incomplete for index 4016 = 嘻
Warning: properties incomplete for index 4017 = 蝴
Warning: properties incomplete for index 4018 = 瑛
Warning: properties incomplete for index 4019 = 驿
Warning: properties incomplete for index 4020 = 硼
Writing unicharset to file /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset

and

=== Constructing LSTM training data ===
[Thu Jun 20 04:51:53 UTC 2019] /usr/local/bin/combine_lang_model --input_unicharset /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset --script_dir /home/ubuntu/langdata_lstm --words /home/ubuntu/langdata_lstm/chi_sim/chi_sim.wordlist --numbers /home/ubuntu/langdata_lstm/chi_sim/chi_sim.numbers --puncs /home/ubuntu/langdata_lstm/chi_sim/chi_sim.punc --output_dir /home/ubuntu/tesstutorial/chi_sim_arrow --lang chi_sim
Loaded unicharset of size 4021 from file /tmp/chi_sim-2019-06-20.GWh/chi_sim.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Latin.unicharset
Failed to load script unicharset from:/home/ubuntu/langdata_lstm/Han.unicharset
Warning: properties incomplete for index 3 = “
Warning: properties incomplete for index 4 = 僧
Warning: properties incomplete for index 5 = 多
Warning: properties incomplete for index 6 = 粥
Warning: properties incomplete for index 7 = 少
Warning: properties incomplete for index 8 = -
Warning: properties incomplete for index 9 = 设
Warning: properties incomplete for index 10 = 备
...
Warning: properties incomplete for index 4012 = 觅
Warning: properties incomplete for index 4013 = 弄
Warning: properties incomplete for index 4014 = 咀
Warning: properties incomplete for index 4015 = 闺
Warning: properties incomplete for index 4016 = 嘻
Warning: properties incomplete for index 4017 = 蝴
Warning: properties incomplete for index 4018 = 瑛
Warning: properties incomplete for index 4019 = 驿
Warning: properties incomplete for index 4020 = 硼
Config file is optional, continuing...
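
Because the warnings run to thousands of lines, a short summary of a saved console log is easier to compare between runs. This is a minimal sketch, assuming the console output was saved to a file; the function name summarize_unicharset_log is hypothetical, and the match strings are copied from the log lines above.

```shell
#!/bin/bash
# Hypothetical helper: summarize unicharset-related problems in a saved console log.
# Usage: summarize_unicharset_log LOG_FILE
summarize_unicharset_log() {
  local log="$1"
  local failed incomplete
  # grep -c prints the number of matching lines (0 if none); "|| true" keeps
  # the non-zero exit status of a no-match grep from aborting under "set -e".
  failed=$(grep -c 'Failed to load script unicharset' "$log" || true)
  incomplete=$(grep -c 'Warning: properties incomplete' "$log" || true)
  echo "failed_script_loads=$failed incomplete_properties=$incomplete"
}

# Example (the log path is a placeholder):
# summarize_unicharset_log ~/tesstutorial/chi_sim_arrow.log
```

Both counts dropping to zero after a change would suggest the script unicharsets are being found again.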

Shreeshrii commented 5 years ago

Font sources:

https://packages.ubuntu.com/bionic/fonts/fonts-wqy-zenhei

http://wenq.org/wqy2/index.cgi?action=browse&id=Home&lang=en

Arial Unicode MS is available on Windows.

stweil commented 5 years ago

I am afraid that I moved too many files into the script subdirectory. Pull request #23 should fix that.

Shreeshrii commented 5 years ago

Thanks!