tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

lstm fine tune +/- chars - segmentation fault when unicharset edited #1283

Open wosiu opened 6 years ago

wosiu commented 6 years ago

Tesseract version: current master (c9169e5a); also tested on ebbfc3ae8df85c. Mode: LSTM

TL;DR: something goes wrong during lstmtraining if any line is removed from the original unicharset.

I'm trying to fine-tune the digitsbest.traineddata model using the following sequence of commands:

============ CASE 1 - it works ==============

To extract digitsbest.lstm:

combine_tessdata -u ../digitsbest.traineddata digitsbest2.

For training:

lstmtraining --model_output liczniki_model \
  --traineddata ../tessdata/digitsbest.traineddata \
  --train_listfile liczniki.training_files.txt \
  --max_iterations 4000 --target_error_rate 0.01 \
  --continue_from ../tessdata/digitsbest_uncombined/digitsbest2.lstm 

All good.

============ CASE 2 - DOES NOT WORK ==============

Before training, I edit digitsbest2.lstm-unicharset by removing a few lines and updating the count at the beginning, so it changes FROM:

20
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
+ 0 64,91,180,222,1.01629,0.208909,0.146983,0.129428,1.11175,0.248757 Common 3 3 3 +    # + [2b ]
1 8 64,70,205,251,0.638677,0.167108,0.214363,0.14141,0.838072,0.180313 Common 4 2 4 1   # 1 [31 ]0
- 10 107,126,138,158,0.568514,0.130816,0.112746,0.117257,0.664044,0.160051 Common 5 3 5 -   # - [2d ]p
2 8 64,65,208,254,0.96915,0.143008,0.112364,0.133742,1.07969,0.142624 Common 6 2 6 2    # 2 [32 ]0
0 8 64,65,208,254,1.02998,0.140967,0.103248,0.0956228,1.12365,0.142766 Common 7 2 7 0   # 0 [30 ]0
5 8 58,65,209,254,0.954893,0.141873,0.129152,0.127772,1.06566,0.143759 Common 8 2 8 5   # 5 [35 ]0
6 8 64,65,221,255,0.994537,0.124062,0.124553,0.112235,1.09686,0.140651 Common 9 2 9 6   # 6 [36 ]0
3 8 58,65,209,254,0.949504,0.138527,0.123931,0.137766,1.06688,0.149633 Common 10 2 10 3 # 3 [33 ]0
9 8 58,68,209,254,0.993301,0.125705,0.156308,0.135153,1.11077,0.137772 Common 11 2 11 9 # 9 [39 ]0
8 8 64,65,222,255,1.00021,0.13277,0.106435,0.108115,1.09598,0.137484 Common 12 2 12 8   # 8 [38 ]0
7 8 58,70,207,254,0.946542,0.126568,0.154401,0.11639,1.05141,0.15142 Common 13 2 13 7   # 7 [37 ]0
( 10 13,54,228,255,0.594727,0.164642,0.156593,0.114807,0.658837,0.140379 Common 14 10 17 (  # ( [28 ]p
4 8 59,69,209,253,1.01905,0.124317,0.100678,0.119407,1.11185,0.14632 Common 15 2 15 4   # 4 [34 ]0
. 10 64,70,93,115,0.319569,0.101374,0.117354,0.11857,0.436319,0.127161 Common 16 6 16 . # . [2e ]p
) 10 13,57,228,255,0.594787,0.161932,0.063558,0.147954,0.665259,0.143481 Common 17 10 14 )  # ) [29 ]p
: 10 64,71,169,193,0.371572,0.123684,0.121114,0.11078,0.495559,0.113143 Common 18 6 18 :    # : [3a ]p
, 10 24,47,92,115,0.362396,0.0896366,0.0823997,0.128016,0.443343,0.117966 Common 19 6 19 ,  # , [2c ]p

TO:

13
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
1 8 64,70,205,251,0.638677,0.167108,0.214363,0.14141,0.838072,0.180313 Common 4 2 4 1   # 1 [31 ]0
2 8 64,65,208,254,0.96915,0.143008,0.112364,0.133742,1.07969,0.142624 Common 6 2 6 2    # 2 [32 ]0
0 8 64,65,208,254,1.02998,0.140967,0.103248,0.0956228,1.12365,0.142766 Common 7 2 7 0   # 0 [30 ]0
5 8 58,65,209,254,0.954893,0.141873,0.129152,0.127772,1.06566,0.143759 Common 8 2 8 5   # 5 [35 ]0
6 8 64,65,221,255,0.994537,0.124062,0.124553,0.112235,1.09686,0.140651 Common 9 2 9 6   # 6 [36 ]0
3 8 58,65,209,254,0.949504,0.138527,0.123931,0.137766,1.06688,0.149633 Common 10 2 10 3 # 3 [33 ]0
9 8 58,68,209,254,0.993301,0.125705,0.156308,0.135153,1.11077,0.137772 Common 11 2 11 9 # 9 [39 ]0
8 8 64,65,222,255,1.00021,0.13277,0.106435,0.108115,1.09598,0.137484 Common 12 2 12 8   # 8 [38 ]0
7 8 58,70,207,254,0.946542,0.126568,0.154401,0.11639,1.05141,0.15142 Common 13 2 13 7   # 7 [37 ]0
4 8 59,69,209,253,1.01905,0.124317,0.100678,0.119407,1.11185,0.14632 Common 15 2 15 4   # 4 [34 ]0
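Hand-editing a unicharset like this is error-prone: the count on the first line must exactly match the number of remaining entry lines. As a minimal sketch (a hypothetical helper, not part of the Tesseract tooling), the same edit can be done programmatically so the count never drifts out of sync:

```python
def strip_unicharset(text, drop):
    """Drop unicharset entries whose glyph (the first whitespace-separated
    field) is in `drop`, and rewrite the count on the first line so it
    matches the number of remaining entry lines."""
    lines = text.strip("\n").split("\n")
    entries = [ln for ln in lines[1:] if ln.split()[0] not in drop]
    return "\n".join([str(len(entries))] + entries) + "\n"

# Toy example: remove '+' and '-' from a 5-entry unicharset
# (fields abbreviated with '...' for readability).
src = "\n".join([
    "5",
    "NULL 0 Common 0",
    "+ 0 ... Common 3 3 3 +",
    "1 8 ... Common 4 2 4 1",
    "- 10 ... Common 5 3 5 -",
    "2 8 ... Common 6 2 6 2",
]) + "\n"
out = strip_unicharset(src, {"+", "-"})
print(out)  # count is now 3 and the '+' / '-' entries are gone
```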

Then I combine it back into a traineddata file using:

combine_tessdata digitsbest2.

Then I run lstmtraining as in CASE 1, but with the new traineddata file:

lstmtraining --model_output liczniki_model \
  --traineddata liczniki/digitsbest2.traineddata \
  --old_traineddata liczniki/digitsbest.traineddata \
  --train_listfile liczniki.training_files.txt \
  --max_iterations 4000 --target_error_rate 0.1 \
  --continue_from liczniki/digitsbest2.lstm

And I get:

Loaded file liczniki/digitsbest2.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 19 to 19!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc19:19, 9747
Total weights = 1413811
Previous null char=18 mapped to 18
Continuing from liczniki/digitsbest2.lstm
Loaded 7/7 pages (1-7) of document /home/m/OCR/liczniki_lstmf/liczniki..100.lstmf
Loaded 8/8 pages (1-8) of document /home/m/OCR/liczniki_lstmf/liczniki..101.lstmf
Loaded 8/8 pages (1-8) of document /home/m/OCR/liczniki_lstmf/liczniki..102.lstmf
...
2 Percent improvement time=42, best error was 100 @ 0
At iteration 42/100/100, Mean rms=2.897%, delta=3.918%, char train=10.923%, word train=19.714%, skip ratio=0%,  New best char error = 10.923 wrote best model:liczniki_model10.923_42.checkpoint wrote checkpoint.

Loaded 7/7 pages (1-7) of document /home/m/OCR/liczniki_lstmf/liczniki..48.lstmf
Loaded 7/7 pages (1-7) of document /home/m/OCR/liczniki_lstmf/liczniki..49.lstmf
Loaded 6/6 pages (1-6) of document /home/m/OCR/liczniki_lstmf/liczniki..4.lstmf
....
2 Percent improvement time=6, best error was 10.923 @ 42
At iteration 48/200/200, Mean rms=1.786%, delta=2.045%, char train=5.923%, word train=10.286%, skip ratio=0%,  New best char error = 5.923 Transitioned to stage 1 wrote best model:liczniki_model5.923_48.checkpoint wrote checkpoint.

Naruszenie ochrony pamięci (zrzut pamięci)

("Naruszenie ochrony pamięci (zrzut pamięci)" is Polish for "Segmentation fault (core dumped)".)

Shreeshrii commented 6 years ago

If you want a digits model, you should try plus-minus training from the English model in tessdata_best. See the wiki page on training.

wosiu commented 6 years ago

I tried with eng.traineddata from tessdata_best. Same story: segmentation fault during lstmtraining. Command:

lstmtraining --model_output liczniki_model \
  --traineddata ../tessdata/tessdata_best/tmp/eng.traineddata \
  --old_traineddata ../tessdata/tessdata_best/eng.traineddata \
  --train_listfile liczniki.training_files.txt \
  --max_iterations 4000 --target_error_rate 0.1 \
  --continue_from ../tessdata/tessdata_best/tmp/eng.lstm

Here tessdata/tessdata_best/tmp/eng.traineddata was combined with an edited eng.lstm-unicharset in which only the digits are left.

Shreeshrii commented 6 years ago

============ CASE 2 - DOES NOT WORK ============== Before training, I edit digitsbest2.lstm-unicharset by removing a few lines and updating the count at the beginning,

The unicharset, the lstmf files, and the files listed in --train_listfile liczniki.training_files.txt ALL need to be in sync. You are removing lines from the unicharset, but if the lstmf files contain those characters, it will NOT work.

Please follow the proper training procedure as mentioned in the wiki.
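The sync requirement described above can be checked before training: every character that appears in the box files (from which the lstmf files are built) should also appear in the edited unicharset. A minimal sketch, using toy in-memory data rather than real files:

```python
def unicharset_glyphs(text):
    """Glyph set from unicharset content: the first field of every entry
    line (everything after the count on line 1)."""
    return {ln.split()[0] for ln in text.splitlines()[1:] if ln.strip()}

def box_chars(text):
    """Character set from Tesseract box-file content: the first field of
    each line (format: char left bottom right top page)."""
    return {ln.split()[0] for ln in text.splitlines() if ln.strip()}

# Toy data: a digits-only unicharset vs. a box file that contains ':'.
uni = "3\nNULL 0 Common 0\n1 8 ... Common 4 2 4 1\n2 8 ... Common 6 2 6 2\n"
box = "1 10 20 30 40 0\n: 50 20 60 40 0\n"
missing = box_chars(box) - unicharset_glyphs(uni)
print(sorted(missing))  # → [':'] — box characters absent from the unicharset
```

If `missing` is non-empty, the lstmf files built from those boxes will contain characters the edited unicharset cannot represent.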

wosiu commented 6 years ago

I am sure the lstmf files contain only characters that are in the new unicharset file. Moreover, when I use the unicharset generated during lstmtraining, I get the same result: seg fault.

Note that the training procedure described in the wiki for +/- character fine-tuning assumes the image/box pairs are generated from text using text2image, which is not my case. I have my own images with box files, which I inject into the process of generating the lstmf files.

Btw, maybe that is the problem? Should I be getting a traineddata file while generating the lstmf files? Currently that step gives me only one lstmf file per image/box pair plus one unicharset. What I do next is take the old traineddata and replace its unicharset with the new one, and such a file I use after the --old_traineddata flag.


Shreeshrii commented 6 years ago

Should I get traineddata file during lstmf files generating? Currently in this process I get only lstmf files for each image/box pair and one unicharset.

The LSTM training process now requires a starter traineddata.

The tesstrain.sh process creates it after the lstmf files are generated, using combine_lang_model.

If you want to follow a custom path for training, you should make sure your process includes all the required steps (check tesstrain.sh and the related scripts).

Currently tesseract does not support training from your own box/tif pairs.