tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Segmentation fault when trying to train tesseract with given training data #191

Closed by abandonware-magazines 3 years ago

abandonware-magazines commented 3 years ago

I'm hitting a segmentation fault when trying to train Tesseract with the attached data: ground_truth.zip

Steps to reproduce:

  1. Build Leptonica and Tesseract: make leptonica tesseract
  2. Download the Hebrew traineddata (run in ~/Workspace/tesstrain/usr/share/tessdata, master branch): wget https://github.com/tesseract-ocr/tessdata/raw/master/heb.traineddata
  3. Unzip the attached ground truth (run in ~/Workspace/tesstrain, master branch): unzip /path/to/ground_truth.zip -d data/my_proj-ground-truth
  4. Start training (run in ~/Workspace/tesstrain, master branch): make training MODEL_NAME=my_proj START_MODEL=heb LANG_TYPE=RTL

Output:

$ make training MODEL_NAME=my_proj START_MODEL=heb LANG_TYPE=RTL
find data/my_proj-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/my_proj/all-gt"
combine_tessdata -u /home/owner/Workspace/tesstrain/usr/share/tessdata/heb.traineddata  data/heb/my_proj
Extracting tessdata components from /home/owner/Workspace/tesstrain/usr/share/tessdata/heb.traineddata
Wrote data/heb/my_proj.unicharset
Wrote data/heb/my_proj.unicharambigs
Wrote data/heb/my_proj.inttemp
Wrote data/heb/my_proj.pffmtable
Wrote data/heb/my_proj.normproto
Wrote data/heb/my_proj.punc-dawg
Wrote data/heb/my_proj.word-dawg
Wrote data/heb/my_proj.number-dawg
Wrote data/heb/my_proj.freq-dawg
Wrote data/heb/my_proj.shapetable
Wrote data/heb/my_proj.bigram-dawg
Wrote data/heb/my_proj.lstm
Wrote data/heb/my_proj.lstm-punc-dawg
Wrote data/heb/my_proj.lstm-word-dawg
Wrote data/heb/my_proj.lstm-number-dawg
Wrote data/heb/my_proj.lstm-unicharset
Wrote data/heb/my_proj.lstm-recoder
Wrote data/heb/my_proj.version
Version string:Pre-4.0.0+4.00.00alpha:heb:best2int20180322
1:unicharset:size=5241, offset=192
2:unicharambigs:size=3480, offset=5433
3:inttemp:size=839668, offset=8913
4:pffmtable:size=602, offset=848581
5:normproto:size=8989, offset=849183
6:punc-dawg:size=2402, offset=858172
7:word-dawg:size=1318826, offset=860574
8:number-dawg:size=1362, offset=2179400
9:freq-dawg:size=1362, offset=2180762
13:shapetable:size=34150, offset=2182124
14:bigram-dawg:size=2122794, offset=2216274
17:lstm:size=393194, offset=4339068
18:lstm-punc-dawg:size=1378, offset=4732262
19:lstm-word-dawg:size=673826, offset=4733640
20:lstm-number-dawg:size=1298, offset=5407466
21:lstm-unicharset:size=4023, offset=5408764
22:lstm-recoder:size=625, offset=5412787
23:version:size=43, offset=5413412
unicharset_extractor --output_unicharset "data/my_proj/my.unicharset" --norm_mode 3 "data/my_proj/all-gt"
Bad box coordinates in boxfile string! מול המחשב. הוא רצה להספיק לבצע כמה דברים לפני שילך ראשית הוא בדק אם נשלחו אליו הודעות. ואכן, ירון שלח אליו אנגלי-עברי. יואב מיהר להכניס את המילון לתיק וחזר אל הודעה ובה הוא מבקש להזכיר לו להביא לבית הספר מילון המחשב האישי שבחדרו. לאחר ארוחת הבוקר הביט יואב פעם לבית הספר. יואב זרק מעליו את השמיכה, קם בזריזות הוריד מן המקרר את רשימת הקניות שאמו הכינה לו ו...התיישב מהמיטה, ובדרכו אל המקלחת הדליק את נוספת בשעון. השעה הייתה 07:20. "יופי, יש לי מספיק זמן." הוא
Extracting unicharset from plain text file data/my_proj/all-gt
Wrote unicharset file data/my_proj/my.unicharset
merge_unicharsets data/heb/my_proj.lstm-unicharset data/my_proj/my.unicharset  "data/my_proj/unicharset"
Loaded unicharset of size 69 from file data/heb/my_proj.lstm-unicharset
Loaded unicharset of size 36 from file data/my_proj/my.unicharset
Wrote unicharset file data/my_proj/unicharset.
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/6.png" -t "data/my_proj-ground-truth/6.gt.txt" > "data/my_proj-ground-truth/6.box"
+ tesseract data/my_proj-ground-truth/6.png data/my_proj-ground-truth/6 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/8.png" -t "data/my_proj-ground-truth/8.gt.txt" > "data/my_proj-ground-truth/8.box"
+ tesseract data/my_proj-ground-truth/8.png data/my_proj-ground-truth/8 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/10.png" -t "data/my_proj-ground-truth/10.gt.txt" > "data/my_proj-ground-truth/10.box"
+ tesseract data/my_proj-ground-truth/10.png data/my_proj-ground-truth/10 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/9.png" -t "data/my_proj-ground-truth/9.gt.txt" > "data/my_proj-ground-truth/9.box"
+ tesseract data/my_proj-ground-truth/9.png data/my_proj-ground-truth/9 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/3.png" -t "data/my_proj-ground-truth/3.gt.txt" > "data/my_proj-ground-truth/3.box"
+ tesseract data/my_proj-ground-truth/3.png data/my_proj-ground-truth/3 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/7.png" -t "data/my_proj-ground-truth/7.gt.txt" > "data/my_proj-ground-truth/7.box"
+ tesseract data/my_proj-ground-truth/7.png data/my_proj-ground-truth/7 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/1.png" -t "data/my_proj-ground-truth/1.gt.txt" > "data/my_proj-ground-truth/1.box"
+ tesseract data/my_proj-ground-truth/1.png data/my_proj-ground-truth/1 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/5.png" -t "data/my_proj-ground-truth/5.gt.txt" > "data/my_proj-ground-truth/5.box"
+ tesseract data/my_proj-ground-truth/5.png data/my_proj-ground-truth/5 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/2.png" -t "data/my_proj-ground-truth/2.gt.txt" > "data/my_proj-ground-truth/2.box"
+ tesseract data/my_proj-ground-truth/2.png data/my_proj-ground-truth/2 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/my_proj-ground-truth/4.png" -t "data/my_proj-ground-truth/4.gt.txt" > "data/my_proj-ground-truth/4.box"
+ tesseract data/my_proj-ground-truth/4.png data/my_proj-ground-truth/4 --psm 13 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
find data/my_proj-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/my_proj/all-lstmf"
+ head -n 9 data/my_proj/all-lstmf
+ tail -n 1 data/my_proj/all-lstmf
combine_lang_model \
  --input_unicharset data/my_proj/unicharset \
  --script_dir data \
  --numbers data/my_proj/my_proj.numbers \
  --puncs data/my_proj/my_proj.punc \
  --words data/my_proj/my_proj.wordlist \
  --output_dir data \
  --pass_through_recoder --lang_is_rtl \
  --lang my_proj
Failed to read data from: data/my_proj/my_proj.wordlist
Failed to read data from: data/my_proj/my_proj.punc
Failed to read data from: data/my_proj/my_proj.numbers
Loaded unicharset of size 69 from file data/my_proj/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data/Hebrew.unicharset
Failed to load script unicharset from:data/Latin.unicharset
Warning: properties incomplete for index 3 = ,
Warning: properties incomplete for index 4 = ג
Warning: properties incomplete for index 5 = '
Warning: properties incomplete for index 6 = א
Warning: properties incomplete for index 7 = ק
Warning: properties incomplete for index 8 = ו
Warning: properties incomplete for index 9 = מ
Warning: properties incomplete for index 10 = ה
Warning: properties incomplete for index 11 = ד
Warning: properties incomplete for index 12 = ל
Warning: properties incomplete for index 13 = .
Warning: properties incomplete for index 14 = ת
Warning: properties incomplete for index 15 = ם
Warning: properties incomplete for index 16 = ?
Warning: properties incomplete for index 17 = 1
Warning: properties incomplete for index 18 = 5
Warning: properties incomplete for index 19 = 9
Warning: properties incomplete for index 20 = -
Warning: properties incomplete for index 21 = 4
Warning: properties incomplete for index 22 = ש
Warning: properties incomplete for index 23 = נ
Warning: properties incomplete for index 24 = ב
Warning: properties incomplete for index 25 = י
Warning: properties incomplete for index 26 = "
Warning: properties incomplete for index 27 = פ
Warning: properties incomplete for index 28 = ר
Warning: properties incomplete for index 29 = ח
Warning: properties incomplete for index 30 = צ
Warning: properties incomplete for index 31 = 3
Warning: properties incomplete for index 32 = 6
Warning: properties incomplete for index 33 = ס
Warning: properties incomplete for index 34 = )
Warning: properties incomplete for index 35 = :
Warning: properties incomplete for index 36 = 2
Warning: properties incomplete for index 37 = 0
Warning: properties incomplete for index 38 = כ
Warning: properties incomplete for index 39 = ז
Warning: properties incomplete for index 40 = ט
Warning: properties incomplete for index 41 = ן
Warning: properties incomplete for index 42 = /
Warning: properties incomplete for index 43 = (
Warning: properties incomplete for index 44 = 8
Warning: properties incomplete for index 45 = 7
Warning: properties incomplete for index 46 = %
Warning: properties incomplete for index 47 = +
Warning: properties incomplete for index 48 = ץ
Warning: properties incomplete for index 49 = ע
Warning: properties incomplete for index 50 = ך
Warning: properties incomplete for index 51 = ;
Warning: properties incomplete for index 52 = !
Warning: properties incomplete for index 53 = ְ
Warning: properties incomplete for index 54 = ַ
Warning: properties incomplete for index 55 = ָ
Warning: properties incomplete for index 56 = ּ
Warning: properties incomplete for index 57 = *
Warning: properties incomplete for index 58 = ף
Warning: properties incomplete for index 59 = ִ
Warning: properties incomplete for index 60 = \
Warning: properties incomplete for index 61 = |
Warning: properties incomplete for index 62 = ֶ
Warning: properties incomplete for index 63 = >
Warning: properties incomplete for index 64 = ]
Warning: properties incomplete for index 65 = [
Warning: properties incomplete for index 66 = ₪
Warning: properties incomplete for index 67 = =
Warning: properties incomplete for index 68 = <
Config file is optional, continuing...
Failed to read data from: data/my_proj/my_proj.config
lstmtraining \
  --debug_interval 0 \
  --traineddata data/my_proj/my_proj.traineddata \
  --old_traineddata /home/owner/Workspace/tesstrain/usr/share/tessdata/heb.traineddata \
  --continue_from data/heb/my_proj.lstm \
  --learning_rate 0.0001 \
  --model_output data/my_proj/checkpoints/my_proj \
  --train_listfile data/my_proj/list.train \
  --eval_listfile data/my_proj/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01
Loaded file data/heb/my_proj.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 69 to 69!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys48:48, 12480
  Lfx96:96, 55680
  Lrx96:96, 74112
  Lfx192:192, 221952
  Fc69:69, 0
Total weights = 364384
Previous null char=2 mapped to 2
Continuing from data/heb/my_proj.lstm
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/9.lstmf
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/4.lstmf
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/10.lstmf
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/5.lstmf
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/2.lstmf
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/8.lstmf
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/1.lstmf
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/7.lstmf
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/3.lstmf
Loaded 1/1 lines (1-1) of document data/my_proj-ground-truth/6.lstmf
Makefile:266: recipe for target 'data/my_proj/checkpoints/my_proj_checkpoint' failed
make: *** [data/my_proj/checkpoints/my_proj_checkpoint] Segmentation fault (core dumped)

Any idea what might be the problem?

Thanks in advance!

Shreeshrii commented 3 years ago

--old_traineddata /home/owner/Workspace/tesstrain/usr/share/tessdata/heb.traineddata

Is this from the tessdata_best repo?

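Training does not work from the default (tessdata) or fast (tessdata_fast) models, because those are integer models. Use the model from tessdata_best as the start model.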
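
A quick way to spot this before training is to look at the "Version string:" line that combine_tessdata prints, as in the output above, where it reads "best2int". The sketch below is only a heuristic: it assumes that a "best2int" marker in the version string means the model was converted to integer, which is inferred from the name and from this thread, not from documented behavior. The function name looks_like_integer_model is made up for illustration.

```shell
# Hypothetical helper: reads combine_tessdata output on stdin and succeeds
# (exit status 0) if the "Version string:" line contains "best2int".
# ASSUMPTION: "best2int" marks an integer model; this is a heuristic only.
looks_like_integer_model() {
  grep -q 'Version string:.*best2int'
}

# Usage (not run here). Note that -u also writes the extracted components
# to the given output prefix, as in the log above:
#   combine_tessdata -u usr/share/tessdata/heb.traineddata data/heb/my_proj 2>&1 \
#     | looks_like_integer_model \
#     && echo "Looks like an integer model; try the tessdata_best file instead."
```

A model that does not match is not guaranteed to be trainable; the check only catches the case seen in this issue.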

abandonware-magazines commented 3 years ago

Thanks! I was using the "fast" integer model from the tessdata repo (https://github.com/tesseract-ocr/tessdata/raw/master/heb.traineddata). I switched to "best" (https://github.com/tesseract-ocr/tessdata_best/raw/master/heb.traineddata) and training started to work.
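
For anyone landing here with the same crash, the fix can be sketched as follows. The URL pattern and the make command are taken from this thread; the function names best_model_url and fetch_best_model are hypothetical, and the tessdata directory is passed in as an argument, not assumed.

```shell
# Build the tessdata_best download URL for a language code (e.g. heb).
# The pattern is the one used in the link posted in this thread.
best_model_url() {
  printf 'https://github.com/tesseract-ocr/tessdata_best/raw/master/%s.traineddata\n' "$1"
}

# Hypothetical helper: overwrite <tessdata_dir>/<lang>.traineddata with the
# "best" model so that START_MODEL=<lang> picks it up.
fetch_best_model() {
  lang=$1
  tessdata_dir=$2
  wget -O "$tessdata_dir/$lang.traineddata" "$(best_model_url "$lang")"
}

# Usage from the tesstrain checkout (not run here):
#   fetch_best_model heb usr/share/tessdata
#   make training MODEL_NAME=my_proj START_MODEL=heb LANG_TYPE=RTL
```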