tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Failed to read data/Assert failed #394

Open T0biasCZe opened 5 months ago

T0biasCZe commented 5 months ago

When trying to fine-tune a model, I get "Failed to read data" errors followed by an "Assert failed" error:

C:\Users\tobik\source\repos\tesstrain>make training MODEL_NAME=ocrd-testset START_MODEL=ces TESSDATA=C:\tessdata
You are using make version: 4.4.1
combine_tessdata -u C:\tessdata/ces.traineddata data/ces/ocrd-testset
Extracting tessdata components from C:\tessdata/ces.traineddata
Wrote data/ces/ocrd-testset.lstm
Wrote data/ces/ocrd-testset.lstm-punc-dawg
Wrote data/ces/ocrd-testset.lstm-word-dawg
Wrote data/ces/ocrd-testset.lstm-number-dawg
Wrote data/ces/ocrd-testset.lstm-unicharset
Wrote data/ces/ocrd-testset.lstm-recoder
Wrote data/ces/ocrd-testset.version
Version:4.00.00alpha:ces:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx384O1c1]
17:lstm:size=7541987, offset=192
18:lstm-punc-dawg:size=322, offset=7542179
19:lstm-word-dawg:size=3366074, offset=7542501
20:lstm-number-dawg:size=2114, offset=10908575
21:lstm-unicharset:size=7028, offset=10910689
22:lstm-recoder:size=1111, offset=10917717
23:version:size=80, offset=10918828
unicharset_extractor --output_unicharset "data/ocrd-testset/my.unicharset" --norm_mode 2 "data/ocrd-testset/all-gt"
Extracting unicharset from plain text file data/ocrd-testset/all-gt
Other case W of w is not in unicharset
Other case O of o is not in unicharset
Other case R of r is not in unicharset
Other case I of i is not in unicharset
Other case U of u is not in unicharset
Other case E of e is not in unicharset
Other case G of g is not in unicharset
Other case k of K is not in unicharset
Other case V of v is not in unicharset
Other case Y of y is not in unicharset
Other case Z of z is not in unicharset
Other case J of j is not in unicharset
Wrote unicharset file data/ocrd-testset/my.unicharset
merge_unicharsets data/ces/ocrd-testset.lstm-unicharset data/ocrd-testset/my.unicharset "data/ocrd-testset/unicharset"
Loaded unicharset of size 123 from file data/ces/ocrd-testset.lstm-unicharset
Loaded unicharset of size 45 from file data/ocrd-testset/my.unicharset
Wrote unicharset file data/ocrd-testset/unicharset.
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0105_008 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0117_023 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0127_011 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0155_024 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0175_017 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0188_011 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0223_018 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0245_023 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0287_011 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006.tif" -t "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006.gt.txt" > "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006.box"
tesseract "data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006.tif" data/ocrd-testset-ground-truth/wienbarg_feldzuege_1834_0318_006 --psm 13 lstm.train
python shuffle.py 0 "data/ocrd-testset/all-lstmf"
python generate_eval_train.py data/ocrd-testset/all-lstmf 0.90

dos2unix "data/ocrd-testset/ocrd-testset.numbers"
dos2unix: data/ocrd-testset/ocrd-testset.numbers: No such file or directory
dos2unix: Skipping data/ocrd-testset/ocrd-testset.numbers, not a regular file.
make: [Makefile:290: data/ocrd-testset/ocrd-testset.traineddata] Error 2 (ignored)
dos2unix "data/ocrd-testset/ocrd-testset.punc"
dos2unix: data/ocrd-testset/ocrd-testset.punc: No such file or directory
dos2unix: Skipping data/ocrd-testset/ocrd-testset.punc, not a regular file.
make: [Makefile:291: data/ocrd-testset/ocrd-testset.traineddata] Error 2 (ignored)
dos2unix "data/ocrd-testset/ocrd-testset.wordlist"
dos2unix: data/ocrd-testset/ocrd-testset.wordlist: No such file or directory
dos2unix: Skipping data/ocrd-testset/ocrd-testset.wordlist, not a regular file.
make: [Makefile:292: data/ocrd-testset/ocrd-testset.traineddata] Error 2 (ignored)
dos2unix "data/langdata/ocrd-testset/ocrd-testset.config"
dos2unix: data/langdata/ocrd-testset/ocrd-testset.config: No such file or directory
dos2unix: Skipping data/langdata/ocrd-testset/ocrd-testset.config, not a regular file.
make: [Makefile:293: data/ocrd-testset/ocrd-testset.traineddata] Error 2 (ignored)
combine_lang_model \
  --input_unicharset data/ocrd-testset/unicharset \
  --script_dir data/langdata \
  --numbers data/ocrd-testset/ocrd-testset.numbers \
  --puncs data/ocrd-testset/ocrd-testset.punc \
  --words data/ocrd-testset/ocrd-testset.wordlist \
  --output_dir data \
   \
  --lang ocrd-testset
Failed to read data from: data/ocrd-testset/ocrd-testset.wordlist
Failed to read data from: data/ocrd-testset/ocrd-testset.punc
Failed to read data from: data/ocrd-testset/ocrd-testset.numbers
Loaded unicharset of size 126 from file data/ocrd-testset/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data/langdata/Inherited.unicharset
Config file is optional, continuing...
Failed to read data from: data/langdata/ocrd-testset/ocrd-testset.config
Null char=2
Created data/ocrd-testset/ocrd-testset.traineddata
lstmtraining \
  --debug_interval 0 \
  --traineddata data/ocrd-testset/ocrd-testset.traineddata \
  --old_traineddata C:\tessdata/ces.traineddata \
  --continue_from data/ces/ocrd-testset.lstm \
  --learning_rate 0.0001 \
  --model_output data/ocrd-testset/checkpoints/ocrd-testset \
  --train_listfile data/ocrd-testset/list.train \
  --eval_listfile data/ocrd-testset/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01 \
2>&1 | tee -a data/ocrd-testset/training.log
Loaded file data/ces/ocrd-testset.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 122 to 125!
old_mgr.Init(old_traineddata):Error:Assert failed:in file ../../../src/training/unicharset/lstmtrainer.cpp, line 132

lstmtraining \
--stop_training \
--continue_from data/ocrd-testset/checkpoints/ocrd-testset_checkpoint \
--traineddata data/ocrd-testset/ocrd-testset.traineddata \
--model_output data/ocrd-testset.traineddata
Failed to read continue from: data/ocrd-testset/checkpoints/ocrd-testset_checkpoint
make: *** [Makefile:325: data/ocrd-testset.traineddata] Error 1

zdenop commented 5 months ago

Which version of Tesseract are you using?

stweil commented 5 months ago

I get a slightly different output and no crash when I try this on Debian GNU Linux:

$ lstmtraining \
  --debug_interval 0 \
  --traineddata data/ocrd-testset/ocrd-testset.traineddata \
  --old_traineddata ../tessdata_best/ces.traineddata \
  --continue_from data/ces/ocrd-testset.lstm \
  --learning_rate 0.0001 \
  --model_output data/ocrd-testset/checkpoints/ocrd-testset \
  --train_listfile data/ocrd-testset/list.train \
  --eval_listfile data/ocrd-testset/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01 \
2>&1 | tee -a data/ocrd-testset/training.log
Loaded file data/ces/ocrd-testset.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 122 to 131!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  TxyLfys64:64, 20736
  Lfx96:96, 61824
  RxLrx96:96, 74112
  Lfx384:384, 738816
  Fc131:131, 50435
Total weights = 946083
Previous null char=121 mapped to 130
Continuing from data/ces/ocrd-testset.lstm
2 Percent improvement time=100, best error was 100 @ 0
At iteration 100/100/100, mean rms=2.136%, delta=7.610%, BCER train=27.051%, BWER train=59.946%, skip ratio=0.000%, New best BCER = 27.051 wrote best model:data/ocrd-testset/checkpoints/ocrd-testset_27.051_100_100.checkpoint wrote checkpoint.
2 Percent improvement time=100, best error was 27.051 @ 100
At iteration 200/200/200, mean rms=1.956%, delta=6.367%, BCER train=24.516%, BWER train=54.783%, skip ratio=0.000%, New best BCER = 24.516 wrote best model:data/ocrd-testset/checkpoints/ocrd-testset_24.516_200_200.checkpoint wrote checkpoint.

zdenop commented 5 months ago

I tried both the latest code and 5.4.0, and I cannot reproduce it.

tesseract -v
tesseract 5.4.0
 leptonica-1.84.2 (May 13 2024, 19:39:23) [MSC v.1929 LIB Release x64]
  libgif 5.1.2 : libjpeg 6b (libjpeg-turbo 2.1.90) : libpng 1.6.40 : libtiff 4.6.0 : zlib 1.2.13.zlib-ng : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 200203

I have ICU version 74.2.

briannicholas commented 4 months ago

I had the same problem.

It's a Windows issue: you need to specify the TESSDATA path using forward slashes.

So for the OP, use

C:\Users\tobik\source\repos\tesstrain>make training MODEL_NAME=ocrd-testset START_MODEL=ces TESSDATA=C:/tessdata

rather than

TESSDATA=C:\tessdata
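
The slash fix above can be sketched in Python. The helper names `to_forward_slashes` and `as_posix_shell_sees` are made up for this illustration and are not part of tesstrain. The second helper uses `shlex` to emulate POSIX shell word splitting; it shows one possible reason the backslash form fails (an unquoted `\t` loses its backslash), on the assumption, not confirmed in this thread, that make hands the `lstmtraining` recipe to a POSIX-style shell.

```python
import shlex
from pathlib import PureWindowsPath


def to_forward_slashes(path: str) -> str:
    """Rewrite a Windows path so that it uses forward slashes only."""
    # PureWindowsPath parses Windows syntax on any OS; as_posix() swaps the separators.
    return PureWindowsPath(path).as_posix()


def as_posix_shell_sees(word: str) -> str:
    """Emulate how a POSIX shell reads one unquoted word (assumption: make uses such a shell)."""
    # In POSIX mode an unquoted backslash escapes the next character and is then dropped.
    return shlex.split(word)[0]


if __name__ == "__main__":
    original = r"C:\tessdata"
    fixed = to_forward_slashes(original)

    # Command line with the corrected TESSDATA value.
    print(f"make training MODEL_NAME=ocrd-testset START_MODEL=ces TESSDATA={fixed}")

    # What a POSIX shell would pass on for each spelling of the path.
    print(as_posix_shell_sees(original + "/ces.traineddata"))
    print(as_posix_shell_sees(fixed + "/ces.traineddata"))
```

With the backslash spelling the emulated shell yields `C:tessdata/ces.traineddata`, a path that does not exist, while the forward-slash spelling passes through unchanged.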