tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

failure training #323

Closed ccampisano closed 1 year ago

ccampisano commented 1 year ago

Hi, I was able to build tesseract from git and run the tesstrain script, but the latter failed this way:

corrado@debian:~/tesstrain$ make training MODEL_NAME=cdi
set -x; \
tesseract "data/cdi-ground-truth/12-174.png" data/cdi-ground-truth/12-174 --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/12-174.png data/cdi-ground-truth/12-174 --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/06-corrado.png" -t "data/cdi-ground-truth/06-corrado.gt.txt" > "data/cdi-ground-truth/06-corrado.box"
set -x; \
tesseract "data/cdi-ground-truth/06-corrado.png" data/cdi-ground-truth/06-corrado --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/06-corrado.png data/cdi-ground-truth/06-corrado --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/04-santa-marinella.png" -t "data/cdi-ground-truth/04-santa-marinella.gt.txt" > "data/cdi-ground-truth/04-santa-marinella.box"
set -x; \
tesseract "data/cdi-ground-truth/04-santa-marinella.png" data/cdi-ground-truth/04-santa-marinella --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/04-santa-marinella.png data/cdi-ground-truth/04-santa-marinella --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/14-emissione.png" -t "data/cdi-ground-truth/14-emissione.gt.txt" > "data/cdi-ground-truth/14-emissione.box"
set -x; \
tesseract "data/cdi-ground-truth/14-emissione.png" data/cdi-ground-truth/14-emissione --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/14-emissione.png data/cdi-ground-truth/14-emissione --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/08-ca.png" -t "data/cdi-ground-truth/08-ca.gt.txt" > "data/cdi-ground-truth/08-ca.box"
set -x; \
tesseract "data/cdi-ground-truth/08-ca.png" data/cdi-ground-truth/08-ca --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/08-ca.png data/cdi-ground-truth/08-ca --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/10-fd.png" -t "data/cdi-ground-truth/10-fd.gt.txt" > "data/cdi-ground-truth/10-fd.box"
set -x; \
tesseract "data/cdi-ground-truth/10-fd.png" data/cdi-ground-truth/10-fd --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/10-fd.png data/cdi-ground-truth/10-fd --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/13-ita.png" -t "data/cdi-ground-truth/13-ita.gt.txt" > "data/cdi-ground-truth/13-ita.box"
set -x; \
tesseract "data/cdi-ground-truth/13-ita.png" data/cdi-ground-truth/13-ita --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/13-ita.png data/cdi-ground-truth/13-ita --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/11-m.png" -t "data/cdi-ground-truth/11-m.gt.txt" > "data/cdi-ground-truth/11-m.box"
set -x; \
tesseract "data/cdi-ground-truth/11-m.png" data/cdi-ground-truth/11-m --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/11-m.png data/cdi-ground-truth/11-m --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/15-scadenza.png" -t "data/cdi-ground-truth/15-scadenza.gt.txt" > "data/cdi-ground-truth/15-scadenza.box"
set -x; \
tesseract "data/cdi-ground-truth/15-scadenza.png" data/cdi-ground-truth/15-scadenza --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/15-scadenza.png data/cdi-ground-truth/15-scadenza --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/09-63452.png" -t "data/cdi-ground-truth/09-63452.gt.txt" > "data/cdi-ground-truth/09-63452.box"
set -x; \
tesseract "data/cdi-ground-truth/09-63452.png" data/cdi-ground-truth/09-63452 --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/09-63452.png data/cdi-ground-truth/09-63452 --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/01-repubblica-italiana.png" -t "data/cdi-ground-truth/01-repubblica-italiana.gt.txt" > "data/cdi-ground-truth/01-repubblica-italiana.box"
set -x; \
tesseract "data/cdi-ground-truth/01-repubblica-italiana.png" data/cdi-ground-truth/01-repubblica-italiana --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/01-repubblica-italiana.png data/cdi-ground-truth/01-repubblica-italiana --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/02-ministero-interno.png" -t "data/cdi-ground-truth/02-ministero-interno.gt.txt" > "data/cdi-ground-truth/02-ministero-interno.box"
set -x; \
tesseract "data/cdi-ground-truth/02-ministero-interno.png" data/cdi-ground-truth/02-ministero-interno --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/02-ministero-interno.png data/cdi-ground-truth/02-ministero-interno --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/05-campisano.png" -t "data/cdi-ground-truth/05-campisano.gt.txt" > "data/cdi-ground-truth/05-campisano.box"
set -x; \
tesseract "data/cdi-ground-truth/05-campisano.png" data/cdi-ground-truth/05-campisano --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/05-campisano.png data/cdi-ground-truth/05-campisano --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/07-luogo-data.png" -t "data/cdi-ground-truth/07-luogo-data.gt.txt" > "data/cdi-ground-truth/07-luogo-data.box"
set -x; \
tesseract "data/cdi-ground-truth/07-luogo-data.png" data/cdi-ground-truth/07-luogo-data --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/07-luogo-data.png data/cdi-ground-truth/07-luogo-data --psm 13 lstm.train
read_params_file: Can't open lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/cdi-ground-truth/03-carta-di-identita.png" -t "data/cdi-ground-truth/03-carta-di-identita.gt.txt" > "data/cdi-ground-truth/03-carta-di-identita.box"
set -x; \
tesseract "data/cdi-ground-truth/03-carta-di-identita.png" data/cdi-ground-truth/03-carta-di-identita --psm 13 lstm.train
+ tesseract data/cdi-ground-truth/03-carta-di-identita.png data/cdi-ground-truth/03-carta-di-identita --psm 13 lstm.train
read_params_file: Can't open lstm.train
python3 shuffle.py 0 "data/cdi/all-lstmf"
+ head -n 13 data/cdi/all-lstmf
+ tail -n 2 data/cdi/all-lstmf
combine_lang_model \
  --input_unicharset data/cdi/unicharset \
  --script_dir data/langdata \
  --numbers data/cdi/cdi.numbers \
  --puncs data/cdi/cdi.punc \
  --words data/cdi/cdi.wordlist \
  --output_dir data \
   \
  --lang cdi
Failed to read data from: data/cdi/cdi.wordlist
Failed to read data from: data/cdi/cdi.punc
Failed to read data from: data/cdi/cdi.numbers
Loaded unicharset of size 32 from file data/cdi/unicharset
Setting unichar properties
Other case c of C is not in unicharset
Other case o of O is not in unicharset
Other case r of R is not in unicharset
Other case a of A is not in unicharset
Other case d of D is not in unicharset
Other case s of S is not in unicharset
Other case n of N is not in unicharset
Other case t of T is not in unicharset
Other case m of M is not in unicharset
Other case i of I is not in unicharset
Other case e of E is not in unicharset
Other case l of L is not in unicharset
Other case f of F is not in unicharset
Other case p of P is not in unicharset
Other case u of U is not in unicharset
Other case b of B is not in unicharset
Setting script properties
Config file is optional, continuing...
Failed to read data from: data/langdata/cdi/cdi.config
Null char=2
lstmtraining \
  --debug_interval 0 \
  --traineddata data/cdi/cdi.traineddata \
  --learning_rate 0.002 \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c`head -n1 data/cdi/unicharset`]" \
  --model_output data/cdi/checkpoints/cdi \
  --train_listfile data/cdi/list.train \
  --eval_listfile data/cdi/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01
Warning: given outputs 32 not equal to unicharset of 31.
Num outputs,weights in Series:
  1,36,0,1:1, 0
Num outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  TxyLfys48:48, 12480
  Lfx96:96, 55680
  RxLrx96:96, 74112
  Lfx192:192, 221952
  Fc31:31, 5983
Total weights = 370367
Built network:[1,36,0,1[C3,3Ft16]Mp3,3TxyLfys48Lfx96RxLrx96Lfx192Fc31] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c32]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=30
Deserialize header failed: data/cdi-ground-truth/12-174.lstmf
Deserialize header failed: data/cdi-ground-truth/13-ita.lstmf
Deserialize header failed: data/cdi-ground-truth/08-ca.lstmf
Deserialize header failed: data/cdi-ground-truth/03-carta-di-identita.lstmf
Deserialize header failed: data/cdi-ground-truth/07-luogo-data.lstmf
Deserialize header failed: data/cdi-ground-truth/11-m.lstmf
Deserialize header failed: data/cdi-ground-truth/09-63452.lstmf
Deserialize header failed: data/cdi-ground-truth/01-repubblica-italiana.lstmf
Deserialize header failed: data/cdi-ground-truth/10-fd.lstmf
Load of page 0 failed!
Load of images failed!!
make: *** [Makefile:326: data/cdi/checkpoints/cdi_checkpoint] Error 1

any hints?

thx and rgrds, corrado

Shawnsdaddy commented 1 year ago

Having the same issue

zdenop commented 1 year ago

Please provide the test case (all files) to reproduce the problem.

ccampisano commented 1 year ago

@zdenop here's the training material

thx and regards, corrado cdi-ground-truth.zip

zdenop commented 1 year ago

Please also post each step (the commands you ran) needed to reproduce the problem.

ccampisano commented 1 year ago

the only command I ran was "make training MODEL_NAME=cdi"

sven-nm commented 1 year ago

Having exactly the same issue here since reinstalling tesseract, despite lstm.train being in tessdata_dir/configs

make training MODEL_NAME=test_trained START_MODEL=grc OUTPUT_DIR=/scratch/sven/ocr_exp/models/test/train GROUND_TRUTH_DIR=/scratch/sven/ocr_exp/datasets/test CORES=12 EPOCHS=1 
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_87.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_87 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_87.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_87 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_71.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_71 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_71.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_71 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_88.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_88 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_88.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_88 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_92.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_92 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_92.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_92 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_65.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_65 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_65.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_65 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_17.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_17 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_17.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_17 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_69.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_69 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_69.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_69 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_24.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_24 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_24.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_24 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_73.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_73 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_73.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_73 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_60.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_60 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_60.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_60 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_74.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_74 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_74.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_74 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_91.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_91 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_91.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_91 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_68.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_68 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_68.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_68 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_7.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_7 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_7.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_7 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_64.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_64 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_64.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_64 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_21.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_21 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_21.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_21 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_10.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_10 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_10.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_10 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_93.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_93 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_93.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_93 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_19.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_19 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_19.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_19 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_2.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_2 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_2.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_2 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_82.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_82 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_82.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_82 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_25.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_25 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_25.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_25 --psm 13 lstm.train
read_params_file: Can't open lstm.train
set -x; \
tesseract "/scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_75.png" /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_75 --psm 13 lstm.train
+ tesseract /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_75.png /scratch/sven/ocr_exp/datasets/test/sophoclesplaysa05campgoog_0336_75 --psm 13 lstm.train
read_params_file: Can't open lstm.train
python3 shuffle.py 0 "/scratch/sven/ocr_exp/models/test/train/all-lstmf"
/bin/bash: line 1: bc: command not found
/bin/bash: line 4: bc: command not found
+ head -n '' /scratch/sven/ocr_exp/models/test/train/all-lstmf
head: invalid number of lines: ''
+ tail -n '' /scratch/sven/ocr_exp/models/test/train/all-lstmf
tail: invalid number of lines: ''
make: *** [Makefile:191: /scratch/sven/ocr_exp/models/test/train/list.train] Error 1

zdenop commented 1 year ago

read_params_file: Can't open lstm.train indicates that there is a problem with the tesseract installation. How did you install tesseract?
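A quick way to see the knock-on effect in the ground-truth folder is to list the images that ended up without a usable .lstmf file. This is only a sketch, assuming the "Deserialize header failed" lines in the first log come from .lstmf files that were never written or are empty because tesseract could not open lstm.train; the function name missing_lstmf is made up for illustration.

```shell
#!/bin/sh
# List ground-truth images whose .lstmf file is missing or empty.
# $1 is the ground-truth directory, e.g. data/cdi-ground-truth.
missing_lstmf() {
    for png in "$1"/*.png; do
        [ -e "$png" ] || continue            # the glob matched nothing
        lstmf="${png%.png}.lstmf"            # same base name, .lstmf extension
        [ -s "$lstmf" ] || printf '%s\n' "$png"
    done
}
```

Calling `missing_lstmf data/cdi-ground-truth` prints one line per image that still needs a valid .lstmf file; no output means every image has a non-empty one.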

bc: command not found indicates that the bc utility is not in the PATH.
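To make the bc symptom concrete: in the second log the counts passed to head and tail are empty right after the two "bc: command not found" lines, while in the first log a list of 15 entries was split into 13 and 2. The sketch below checks for missing tools and reproduces that split with plain integer arithmetic. The 90 percent ratio is an assumption inferred from the 13/2 split, and the function names are made up for illustration; this is not the Makefile's own code.

```shell
#!/bin/sh
# Print each named tool that cannot be found in the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Print "<train> <eval>" line counts for a list of $1 lines,
# with $2 percent going to training (integer arithmetic, no bc needed).
split_counts() {
    train=$(( $1 * $2 / 100 ))
    printf '%s %s\n' "$train" "$(( $1 - train ))"
}
```

For example, `missing_tools bc tesseract lstmtraining combine_lang_model` prints nothing when all four are reachable, and `split_counts 15 90` gives the same two numbers as the head and tail calls in the first log.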

ccampisano commented 1 year ago

"read_params_file: Can't open lstm.train indicates that there is a problem with the tesseract installation. How did you install tesseract?"

"bc: command not found indicated that bc utility is not in the path."

I installed tesseract from the git repo, doing configure, make, etc.

How should I install it?

BTW: "bc" was installed (Already to the newest version 1.07.1-2+b2)

zdenop commented 1 year ago

@ccampisano the 'bc' issue concerns @sven-nm, who thinks they have the same problem as you. Please post the installation log of tesseract.

ccampisano commented 1 year ago

@zdenop I didn't record the installation log, but it went fine. I'll redo and report here asap.

zdenop commented 1 year ago

See the similar issue https://github.com/tesseract-ocr/tesstrain/issues/325 and please try a clean installation (uninstall everything and install from scratch). First try the sample data and, if that works, try your data.

ccampisano commented 1 year ago

@zdenop please find attached installation logs, I followed instructions in the repo's readme.

Notice I had a problem during configure and had to run it with --disable-dependency-tracking

install.log config.log

Please let me know what to do next, my aim is to be able to create custom traindata.

zdenop commented 1 year ago

Can you please post the output of the following commands? echo $TESSDATA_PREFIX and tesseract a b -l c

ccampisano commented 1 year ago

@zdenop here's the results:

corrado@tesseract:~$ echo $TESSDATA_PREFIX

corrado@tesseract:~$ tesseract a b -l c
Error opening data file /usr/local/share/tessdata/c.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'c'
Tesseract couldn't load any languages!
Could not initialize tesseract.

zdenop commented 1 year ago

According to the data you posted, you installed tesseract to /usr/local/bin, and tesseract searches for its data in subdirectories of /usr/local/share/tessdata/ (lstm.train is installed to /usr/local/share/tessdata/configs). So tesseract is installed correctly. Can you please double-check that there is no other tesseract installation (e.g. in /usr/bin)?
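The checks described above can be scripted roughly as follows. This is a sketch under a few assumptions: TESSDATA_PREFIX, when set, names the tessdata directory itself (as the error message above asks for), the fallback directory is passed in by hand, and the three function names are made up for illustration.

```shell
#!/bin/sh
# Print the tessdata directory tesseract is expected to use:
# TESSDATA_PREFIX when it is set, otherwise the fallback passed as $1.
tessdata_dir() {
    if [ -n "${TESSDATA_PREFIX:-}" ]; then
        printf '%s\n' "$TESSDATA_PREFIX"
    else
        printf '%s\n' "$1"
    fi
}

# Succeed only when configs/lstm.train exists under the tessdata directory $1.
has_lstm_train() {
    [ -f "$1/configs/lstm.train" ]
}

# Print every executable file called $1 found in the PATH, in lookup order.
list_on_path() {
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        [ -x "${dir:-.}/$1" ] && [ ! -d "${dir:-.}/$1" ] && printf '%s\n' "${dir:-.}/$1"
    done
    IFS=$old_ifs
}
```

With the paths from this thread, `has_lstm_train "$(tessdata_dir /usr/local/share/tessdata)" || echo "lstm.train not found"` tells whether the config is where tesseract looks, and more than one line from `list_on_path tesseract` would point to a second installation.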

Can you now run make training MODEL_NAME=cdi?

ccampisano commented 1 year ago

@zdenop there is no other tesseract installation:

corrado@tesseract:~$ ls /usr/bin/ | grep tess
corrado@tesseract:~$ which tesseract 
/usr/local/bin/tesseract

root@tesseract:~# apt remove tesseract-ocr
Lettura elenco dei pacchetti... Fatto
Generazione albero delle dipendenze... Fatto
Lettura informazioni sullo stato... Fatto   
Il pacchetto "tesseract-ocr" non è installato e quindi non è stato rimosso
0 aggiornati, 0 installati, 0 da rimuovere e 0 non aggiornati.

BTW:
1) I didn't run make training and sudo make training-install yet, should I? (see here)
2) should I run make training MODEL_NAME=cdi from the tesseract folder where I worked so far, or in the tesstrain folder?
3) where to put the training data folder?

thanks corrado

zdenop commented 1 year ago

Yes, please run sudo make training-install first.

Maybe first run the training on the example data (see e.g. this tutorial; just skip installing tesseract, as you already did it manually).

Also, you need to install eng.traineddata and osd.traineddata (make tesseract-langs in tesstrain, see the README).
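Put together, the steps mentioned in this thread could look like the sketch below. The two directory arguments are hypothetical placeholders for the tesseract source tree and the tesstrain checkout, the helper names are made up, and it assumes tesseract was configured and built in its source tree. With DRY_RUN=1 the commands are only printed, so the sequence can be reviewed before anything is run.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# $1: tesseract source tree (hypothetical path)
# $2: tesstrain checkout (hypothetical path)
# $3: model name, e.g. cdi
setup_and_train() {
    run make -C "$1" training &&                 # build the training tools
    run sudo make -C "$1" training-install &&    # install them
    run make -C "$2" tesseract-langs &&          # fetch eng/osd traineddata
    run make -C "$2" training MODEL_NAME="$3"    # start the training
}
```

For example, `DRY_RUN=1 setup_and_train ~/tesseract ~/tesstrain cdi` prints the four commands in order without running them.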

ccampisano commented 1 year ago

@zdenop thanks for your support, I was able to run the training correctly (and didn't need osd.traineddata).

The trained file was correctly generated, but its performances are very poor, compared to the regular "ita" file.

how could I improve this?

zdenop commented 1 year ago

Congratulations!

'its performances are very poor, compared to the regular "ita" file'

This is in line with the documentation. Did you read it? Or did you expect that 10 minutes of training would give you a better result than Google got with its resources?