tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

New Makefile to do lstmtraining from font and training_text using tesstrain.py #230

Closed Shreeshrii closed 3 years ago

Shreeshrii commented 3 years ago

This can be tested using the following bash script:

# Paths and model names
TESSDATA="$HOME/tessdata_best"
MODEL_NAME=engnew
DATA_DIR=data
START_MODEL=eng

## NEW Variables
TESSTRAIN_TEXT="$HOME/langdata/$START_MODEL/$START_MODEL.training_text"
TESSTRAIN_FONT=Arial
TESSTRAIN_MAX_LINES=10

# start from clean ground-truth and model directories
rm -rf "$DATA_DIR/$MODEL_NAME-ground-truth" "$DATA_DIR/$MODEL_NAME"
mkdir -p "$DATA_DIR/$MODEL_NAME-ground-truth" "$DATA_DIR/$MODEL_NAME"

# generate new training data
python ./src/training/tesstrain.py \
 --fonts_dir "$HOME/.fonts" \
 --fontlist "$TESSTRAIN_FONT" \
 --lang "$START_MODEL" \
 --linedata_only \
 --noextract_font_properties \
 --exposures "0" \
 --langdata_dir "$HOME/langdata_lstm" \
 --training_text "$TESSTRAIN_TEXT" \
 --tessdata_dir "$TESSDATA" \
 --save_box_tiff \
 --maxpages "$TESSTRAIN_MAX_LINES" \
 --output_dir "$DATA_DIR/$MODEL_NAME-ground-truth"

# move the list of lstmf files and the proto model files into the model directory
mv "$DATA_DIR/$MODEL_NAME-ground-truth/$START_MODEL.training_files.txt" "$DATA_DIR/$MODEL_NAME/all-lstmf"
mv "$DATA_DIR/$MODEL_NAME-ground-truth/$START_MODEL"/*.* "$DATA_DIR/$MODEL_NAME/"

The script produces the lstmf files (and, optionally, the box/tiff pairs), the all-lstmf list file, and the PROTO_MODEL. The Makefile may need to be modified so that it does not complain about missing .gt.txt/box/tif files and instead starts directly from the lstmtraining step.
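
Before handing these files to the Makefile, it can help to confirm that they are actually in place. The following is only a minimal sketch: `check_outputs` is a hypothetical helper (not part of tesstrain), and it assumes that all-lstmf lists one lstmf path per line and that the proto model is a .traineddata file in the model directory, as the script above suggests.

```shell
# Hypothetical helper (not part of tesstrain): report missing training artifacts.
# Assumes all-lstmf holds one lstmf path per line.
check_outputs() {
  dir="$1"
  missing=0
  # the list of lstmf files must exist and be non-empty
  if [ ! -s "$dir/all-lstmf" ]; then
    echo "missing or empty: $dir/all-lstmf"
    missing=1
  fi
  # the proto model is assumed to be a .traineddata file in the model directory
  if ! ls "$dir"/*.traineddata >/dev/null 2>&1; then
    echo "missing: $dir/*.traineddata"
    missing=1
  fi
  # every path listed in all-lstmf should point to an existing file
  if [ -s "$dir/all-lstmf" ]; then
    while IFS= read -r f; do
      [ -f "$f" ] || { echo "listed but not found: $f"; missing=1; }
    done < "$dir/all-lstmf"
  fi
  return "$missing"
}
```

With the variables from the script above, it could be called as `check_outputs "$DATA_DIR/$MODEL_NAME"`; a non-zero exit status means at least one artifact is missing.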

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging 50c00d80a2d3af2bb7440156878b3586fb141355 into 5fe64c252f3634f6b89606aa239b61544ecc7b42 - view on LGTM.com

new alerts:

Shreeshrii commented 3 years ago

@egorpugin Request you to review changes to src/training/tesstrain_utils.py

Shreeshrii commented 3 years ago

Two sample bash scripts show how to invoke the font2model Makefile.

Posting console logs from both as separate posts below, rather than attaching as files.

egorpugin commented 3 years ago

Also ask someone else, because I'm not really familiar with training tools scripts.

Shreeshrii commented 3 years ago

(base) ubuntu@tesseract-ocr-1:~/tesstrain$ bash -x font2model.sh eng Latin eng engINR FineTune 'Arial'
+ make -f Makefile-font2model MODEL_NAME=engINR clean-groundtruth clean-output clean-log
Makefile-font2model:212: warning: overriding recipe for target 'data/Latin.unicharset'
Makefile-font2model:209: warning: ignoring old recipe for target 'data/Latin.unicharset'
rm -rf data/engINR-ground-truth
rm -rf data/engINR
rm -rf data/engINR.log
+ make -f Makefile-font2model TESSDATA=/home/ubuntu/tessdata_best TESSTRAIN_FONTS_DIR=/usr/share/fonts TESSTRAIN_TEXT=data/engINR.training_text TESSTRAIN_MAX_LINES=100 EPOCHS=100 TESSTRAIN_LANG=eng TESSTRAIN_SCRIPT=Latin START_MODEL=eng MODEL_NAME=engINR TRAIN_TYPE=FineTune TESSTRAIN_FONT=Arial DEBUG_INTERVAL=-1 training --trace
Makefile-font2model:212: warning: overriding recipe for target 'data/Latin.unicharset'
Makefile-font2model:209: warning: ignoring old recipe for target 'data/Latin.unicharset'
make: Circular data/Latin.unicharset <- data/Latin.unicharset dependency dropped.
Makefile-font2model:192: update target 'data/engINR/engINR.traineddata' due to: data/Latin.unicharset
python ./src/training/tesstrain.py \
 --fonts_dir /usr/share/fonts \
 --fontlist Arial \
 --maxpages 100 \
 --lang eng \
 --langdata_dir data \
 --training_text data/engINR.training_text \
 --tessdata_dir /home/ubuntu/tessdata_best \
 --linedata_only --noextract_font_properties \
 --exposures "0" --save_box_tiff  \
 --output_dir data/engINR-ground-truth
[13:29:20] INFO - Log file location: /tmp/eng-2021-02-04hpvs0v9q/tesstrain.log
[13:29:20] INFO - === Starting training for language eng
[13:29:20] INFO - Testing font: Arial
[13:29:24] INFO - === Phase I: Generating training images ===
  0%|                                                                                                                                                                  | 0/1 [00:00<?, ?it/s]
[13:29:24] INFO - Rendering using Arial
[13:29:24] INFO - Running text2image on /tmp/eng-2021-02-04hpvs0v9q/eng.Arial.exp0.000001.gt.txt
[13:29:26] INFO - Running text2image on /tmp/eng-2021-02-04hpvs0v9q/eng.Arial.exp0.000002.gt.txt

...

[13:31:49] INFO - Running text2image on /tmp/eng-2021-02-04hpvs0v9q/eng.Arial.exp0.000099.gt.txt
[13:31:50] INFO - Running text2image on /tmp/eng-2021-02-04hpvs0v9q/eng.Arial.exp0.000100.gt.txt
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:27<00:00, 147.16s/it]
[13:31:51] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[13:31:52] INFO - === Phase E: Generating lstmf files ===
[13:31:52] INFO - Using TESSDATA_PREFIX=/home/ubuntu/tessdata_best
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:27<00:00,  3.63it/s]
[13:32:19] INFO - === Constructing LSTM training data ===
[13:32:19] INFO - Creating new directory data/engINR-ground-truth
[13:32:19] INFO - === Saving box/tiff pairs for training data ===
[13:32:19] INFO - === Moving lstmf files for training data ===
[13:32:19] INFO - All done!
mkdir -p data/engINR
mv -v data/engINR-ground-truth/eng.training_files.txt data/engINR/all-lstmf
renamed 'data/engINR-ground-truth/eng.training_files.txt' -> 'data/engINR/all-lstmf'
mv -v data/engINR-ground-truth/eng/eng.* data/engINR/
renamed 'data/engINR-ground-truth/eng/eng.charset_size=107.txt' -> 'data/engINR/eng.charset_size=107.txt'
renamed 'data/engINR-ground-truth/eng/eng.traineddata' -> 'data/engINR/eng.traineddata'
renamed 'data/engINR-ground-truth/eng/eng.unicharset' -> 'data/engINR/eng.unicharset'
rename "s/eng\./engINR\./g" data/engINR/*.*
Makefile-font2model:149: update target 'data/engINR/list.train' due to: data/engINR/all-lstmf
mkdir -p data/engINR
total=$(wc -l < data/engINR/all-lstmf); \
  train=$(echo "$total * 0.90 / 1" | bc); \
  test "$train" = "0" && \
    echo "Error: missing ground truth for training" && exit 1; \
  eval=$(echo "$total - $train" | bc); \
  test "$eval" = "0" && \
    echo "Error: missing ground truth for evaluation" && exit 1; \
  set -x; \
  head -n "$train" data/engINR/all-lstmf > "data/engINR/list.train"; \
  tail -n "$eval" data/engINR/all-lstmf > "data/engINR/list.eval"
+ head -n 89 data/engINR/all-lstmf
+ tail -n 10 data/engINR/all-lstmf
Makefile-font2model:173: update target 'data/engINR/checkpoints/engINR_checkpoint' due to: proto_model lists
mkdir -p data/eng
combine_tessdata -e /home/ubuntu/tessdata_best/eng.traineddata  data/eng/engINR.lstm
Extracting tessdata components from /home/ubuntu/tessdata_best/eng.traineddata
Wrote data/eng/engINR.lstm
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
mkdir -p data/engINR/checkpoints
lstmtraining \
  --continue_from data/eng/engINR.lstm --old_traineddata /home/ubuntu/tessdata_best/eng.traineddata  \
  --traineddata data/engINR/engINR.traineddata \
  --train_listfile data/engINR/list.train \
  --eval_listfile data/engINR/list.eval \
  --max_iterations -100 \
  --debug_interval -1 \
  --learning_rate 0.0001 \
  --target_error_rate 0.01 \
  --model_output data/engINR/checkpoints/engINR \
  > data/engINR.log 2>&1
^CMakefile-font2model:173: recipe for target 'data/engINR/checkpoints/engINR_checkpoint' failed
make: *** [data/engINR/checkpoints/engINR_checkpoint] Interrupt
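
The `head -n 89` / `tail -n 10` pair in the log above comes from the 90/10 train/eval split of all-lstmf. As a rough sketch of that arithmetic (not the Makefile's own recipe, which uses `bc`), the hypothetical helper `split_counts` below reproduces the same counts with shell integer arithmetic, rounding the training share down.

```shell
# Hypothetical helper: print "<train> <eval>" for a given number of lines,
# using a 90/10 split with integer arithmetic (training share rounded down).
split_counts() {
  total="$1"
  train=$(( total * 90 / 100 ))
  eval_n=$(( total - train ))
  echo "$train $eval_n"
}

split_counts 99    # prints "89 10"
split_counts 159   # prints "143 16"
```

The first call matches the 89/10 split in this log, which implies that all-lstmf had 99 lines.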

Shreeshrii commented 3 years ago

(base) ubuntu@tesseract-ocr-1:~/tesstrain$ bash -x engINR.sh eng Latin eng engINR FineTune
+ make -f Makefile-font2model MODEL_NAME=engINR clean-groundtruth clean-output clean-log
Makefile-font2model:212: warning: overriding recipe for target 'data/Latin.unicharset'
Makefile-font2model:209: warning: ignoring old recipe for target 'data/Latin.unicharset'
rm -rf data/engINR-ground-truth
rm -rf data/engINR
rm -rf data/engINR.log
+ make -f Makefile-font2model TESSDATA=/home/ubuntu/tessdata_best TESSTRAIN_FONTS_DIR=/usr/share/fonts TESSTRAIN_TEXT=data/engINR.training_text TESSTRAIN_MAX_LINES=5 EPOCHS=100 TESSTRAIN_LANG=eng TESSTRAIN_SCRIPT=Latin START_MODEL=eng MODEL_NAME=engINR TRAIN_TYPE=FineTune DEBUG_INTERVAL=-1 training --trace
Makefile-font2model:212: warning: overriding recipe for target 'data/Latin.unicharset'
Makefile-font2model:209: warning: ignoring old recipe for target 'data/Latin.unicharset'
make: Circular data/Latin.unicharset <- data/Latin.unicharset dependency dropped.
Makefile-font2model:192: update target 'data/engINR/engINR.traineddata' due to: data/Latin.unicharset
python ./src/training/tesstrain.py \
 --fonts_dir /usr/share/fonts \
  \
 --maxpages 5 \
 --lang eng \
 --langdata_dir data \
 --training_text data/engINR.training_text \
 --tessdata_dir /home/ubuntu/tessdata_best \
 --linedata_only --noextract_font_properties \
 --exposures "0" --save_box_tiff  \
 --output_dir data/engINR-ground-truth
[13:27:50] INFO - Log file location: /tmp/eng-2021-02-045l7tek0u/tesstrain.log
[13:27:50] INFO - === Starting training for language eng
[13:27:50] INFO - Testing font: Arial Bold
[13:27:54] INFO - === Phase I: Generating training images ===
  0%|                                                                                                                                                                 | 0/32 [00:00<?, ?it/s]
[13:27:54] INFO - Rendering using Arial Bold
[13:27:54] INFO - Rendering using Arial Bold Italic
[13:27:54] INFO - Rendering using Arial Italic
[13:27:54] INFO - Rendering using Arial
[13:27:54] INFO - Rendering using Courier New Bold
[13:27:54] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold.exp0.000001.gt.txt
[13:27:54] INFO - Rendering using Courier New Bold Italic
[13:27:54] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold_Italic.exp0.000001.gt.txt
[13:27:54] INFO - Rendering using Courier New Italic
[13:27:54] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Italic.exp0.000001.gt.txt
[13:27:54] INFO - Rendering using Courier New
[13:27:54] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial.exp0.000001.gt.txt
[13:27:54] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold.exp0.000001.gt.txt
[13:27:54] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold_Italic.exp0.000001.gt.txt
[13:27:54] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Italic.exp0.000001.gt.txt
[13:27:54] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New.exp0.000001.gt.txt
[13:27:56] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold.exp0.000002.gt.txt
[13:27:56] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold_Italic.exp0.000002.gt.txt
[13:27:56] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial.exp0.000002.gt.txt
[13:27:56] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold_Italic.exp0.000002.gt.txt
[13:27:56] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Italic.exp0.000002.gt.txt
[13:27:56] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold.exp0.000002.gt.txt
[13:27:56] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Italic.exp0.000002.gt.txt
[13:27:56] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New.exp0.000002.gt.txt
[13:27:57] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold_Italic.exp0.000003.gt.txt
[13:27:57] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold.exp0.000003.gt.txt
[13:27:57] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial.exp0.000003.gt.txt
[13:27:57] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold_Italic.exp0.000003.gt.txt
[13:27:57] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Italic.exp0.000003.gt.txt
[13:27:57] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Italic.exp0.000003.gt.txt
[13:27:57] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New.exp0.000003.gt.txt
[13:27:57] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold.exp0.000003.gt.txt
[13:27:59] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold.exp0.000004.gt.txt
[13:27:59] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold_Italic.exp0.000004.gt.txt
[13:27:59] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial.exp0.000004.gt.txt
[13:27:59] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Italic.exp0.000004.gt.txt
[13:27:59] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold_Italic.exp0.000004.gt.txt
[13:27:59] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Italic.exp0.000004.gt.txt
[13:27:59] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold.exp0.000004.gt.txt
[13:27:59] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New.exp0.000004.gt.txt
[13:28:00] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold.exp0.000005.gt.txt
[13:28:00] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Bold_Italic.exp0.000005.gt.txt
[13:28:00] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial.exp0.000005.gt.txt
[13:28:00] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Italic.exp0.000005.gt.txt
[13:28:00] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold_Italic.exp0.000005.gt.txt
[13:28:00] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New.exp0.000005.gt.txt
[13:28:00] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Courier_New_Bold.exp0.000005.gt.txt
[13:28:00] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Arial_Italic.exp0.000005.gt.txt
[13:28:02] INFO - Rendering using Times New Roman, Bold
  3%|████▊                                                                                                                                                    | 1/32 [00:07<03:52,  7.51s/it]
[13:28:02] INFO - Rendering using Times New Roman, Bold Italic
[13:28:02] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold.exp0.000001.gt.txt
[13:28:02] INFO - Rendering using Times New Roman, Italic
[13:28:02] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold_Italic.exp0.000001.gt.txt
[13:28:02] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Italic.exp0.000001.gt.txt
[13:28:02] INFO - Rendering using Times New Roman,
[13:28:02] INFO - Rendering using Georgia Bold
[13:28:02] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman.exp0.000001.gt.txt
[13:28:02] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold.exp0.000001.gt.txt
[13:28:02] INFO - Rendering using Georgia Italic
[13:28:02] INFO - Rendering using Georgia
[13:28:02] INFO - Rendering using Georgia Bold Italic
[13:28:02] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Italic.exp0.000001.gt.txt
[13:28:02] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold_Italic.exp0.000001.gt.txt
[13:28:02] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia.exp0.000001.gt.txt
[13:28:03] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Italic.exp0.000002.gt.txt
[13:28:03] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold_Italic.exp0.000002.gt.txt
[13:28:03] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold.exp0.000002.gt.txt
[13:28:03] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold.exp0.000002.gt.txt
[13:28:03] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman.exp0.000002.gt.txt
[13:28:03] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold_Italic.exp0.000002.gt.txt
[13:28:03] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Italic.exp0.000002.gt.txt
[13:28:03] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia.exp0.000002.gt.txt
[13:28:05] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold_Italic.exp0.000003.gt.txt
[13:28:05] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold.exp0.000003.gt.txt
[13:28:05] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Italic.exp0.000003.gt.txt
[13:28:05] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold.exp0.000003.gt.txt
[13:28:05] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman.exp0.000003.gt.txt
[13:28:05] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Italic.exp0.000003.gt.txt
[13:28:05] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia.exp0.000003.gt.txt
[13:28:05] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold_Italic.exp0.000003.gt.txt
[13:28:06] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold_Italic.exp0.000004.gt.txt
[13:28:06] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Italic.exp0.000004.gt.txt
[13:28:06] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold.exp0.000004.gt.txt
[13:28:06] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman.exp0.000004.gt.txt
[13:28:06] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold.exp0.000004.gt.txt
[13:28:06] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Italic.exp0.000004.gt.txt
[13:28:06] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia.exp0.000004.gt.txt
[13:28:06] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold_Italic.exp0.000004.gt.txt
[13:28:08] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Italic.exp0.000005.gt.txt
[13:28:08] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold_Italic.exp0.000005.gt.txt
[13:28:08] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman_Bold.exp0.000005.gt.txt
[13:28:08] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Times_New_Roman.exp0.000005.gt.txt
[13:28:08] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold_Italic.exp0.000005.gt.txt
[13:28:08] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Bold.exp0.000005.gt.txt
[13:28:08] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia_Italic.exp0.000005.gt.txt
[13:28:08] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Georgia.exp0.000005.gt.txt
[13:28:09] INFO - Rendering using Trebuchet MS Bold
[13:28:09] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold.exp0.000001.gt.txt
 28%|███████████████████████████████████████████                                                                                                              | 9/32 [00:14<02:07,  5.54s/it]
[13:28:09] INFO - Rendering using Trebuchet MS Bold Italic
[13:28:09] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold_Italic.exp0.000001.gt.txt
[13:28:09] INFO - Rendering using Trebuchet MS Italic
[13:28:09] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Italic.exp0.000001.gt.txt
[13:28:09] INFO - Rendering using Trebuchet MS
[13:28:09] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS.exp0.000001.gt.txt
[13:28:09] INFO - Rendering using Verdana Bold
[13:28:09] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold.exp0.000001.gt.txt
[13:28:09] INFO - Rendering using Verdana Italic
[13:28:09] INFO - Rendering using Verdana
[13:28:09] INFO - Rendering using Verdana Bold Italic
[13:28:09] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Italic.exp0.000001.gt.txt
[13:28:09] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana.exp0.000001.gt.txt
[13:28:09] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold_Italic.exp0.000001.gt.txt
[13:28:11] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold.exp0.000002.gt.txt
[13:28:11] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold_Italic.exp0.000002.gt.txt
[13:28:11] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Italic.exp0.000002.gt.txt
[13:28:11] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS.exp0.000002.gt.txt
[13:28:11] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold.exp0.000002.gt.txt
[13:28:11] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana.exp0.000002.gt.txt
[13:28:11] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Italic.exp0.000002.gt.txt
[13:28:11] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold_Italic.exp0.000002.gt.txt
[13:28:12] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold.exp0.000003.gt.txt
[13:28:12] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Italic.exp0.000003.gt.txt
[13:28:12] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold_Italic.exp0.000003.gt.txt
[13:28:12] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS.exp0.000003.gt.txt
[13:28:12] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold.exp0.000003.gt.txt
[13:28:12] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana.exp0.000003.gt.txt
[13:28:12] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Italic.exp0.000003.gt.txt
[13:28:12] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold_Italic.exp0.000003.gt.txt
[13:28:14] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold_Italic.exp0.000004.gt.txt
[13:28:14] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Italic.exp0.000004.gt.txt
[13:28:14] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold.exp0.000004.gt.txt
[13:28:14] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS.exp0.000004.gt.txt
[13:28:14] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold.exp0.000004.gt.txt
[13:28:14] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana.exp0.000004.gt.txt
[13:28:14] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Italic.exp0.000004.gt.txt
[13:28:14] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold_Italic.exp0.000004.gt.txt
[13:28:15] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Italic.exp0.000005.gt.txt
[13:28:15] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold_Italic.exp0.000005.gt.txt
[13:28:15] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS.exp0.000005.gt.txt
[13:28:15] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Trebuchet_MS_Bold.exp0.000005.gt.txt
[13:28:15] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold.exp0.000005.gt.txt
[13:28:15] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Bold_Italic.exp0.000005.gt.txt
[13:28:15] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana_Italic.exp0.000005.gt.txt
[13:28:15] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Verdana.exp0.000005.gt.txt
[13:28:17] INFO - Rendering using Tex Gyre Bonum Bold
[13:28:17] INFO - Rendering using Tex Gyre Bonum Italic
 53%|████████████████████████████████████████████████████████████████████████████████▊                                                                       | 17/32 [00:22<01:02,  4.16s/it]
[13:28:17] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold.exp0.000001.gt.txt
[13:28:17] INFO - Rendering using Tex Gyre Bonum Bold Italic
[13:28:17] INFO - Rendering using Tex Gyre Schola Bold
[13:28:17] INFO - Rendering using Tex Gyre Schola Italic
[13:28:17] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Italic.exp0.000001.gt.txt
[13:28:17] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold.exp0.000001.gt.txt
[13:28:17] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold_Italic.exp0.000001.gt.txt
[13:28:17] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Italic.exp0.000001.gt.txt
[13:28:17] INFO - Rendering using Tex Gyre Schola Bold Italic
[13:28:17] INFO - Rendering using Tex Gyre Schola Regular
[13:28:17] INFO - Rendering using DejaVu Sans Ultra-Light
[13:28:17] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold_Italic.exp0.000001.gt.txt
[13:28:17] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Regular.exp0.000001.gt.txt
[13:28:17] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.DejaVu_Sans_Ultra-Light.exp0.000001.gt.txt
[13:28:18] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold.exp0.000002.gt.txt
[13:28:18] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold.exp0.000002.gt.txt
[13:28:18] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold_Italic.exp0.000002.gt.txt
[13:28:18] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Italic.exp0.000002.gt.txt
[13:28:18] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Italic.exp0.000002.gt.txt
[13:28:18] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold_Italic.exp0.000002.gt.txt
[13:28:18] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Regular.exp0.000002.gt.txt
[13:28:18] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.DejaVu_Sans_Ultra-Light.exp0.000002.gt.txt
[13:28:20] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold.exp0.000003.gt.txt
[13:28:20] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold.exp0.000003.gt.txt
[13:28:20] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold_Italic.exp0.000003.gt.txt
[13:28:20] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Italic.exp0.000003.gt.txt
[13:28:20] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Italic.exp0.000003.gt.txt
[13:28:20] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Regular.exp0.000003.gt.txt
[13:28:20] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.DejaVu_Sans_Ultra-Light.exp0.000003.gt.txt
[13:28:20] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold_Italic.exp0.000003.gt.txt
[13:28:21] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold.exp0.000004.gt.txt
[13:28:21] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold.exp0.000004.gt.txt
[13:28:21] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold_Italic.exp0.000004.gt.txt
[13:28:21] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Italic.exp0.000004.gt.txt
[13:28:21] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Italic.exp0.000004.gt.txt
[13:28:21] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Regular.exp0.000004.gt.txt
[13:28:21] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold_Italic.exp0.000004.gt.txt
[13:28:21] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.DejaVu_Sans_Ultra-Light.exp0.000004.gt.txt
[13:28:23] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold.exp0.000005.gt.txt
[13:28:23] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Bold_Italic.exp0.000005.gt.txt
[13:28:23] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold.exp0.000005.gt.txt
[13:28:23] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Bonum_Italic.exp0.000005.gt.txt
[13:28:23] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Italic.exp0.000005.gt.txt
[13:28:23] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Regular.exp0.000005.gt.txt
[13:28:23] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.Tex_Gyre_Schola_Bold_Italic.exp0.000005.gt.txt
[13:28:23] INFO - Running text2image on /tmp/eng-2021-02-045l7tek0u/eng.DejaVu_Sans_Ultra-Light.exp0.000005.gt.txt
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:30<00:00,  1.06it/s]
[13:28:24] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[13:28:24] INFO - === Phase E: Generating lstmf files ===
[13:28:24] INFO - Using TESSDATA_PREFIX=/home/ubuntu/tessdata_best
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:44<00:00,  3.57it/s]
[13:29:09] INFO - === Constructing LSTM training data ===
[13:29:09] INFO - Creating new directory data/engINR-ground-truth
[13:29:09] INFO - === Saving box/tiff pairs for training data ===
[13:29:09] INFO - === Moving lstmf files for training data ===
[13:29:10] INFO - All done!
mkdir -p data/engINR
mv -v data/engINR-ground-truth/eng.training_files.txt data/engINR/all-lstmf
renamed 'data/engINR-ground-truth/eng.training_files.txt' -> 'data/engINR/all-lstmf'
mv -v data/engINR-ground-truth/eng/eng.* data/engINR/
renamed 'data/engINR-ground-truth/eng/eng.charset_size=107.txt' -> 'data/engINR/eng.charset_size=107.txt'
renamed 'data/engINR-ground-truth/eng/eng.traineddata' -> 'data/engINR/eng.traineddata'
renamed 'data/engINR-ground-truth/eng/eng.unicharset' -> 'data/engINR/eng.unicharset'
rename "s/eng\./engINR\./g" data/engINR/*.*
Makefile-font2model:149: update target 'data/engINR/list.train' due to: data/engINR/all-lstmf
mkdir -p data/engINR
total=$(wc -l < data/engINR/all-lstmf); \
  train=$(echo "$total * 0.90 / 1" | bc); \
  test "$train" = "0" && \
    echo "Error: missing ground truth for training" && exit 1; \
  eval=$(echo "$total - $train" | bc); \
  test "$eval" = "0" && \
    echo "Error: missing ground truth for evaluation" && exit 1; \
  set -x; \
  head -n "$train" data/engINR/all-lstmf > "data/engINR/list.train"; \
  tail -n "$eval" data/engINR/all-lstmf > "data/engINR/list.eval"
+ head -n 143 data/engINR/all-lstmf
+ tail -n 16 data/engINR/all-lstmf
Makefile-font2model:173: update target 'data/engINR/checkpoints/engINR_checkpoint' due to: proto_model lists
mkdir -p data/eng
combine_tessdata -e /home/ubuntu/tessdata_best/eng.traineddata  data/eng/engINR.lstm
Extracting tessdata components from /home/ubuntu/tessdata_best/eng.traineddata
Wrote data/eng/engINR.lstm
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
mkdir -p data/engINR/checkpoints
lstmtraining \
  --continue_from data/eng/engINR.lstm --old_traineddata /home/ubuntu/tessdata_best/eng.traineddata  \
  --traineddata data/engINR/engINR.traineddata \
  --train_listfile data/engINR/list.train \
  --eval_listfile data/engINR/list.eval \
  --max_iterations -100 \
  --debug_interval -1 \
  --learning_rate 0.0001 \
  --target_error_rate 0.01 \
  --model_output data/engINR/checkpoints/engINR \
  > data/engINR.log 2>&1
^CMakefile-font2model:173: recipe for target 'data/engINR/checkpoints/engINR_checkpoint' failed
make: *** [data/engINR/checkpoints/engINR_checkpoint] Interrupt
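
The list.train/list.eval step in the log above can be sketched as a small standalone shell function. This is only an illustration, not the recipe from Makefile-font2model: the function name `split_lstmf_list` and its arguments are hypothetical, and it uses integer shell arithmetic in place of `bc`. With 159 entries in all-lstmf it reproduces the 143/16 split shown by the `head` and `tail` calls.

```shell
# Hypothetical helper (not part of tesstrain): split a list of .lstmf paths
# into a training list and an evaluation list.
# Usage: split_lstmf_list LIST_FILE OUT_DIR [TRAIN_PERCENT]
split_lstmf_list() {
    list_file=$1
    out_dir=$2
    percent=${3:-90}    # assumed default, matching the 0.90 ratio in the log

    # Some wc implementations pad the count with spaces, so strip them.
    total=$(wc -l < "$list_file" | tr -d ' ')
    train=$((total * percent / 100))    # integer division truncates, like "/ 1" in bc
    eval_count=$((total - train))

    if [ "$train" -eq 0 ]; then
        echo "Error: missing ground truth for training" >&2
        return 1
    fi
    if [ "$eval_count" -eq 0 ]; then
        echo "Error: missing ground truth for evaluation" >&2
        return 1
    fi

    mkdir -p "$out_dir"
    head -n "$train" "$list_file" > "$out_dir/list.train"
    tail -n "$eval_count" "$list_file" > "$out_dir/list.eval"
    echo "train=$train eval=$eval_count"
}
```

For example, `split_lstmf_list data/engINR/all-lstmf data/engINR` would write both lists into data/engINR and report the two counts.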

Shreeshrii commented 3 years ago

Also ask someone else, because I'm not really familiar with training tools scripts.

@egorpugin Thanks for the quick review of src/training/tesstrain_utils.py. I thought you would be the most familiar with it, since you converted it from bash to Python.

egorpugin commented 3 years ago

It was not me. :)

binarymachine-91 commented 3 years ago

I tried running engINR.sh. The first leg completed, but plot.sh failed when I ran it. I noticed that PKG_CONFIG_PATH did not appear at the end of my output. I also noticed that data/engINR/checkpoints/engINR_0.03_290_2400.checkpoint was not created. Looking at data/engINR.log, I see a 100% error rate. Can you tell me what I am doing wrong? I am attaching the log of the run and the engINR.log file. Thank you. engINR.log pbINR.log

Shreeshrii commented 3 years ago

@vigneshg10 Which version of tesseract are you using? I use the latest code from the master branch, which supports a negative value for max_iterations and treats it as a number of epochs. It looks like you are using an older version.

Delete EPOCHS=100 from the bash script - it should use default MAX_ITERATIONS of 10000.
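
As a rough sketch of the behaviour described above, the value passed to `lstmtraining --max_iterations` could be chosen like this. The function name `max_iterations_arg` is hypothetical, and the mapping is only an assumption based on this comment and on the `--max_iterations -100` line in the log; it is not copied from the Makefile.

```shell
# Hypothetical helper: pick the value for lstmtraining --max_iterations.
# $1 = EPOCHS (may be empty), $2 = MAX_ITERATIONS (optional, assumed default 10000)
max_iterations_arg() {
    epochs=$1
    default_iterations=${2:-10000}
    if [ -n "$epochs" ]; then
        # A negative value is read as a number of epochs, but only by
        # lstmtraining builds recent enough to support it.
        echo "-$epochs"
    else
        # No EPOCHS set: fall back to a plain iteration count.
        echo "$default_iterations"
    fi
}
```

With `EPOCHS=100` this yields `-100`; with EPOCHS unset it yields `10000`, which is the safe choice for an older lstmtraining.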

Shreeshrii commented 3 years ago

data/engINR.log should look like this:

Loaded file data/eng/engINR.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 107!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc107:107, 54891
Total weights = 1458955
Previous null char=110 mapped to 106
Continuing from data/eng/engINR.lstm
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000014.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000021.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000027.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000043.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000008.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000047.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000053.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000025.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000049.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000035.lstmf
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000069.lstmf
Iteration 0: GROUND  TRUTH : MARK Download Click Wii continuing. Silverlight Japan, OF shot. the Reef
File data/engINR-ground-truth/eng.Arial.exp0.000014.lstmf line 0 (Perfect):
Mean rms=0.155%, delta=0%, train=0%(0%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000046.lstmf
Iteration 1: GROUND  TRUTH : © University # THAT reformat First AGAINST Eaton YOU
File data/engINR-ground-truth/eng.Arial.exp0.000021.lstmf line 0 (Perfect):
Mean rms=0.144%, delta=0%, train=0%(0%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000099.lstmf
Iteration 2: GROUND  TRUTH : Broadstairs Shaving CURRENT [Copyright] half, CONVENIENT PRODUCTION
File data/engINR-ground-truth/eng.Arial.exp0.000027.lstmf line 0 (Perfect):
Mean rms=0.16%, delta=0%, train=0%(0%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000090.lstmf
Iteration 3: GROUND  TRUTH : PERTURBATIVE METHODS Publications from DRINK 'Ware PhotoFx
Iteration 3: BEST OCR TEXT : PERTURBATIVE METHODS Pubilications from DRINK 'Ware PhotoFx
File data/engINR-ground-truth/eng.Arial.exp0.000043.lstmf line 0 :
Mean rms=0.242%, delta=0.129%, train=0.431%(3.571%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000093.lstmf
Iteration 4: GROUND  TRUTH : Member 10% WERE Why several Key ON Vite FAVORITE producer, SYSTEMS}, Sight
File data/engINR-ground-truth/eng.Arial.exp0.000008.lstmf line 0 (Perfect):
Mean rms=0.22%, delta=0.103%, train=0.345%(2.857%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000015.lstmf
Iteration 5: GROUND  TRUTH : ACOUSTIC Gallery COMMENT known 2004 Not. VIOLENT, Kingdom 36.8 TANK MAKE:
File data/engINR-ground-truth/eng.Arial.exp0.000053.lstmf line 0 (Perfect):
Mean rms=0.204%, delta=0.086%, train=0.287%(2.381%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000096.lstmf
Iteration 6: GROUND  TRUTH : processors 0" PURIFIED Giants. § DIAGNOSTIC ~ PATENT DANSK FAX Very
File data/engINR-ground-truth/eng.Arial.exp0.000025.lstmf line 0 (Perfect):
Mean rms=0.214%, delta=0.074%, train=0.246%(2.041%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/engINR-ground-truth/eng.Arial.exp0.000071.lstmf
Iteration 7: GROUND  TRUTH : United Live * | TOWN, HAVE Vitamin videos 4 § with Values GUARD
File data/engINR-ground-truth/eng.Arial.exp0.000047.lstmf line 0 (Perfect):
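
To check quickly whether a run resembles the log above or is stuck at 100% error, the most recent `train=` figures can be pulled out of data/engINR.log. This is a minimal sketch: the function name `last_train_error` is hypothetical, and reading the two numbers as character error followed by word error in parentheses is an assumption about the lstmtraining output format.

```shell
# Hypothetical helper: print the two percentages from the last "train=" field
# in an lstmtraining log, e.g. "train=0.246%(2.041%)" becomes "0.246 2.041".
last_train_error() {
    log_file=$1
    # Keep only lines carrying a train= field, take the most recent one,
    # then extract the number before and the number inside the parentheses.
    grep 'train=' "$log_file" | tail -n 1 |
        sed -n 's/.*train=\([0-9.]*\)%(\([0-9.]*\)%).*/\1 \2/p'
}
```

Running `last_train_error data/engINR.log` during or after training prints the two rates on one line, which can then be compared between runs.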

binarymachine-91 commented 3 years ago

After deleting EPOCHS=100, engINR.sh ran to completion with 10000 iterations, but plot.sh still does not go through. I am attaching the log file pbplot.log and the tesseract version file. Kindly let me know what I am doing wrong. Thank you. pbplot.log tesver.txt

Shreeshrii commented 3 years ago

Look in the data/engINR/plot directory.

binarymachine-91 commented 3 years ago

These are the files under the plot directory (I added a .txt extension so that they could be uploaded): plotdir.txt plot-eval-validate-cer.py.txt Makefile.txt

binarymachine-91 commented 3 years ago

Sorry. Looked at the wrong directory. Here are the files.

engINR-eval-cer.png.txt engINR-eval-cer.tsv.txt

binarymachine-91 commented 3 years ago

I have now installed the latest tesseract release:

tesseract 5.0.0-alpha-20201231
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX
 Found FMA
 Found SSE
 Found OpenMP 201511
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

I then went back to the original engINR.sh. At the final step, lstmtraining does not accept a max_iterations value of -10, even though the lstmtraining version is also 5.0.0-alpha-20201231. If I remove EPOCHS=10, it works. Kindly let me know why it fails and how to proceed.

Shreeshrii commented 3 years ago

I build directly from the master branch.

The change that adds epochs support was made 26 days ago:

https://github.com/tesseract-ocr/tesseract/search?q=epochs&type=

You can clone the repo and build using that.

binarymachine-91 commented 3 years ago

Thank you. I rebuilt lstmtraining from the repository you linked, and it now works with EPOCHS.

bertsky commented 3 years ago

@Shreeshrii could you please enlighten me as to why this was closed, and what's the relationship in general between make/lstmtraining for real GT and tesstrain.py for synthetic training? (I find it hard to relate the current status to the original training tutorial...)