tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Specifying output dir in training causes crash due to missing trainingdata #153

Closed · hollowgalaxy closed 4 years ago

hollowgalaxy commented 4 years ago

This works:

cd /tesstrain && make training MODEL_NAME=$model LANG_TYPE=$lang GROUND_TRUTH_DIR=$gt_dir MAX_ITERATIONS=10000 PSM=$PSM.SINGLE_BLOCK OEM=$OEM.LSTM_ONLY

This crashes. Note that the only change is specifying OUTPUT_DIR:

cd /tesstrain && make training OUTPUT_DIR=$out_dir MODEL_NAME=$model LANG_TYPE=$lang GROUND_TRUTH_DIR=$gt_dir MAX_ITERATIONS=10000 PSM=$PSM.SINGLE_BLOCK OEM=$OEM.LSTM_ONLY

hollowgalaxy commented 4 years ago

Log of error:

```
  --input_unicharset /tesstrain/data/s10-output-dir/unicharset \
  --script_dir data \
  --numbers /tesstrain/data/s10-output-dir/s10.numbers \
  --puncs /tesstrain/data/s10-output-dir/s10.punc \
  --words /tesstrain/data/s10-output-dir/s10.wordlist \
  --output_dir data \
   \
  --lang s10
Failed to read data from: /tesstrain/data/s10-output-dir/s10.wordlist
Failed to read data from: /tesstrain/data/s10-output-dir/s10.punc
Failed to read data from: /tesstrain/data/s10-output-dir/s10.numbers
Loaded unicharset of size 4 from file /tesstrain/data/s10-output-dir/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data/Latin.unicharset
Warning: properties incomplete for index 3 = 0
Config file is optional, continuing...
Failed to read data from: data/s10/s10.config
Null char=2
lstmtraining \
  --debug_interval 0 \
  --traineddata /tesstrain/data/s10-output-dir/s10.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 /tesstrain/data/s10-output-dir/unicharset`]" \
  --model_output /tesstrain/data/s10-output-dir/checkpoints/s10 \
  --learning_rate 20e-4 \
  --train_listfile /tesstrain/data/s10-output-dir/list.train \
  --eval_listfile /tesstrain/data/s10-output-dir/list.eval \
  --max_iterations 10000
mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110
Makefile:266: recipe for target '/tesstrain/data/s10-output-dir/checkpoints/s10_checkpoint' failed
make: *** [/tesstrain/data/s10-output-dir/checkpoints/s10_checkpoint] Illegal instruction (core dumped)
```

hollowgalaxy commented 4 years ago

This error is similar to https://github.com/tesseract-ocr/tesseract/issues/1075, which indicates that the traineddata file needs to be in the output dir prior to training.
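
To confirm that diagnosis before a long run, a pre-flight check along these lines might help. This is only a sketch: `check_proto_model` is a hypothetical helper, not part of tesstrain, and the path it tests is simply the one passed to `lstmtraining --traineddata` in the log above.

```shell
# Hypothetical helper (not part of tesstrain): succeed only if a non-empty
# proto model exists at <OUTPUT_DIR>/<MODEL_NAME>.traineddata, the path that
# lstmtraining receives via --traineddata in the log above.
check_proto_model() {
    output_dir=$1
    model_name=$2
    proto="$output_dir/$model_name.traineddata"
    if [ -s "$proto" ]; then
        echo "found: $proto"
    else
        echo "missing: $proto" >&2
        return 1
    fi
}
```

It could be chained in front of the failing command, e.g. `check_proto_model "$out_dir" "$model" && make training OUTPUT_DIR=$out_dir ...`, so that a missing file is reported up front instead of surfacing as the assert in lstmtrainer.h.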

songzy12 commented 4 years ago

I have recently been using this tool to train an OCR model from the ground up and ran into the same issue.

The cause is a directory mismatch for the proto model between make proto-model and make training:

For proto-model

https://github.com/tesseract-ocr/tesstrain/blob/e9b375f9b3d293e51a0cfdb8e276e5f22af8d1e8/Makefile#L236-L248

Note that the --output_dir data option here writes the generated $MODEL_NAME.traineddata under the ./data dir.

For training

https://github.com/tesseract-ocr/tesstrain/blob/e9b375f9b3d293e51a0cfdb8e276e5f22af8d1e8/Makefile#L35

https://github.com/tesseract-ocr/tesstrain/blob/e9b375f9b3d293e51a0cfdb8e276e5f22af8d1e8/Makefile#L282-L287

Note that the --traineddata $(PROTO_MODEL) option here tries to read $MODEL_NAME.traineddata from $OUTPUT_DIR.

However, the proto model was generated under the ./data dir, as discussed above, so training cannot find it and fails with the error.
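
The mismatch can be spelled out with the values from the failing run. This is a rough sketch only: both helper functions are hypothetical, and the write location assumes that combine_lang_model places its result at `<output_dir>/<lang>/<lang>.traineddata`, which this thread does not confirm.

```shell
# Hypothetical helpers illustrating the two paths; not part of tesstrain.

# Path lstmtraining reads via --traineddata (matches the log above).
proto_expected_path() {   # args: OUTPUT_DIR MODEL_NAME
    printf '%s/%s.traineddata\n' "$1" "$2"
}

# Path the proto-model step writes to, ASSUMING combine_lang_model places its
# result at <output_dir>/<lang>/<lang>.traineddata.
proto_written_path() {    # args: value of --output_dir, MODEL_NAME
    printf '%s/%s/%s.traineddata\n' "$1" "$2" "$2"
}

# Values from the failing run: OUTPUT_DIR is overridden, while --output_dir
# is still the relative dir "data" (resolved from /tesstrain).
expected=$(proto_expected_path /tesstrain/data/s10-output-dir s10)
written=$(proto_written_path data s10)
if [ "$expected" != "$written" ]; then
    echo "mismatch: training reads $expected but proto-model wrote $written"
fi
```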

songzy12 commented 4 years ago

One possible fix is to introduce another variable called DATA_DIR, and then:

  1. set OUTPUT_DIR as $DATA_DIR/$MODEL_NAME
  2. set --output_dir $DATA_DIR in make proto-model
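
Roughly, the two steps would make both paths agree. The sketch below uses plain shell variables rather than the real Makefile; the default values are hypothetical, and it again assumes that combine_lang_model writes to `<output_dir>/<lang>/<lang>.traineddata`.

```shell
# Sketch of the proposed layout. DATA_DIR is the new variable from the
# proposal; the default values here are hypothetical.
DATA_DIR=${DATA_DIR:-data}
MODEL_NAME=${MODEL_NAME:-s10}

# 1. OUTPUT_DIR is derived from DATA_DIR instead of being set independently.
OUTPUT_DIR=$DATA_DIR/$MODEL_NAME

# 2. The proto-model step gets --output_dir "$DATA_DIR". ASSUMING
#    combine_lang_model writes <output_dir>/<lang>/<lang>.traineddata,
#    the generated file lands here:
written=$DATA_DIR/$MODEL_NAME/$MODEL_NAME.traineddata

# Training reads the proto model from OUTPUT_DIR.
expected=$OUTPUT_DIR/$MODEL_NAME.traineddata

[ "$written" = "$expected" ] && echo "paths agree: $expected"
```

The trade-off is that OUTPUT_DIR can no longer point at an arbitrary directory; callers would choose DATA_DIR instead.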

Would you like me to submit a pull request for this?

kba commented 4 years ago

> One possible fix is to introduce another environment variable called DATA_DIR, and then
>
>   1. set OUTPUT_DIR as $DATA_DIR/$MODEL_NAME
>   2. set --output_dir $DATA_DIR in make proto-model
>
> Would you like me to submit a pull request for this?

Sounds reasonable AFAICT :+1: A PR would help to make sure there are no undesired side-effects.

songzy12 commented 4 years ago

Ack. I will send the PR later. :)

songzy12 commented 4 years ago

@kba Hi Konstantin, any comments on the current PR? :)