Closed hollowgalaxy closed 4 years ago
Log of error:

```
--input_unicharset /tesstrain/data/s10-output-dir/unicharset \
--script_dir data \
--numbers /tesstrain/data/s10-output-dir/s10.numbers \
--puncs /tesstrain/data/s10-output-dir/s10.punc \
--words /tesstrain/data/s10-output-dir/s10.wordlist \
--output_dir data \
\
--lang s10
Failed to read data from: /tesstrain/data/s10-output-dir/s10.wordlist
Failed to read data from: /tesstrain/data/s10-output-dir/s10.punc
Failed to read data from: /tesstrain/data/s10-output-dir/s10.numbers
Loaded unicharset of size 4 from file /tesstrain/data/s10-output-dir/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data/Latin.unicharset
Warning: properties incomplete for index 3 = 0
Config file is optional, continuing...
Failed to read data from: data/s10/s10.config
Null char=2
lstmtraining \
--debug_interval 0 \
--traineddata /tesstrain/data/s10-output-dir/s10.traineddata \
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 /tesstrain/data/s10-output-dir/unicharset`]" \
--model_output /tesstrain/data/s10-output-dir/checkpoints/s10 \
--learning_rate 20e-4 \
--train_listfile /tesstrain/data/s10-output-dir/list.train \
--eval_listfile /tesstrain/data/s10-output-dir/list.eval \
--max_iterations 10000
mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110
Makefile:266: recipe for target '/tesstrain/data/s10-output-dir/checkpoints/s10_checkpoint' failed
make: *** [/tesstrain/data/s10-output-dir/checkpoints/s10_checkpoint] Illegal instruction (core dumped)```
This error is similar to https://github.com/tesseract-ocr/tesseract/issues/1075, which indicates that the traineddata file needs to be in the output dir prior to training.
I recently used this tool to train an OCR model from the ground up and ran into the same issue.

The cause is a directory mismatch for the proto model between `make proto-model` and `make training`:

Note that the `--output_dir data` option here writes the generated `$MODEL_NAME.traineddata` under the `./data` dir.

https://github.com/tesseract-ocr/tesstrain/blob/e9b375f9b3d293e51a0cfdb8e276e5f22af8d1e8/Makefile#L35
https://github.com/tesseract-ocr/tesstrain/blob/e9b375f9b3d293e51a0cfdb8e276e5f22af8d1e8/Makefile#L282-L287

Note that the `--traineddata $(PROTO_MODEL)` option here tries to read `$MODEL_NAME.traineddata` from `$OUTPUT_DIR`.

However, the proto model was generated under the `./data` dir, as discussed above, which leads to the error.
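To confirm the mismatch on a given setup, a small check along the following lines may help. This is only a sketch: `check_proto_model` is a hypothetical helper, not part of tesstrain, and it assumes that `combine_lang_model` writes to `<output_dir>/<lang>/<lang>.traineddata`, which the `data/s10/s10.config` path in the log above suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesstrain): report whether the proto model
# sits where `make training` will look for it.
# Assumption: combine_lang_model writes <proto_dir>/<model>/<model>.traineddata.
check_proto_model() {
    model_name=$1
    output_dir=$2   # the Makefile's OUTPUT_DIR, read by lstmtraining
    proto_dir=$3    # the value passed to --output_dir in make proto-model

    expected="$output_dir/$model_name.traineddata"
    generated="$proto_dir/$model_name/$model_name.traineddata"

    if [ -f "$expected" ]; then
        echo "ok: $expected"
        return 0
    fi
    if [ -f "$generated" ]; then
        echo "mismatch: found $generated, but training reads $expected"
        return 1
    fi
    echo "missing: no $model_name.traineddata in either location"
    return 2
}
```

For the run in this issue, the call would be `check_proto_model s10 /tesstrain/data/s10-output-dir data`.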
One possible fix is to introduce another environment variable called `DATA_DIR`, and then

- set `OUTPUT_DIR` as `$DATA_DIR/$MODEL_NAME`
- set `--output_dir $DATA_DIR` in `make proto-model`

Would you like me to submit a pull request for this?
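The proposal above can be sketched as plain shell variable derivation. This is an illustration, not the actual Makefile change: `DATA_DIR` is the proposed new variable, the default values are placeholders, and `GENERATED` assumes that `combine_lang_model` writes to `<output_dir>/<lang>/<lang>.traineddata`.

```shell
#!/bin/sh
# DATA_DIR is the proposed (not yet existing) variable; defaults are illustrative.
DATA_DIR=${DATA_DIR:-data}
MODEL_NAME=${MODEL_NAME:-s10}

# OUTPUT_DIR is derived from DATA_DIR instead of being set independently.
OUTPUT_DIR="$DATA_DIR/$MODEL_NAME"

# Path that make training would pass to lstmtraining --traineddata.
PROTO_MODEL="$OUTPUT_DIR/$MODEL_NAME.traineddata"

# Path that make proto-model would produce with --output_dir "$DATA_DIR",
# assuming combine_lang_model writes <output_dir>/<lang>/<lang>.traineddata.
GENERATED="$DATA_DIR/$MODEL_NAME/$MODEL_NAME.traineddata"

if [ "$PROTO_MODEL" = "$GENERATED" ]; then
    echo "paths agree: $PROTO_MODEL"
fi
```

Because both paths are built from the same two variables, the writer and the reader can no longer drift apart when only one of them is overridden.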
Sounds reasonable AFAICT :+1: A PR would help to make sure there are no undesired side-effects.
Ack. I will send the PR later. :)
@kba hi Konstantin, any comments about the current PR? :)
`cd /tesstrain && make training MODEL_NAME=$model LANG_TYPE=$lang GROUND_TRUTH_DIR=$gt_dir MAX_ITERATIONS=10000 PSM=$PSM.SINGLE_BLOCK OEM=$OEM.LSTM_ONLY`

This works.

`cd /tesstrain && make training OUTPUT_DIR=$out_dir MODEL_NAME=$model LANG_TYPE=$lang GROUND_TRUTH_DIR=$gt_dir MAX_ITERATIONS=10000 PSM=$PSM.SINGLE_BLOCK OEM=$OEM.LSTM_ONLY`

This crashes. Note that the only change is specifying `OUTPUT_DIR`.