tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

make training problem with example dataset #47

Closed: ofnanezn closed this issue 5 years ago

ofnanezn commented 5 years ago

Hi,

I ran into a problem when trying to run the `make training` example shown in the README. I have installed tesseract and leptonica, but when I execute the training command, I get the following output:

find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
unicharset_extractor --output_unicharset "data/unicharset" --norm_mode 1 "data/all-boxes"
Failed to read data from: data/all-boxes
Wrote unicharset file data/unicharset
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "$total * 0.90 / 1" | bc`; \
   head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "($total - $total * 0.90) / 1" | bc`; \
   tail -n "$no" data/all-lstmf > "data/list.eval"
wget -Odata/radical-stroke.txt 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt'
--2018-12-26 17:13:03--  https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt [following]
--2018-12-26 17:13:03--  https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.4.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.4.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330874 (323K) [text/plain]
Saving to: ‘data/radical-stroke.txt’

data/radical-stroke.txt             100%[=================================================================>] 323,12K   835KB/s    in 0,4s    

2018-12-26 17:13:05 (835 KB/s) - ‘data/radical-stroke.txt’ saved [330874/330874]

combine_lang_model \
  --input_unicharset data/unicharset \
  --script_dir data/ \
  --output_dir data/ \
  --lang test_model
Loaded unicharset of size 3 from file data/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Config file is optional, continuing...
Failed to read data from: data//test_model/test_model.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
  --traineddata data/test_model/test_model.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
  --model_output data/checkpoints/test_model \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000
Failed to load list of training filenames from data/list.train
Makefile:144: recipe for target 'data/checkpoints/test_model_checkpoint' failed
make: *** [data/checkpoints/test_model_checkpoint] Error 1

Thanks.

kba commented 5 years ago

Is there training data in data/ground-truth? The error is probably that data/list.train is empty. Can you share your training data?
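A quick way to check both points before re-running `make training` is sketched below. The helper names `count_files` and `check_training_inputs` are made up for illustration and are not part of tesstrain; the sketch only assumes the paths that appear in the log above (`data/ground-truth`, `data/list.train`) and the `.box` / `.lstmf` extensions the `find` commands look for.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesstrain): count files with a given
# extension under a directory. Prints 0 if the directory does not exist.
count_files() {
    # $1 = directory, $2 = extension without the dot
    find "$1" -type f -name "*.$2" 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical helper: report whether the training inputs look usable.
# Returns non-zero if no .box/.lstmf files are found or the list file
# is missing or empty.
check_training_inputs() {
    gt_dir=$1      # e.g. data/ground-truth
    list_file=$2   # e.g. data/list.train
    status=0
    for ext in box lstmf; do
        n=$(count_files "$gt_dir" "$ext")
        echo "$gt_dir: $n .$ext file(s)"
        [ "$n" -gt 0 ] || status=1
    done
    if [ -s "$list_file" ]; then
        echo "$list_file: $(wc -l < "$list_file" | tr -d ' ') entries"
    else
        echo "$list_file: missing or empty"
        status=1
    fi
    return "$status"
}
```

With the functions loaded in a shell, `check_training_inputs data/ground-truth data/list.train` prints the counts and returns a non-zero status if anything is missing.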

ofnanezn commented 5 years ago

Oh, sorry, I thought I had to put the data in a new folder called data/train. I moved the training data to data/ground-truth and now it is working.

Thanks.

kba commented 5 years ago

Glad you solved it. The README is out of date; I will update it asap (#39). Thanks for reporting.
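For readers who hit the same message: the log in the original report is consistent with kba's guess that `data/list.train` was empty. The recipe computes the train/eval split from the line count of `data/all-lstmf`, so an empty `data/ground-truth` gives a count of zero for both lists. The sketch below re-expresses that arithmetic with shell integers instead of `bc`. `split_counts` and its percentage parameter are hypothetical names used only for illustration, not something the Makefile defines.

```shell
#!/bin/sh
# Hypothetical helper: integer version of the split arithmetic seen in the
# log (the recipe itself pipes "$total * 0.90 / 1" through bc).
# Prints "<train lines> <eval lines>".
split_counts() {
    total=$1        # number of lines in data/all-lstmf
    pct=${2:-90}    # training share in percent (assumed parameter)
    train=$(( total * pct / 100 ))
    eval_n=$(( (total * 100 - total * pct) / 100 ))
    echo "$train $eval_n"
}

split_counts 10   # prints "9 1"
split_counts 0    # prints "0 0": head -n 0 then writes an empty list.train
```

Both divisions truncate, so a very small dataset can also end up with an empty `data/list.eval` (for example, 7 files give `6 0`).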