tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Failed to load list of eval filenames from data/list.eval / Segmentation fault #40

Closed aaronk6 closed 5 years ago

aaronk6 commented 5 years ago

Hi guys,

I’m pretty new to this, so please forgive me if I’m missing something obvious.

I’ve posted to the mailing list because Tesseract sometimes confuses the digit 4 with a 9 in the material I’m currently processing. Someone over there pointed me to your project.

So what I would like to do now is fine-tune the Latin script model to fix the recognition errors I’m seeing. If I understand this correctly, I’ll need to go to data/ground-truth and create files there for Tesseract to learn from, e.g.:

April_2014.gt.txt

April 2014

April_2014.tif

april_2014
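
A quick way to check that pairing is sketched below. It assumes each line image is a `.tif` whose transcription sits next to it in a `.gt.txt` file with the same base name, as in the listing above. `missing_gt` is a hypothetical helper name, not something tesstrain provides.

```shell
#!/bin/sh
# Hypothetical helper: prints every .tif in a directory that has no
# matching .gt.txt transcription next to it.
missing_gt() {
    for img in "$1"/*.tif; do
        [ -e "$img" ] || continue      # glob matched nothing
        gt="${img%.tif}.gt.txt"        # April_2014.tif -> April_2014.gt.txt
        [ -f "$gt" ] || echo "$img"
    done
}

missing_gt "data/ground-truth"
```

If it prints nothing, every image in the directory has a transcription file.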

Also, I’ve cloned https://github.com/tesseract-ocr/tessdata.

What else do I need to do? The reason I’m asking is because I get an error when executing the following:

$ make -j4 training START_MODEL=Latin TESSDATA=/home/vagrant/tessdata/script
python generate_line_box.py -i "data/ground-truth/April_2012.tif" -t "data/ground-truth/April_2012.gt.txt" > "data/ground-truth/April_2012.box"
python generate_line_box.py -i "data/ground-truth/April_2013.tif" -t "data/ground-truth/April_2013.gt.txt" > "data/ground-truth/April_2013.box"
python generate_line_box.py -i "data/ground-truth/April_2014.tif" -t "data/ground-truth/April_2014.gt.txt" > "data/ground-truth/April_2014.box"
wget -Odata/radical-stroke.txt 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt'
--2018-12-11 00:00:10--  https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... tesseract data/ground-truth/April_2012.tif data/ground-truth/April_2012 --psm 6 lstm.train
tesseract data/ground-truth/April_2014.tif data/ground-truth/April_2014 --psm 6 lstm.train
find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
connected.
tesseract data/ground-truth/April_2013.tif data/ground-truth/April_2013 --psm 6 lstm.train
HTTP request sent, awaiting response... Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
mkdir -p data/Latin
combine_tessdata -u /home/vagrant/tessdata/script/Latin.traineddata  data/Latin/Latin
302 Found
Location: https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt [following]
--2018-12-11 00:00:10--  https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330874 (323K) [text/plain]
Saving to: ‘data/radical-stroke.txt’

data/radical-stroke.txt                                        0%[                                                                                                                                               ]       0  --.-KB/s               Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Extracting tessdata components from /home/vagrant/tessdata/script/Latin.traineddata
Wrote data/Latin/Latin.lstm
Wrote data/Latin/Latin.lstm-punc-dawg
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
data/radical-stroke.txt                                      100%[==============================================================================================================================================>] 323.12K  --.-KB/s    in 0.1s

2018-12-11 00:00:10 (2.40 MB/s) - ‘data/radical-stroke.txt’ saved [330874/330874]

total=`cat data/all-lstmf | wc -l` \
   no=`echo "$total * 0.90 / 1" | bc`; \
   head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "($total - $total * 0.90) / 1" | bc`; \
   tail -n "$no" data/all-lstmf > "data/list.eval"
Wrote data/Latin/Latin.lstm-word-dawg
Wrote data/Latin/Latin.lstm-number-dawg
Wrote data/Latin/Latin.lstm-unicharset
Wrote data/Latin/Latin.lstm-recoder
Wrote data/Latin/Latin.version
Version string:4.00.00alpha:Latin:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=1587099, offset=192
18:lstm-punc-dawg:size=5954, offset=1587291
19:lstm-word-dawg:size=88816882, offset=1593245
20:lstm-number-dawg:size=86050, offset=90410127
21:lstm-unicharset:size=18023, offset=90496177
22:lstm-recoder:size=2735, offset=90514200
23:version:size=82, offset=90516935
unicharset_extractor --output_unicharset "data/ground-truth/my.unicharset" --norm_mode 2 "data/all-boxes"
Extracting unicharset from box file data/all-boxes
Other case a of A is not in unicharset
Other case P of p is not in unicharset
Other case R of r is not in unicharset
Other case I of i is not in unicharset
Other case L of l is not in unicharset
Wrote unicharset file data/ground-truth/my.unicharset
merge_unicharsets data/Latin/Latin.lstm-unicharset data/ground-truth/my.unicharset  "data/unicharset"
Loaded unicharset of size 303 from file data/Latin/Latin.lstm-unicharset
Loaded unicharset of size 13 from file data/ground-truth/my.unicharset
Wrote unicharset file data/unicharset.
combine_lang_model \
  --input_unicharset data/unicharset \
  --script_dir data/ \
  --output_dir data/ \
  --lang foo
Loaded unicharset of size 303 from file data/unicharset
Setting unichar properties
Other case Ẹ̀ of ẹ̀ is not in unicharset
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Warning: properties incomplete for index 3 = K
#
# ... truncated ...
#
Warning: properties incomplete for index 302 = ẹ̀
Config file is optional, continuing...
Failed to read data from: data//foo/foo.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
  --traineddata data/foo/foo.traineddata \
          --old_traineddata /home/vagrant/tessdata/script/Latin.traineddata \
  --continue_from data/Latin/Latin.lstm \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
  --model_output data/checkpoints/foo \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000
Loaded file data/Latin/Latin.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 302 to 302!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc302:302, 0
Total weights = 1404064
Previous null char=301 mapped to 301
Continuing from data/Latin/Latin.lstm
Loaded 1/1 pages (1-1) of document data/ground-truth/April_2014.lstmf
Failed to load list of eval filenames from data/list.eval
Failed to load eval data from: data/list.eval
Makefile:131: recipe for target 'data/checkpoints/foo_checkpoint' failed
make: *** [data/checkpoints/foo_checkpoint] Segmentation fault (core dumped)

Looks like it wants a data/list.eval file which isn’t there. Is this why it’s crashing?

I’m running this on Ubuntu 16.04.

Thank you!

wrznr commented 5 years ago

Your GT set is not big enough. We are dividing the GT data into training (90 %) and evaluation data (10 %). Having just three lines of GT leaves the evaluation set empty.
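
To make the arithmetic concrete, here is a minimal sketch of the split shown in the log above. It assumes plain shell integer arithmetic as a stand-in for the two `bc` expressions (both truncate the result), and `split_counts` is a hypothetical helper name, not part of the Makefile.

```shell
#!/bin/sh
# Hypothetical helper: prints "<train> <eval>" for a given number of
# ground-truth lines, mirroring the two bc expressions in the log:
#   train = total * 0.90 / 1             (truncated)
#   eval  = (total - total * 0.90) / 1   (truncated)
split_counts() {
    total=$1
    train=$(( total * 90 / 100 ))
    eval_n=$(( total * 10 / 100 ))
    echo "$train $eval_n"
}

split_counts 3    # three lines, as in this issue
split_counts 10   # smallest set that leaves one evaluation line
```

With three lines this prints `2 0`: two lines for training and none for evaluation, so data/list.eval ends up empty. Under this formula the evaluation list only gets its first entry once there are ten lines.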

wrznr commented 5 years ago

Will be handled via https://github.com/OCR-D/ocrd-train/issues/42

aaronk6 commented 5 years ago

Hi @wrznr, thanks for looking into this. I actually thought it would be smart to start with a small set to see if the process is working before feeding it with a bigger set, but apparently that wasn’t the case 🙂

lokesh-stack commented 5 years ago

Can someone help with this error?

find data/ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/foo/all-lstmf"
mkdir -p data/foo
total=$(wc -l < data/foo/all-lstmf); \
   train=$(echo "$total * 0.90 / 1" | bc); \
   test "$train" = "0" && \
     echo "Error: missing ground truth for training" && exit 1; \
   eval=$(echo "$total - $train" | bc); \
   test "$eval" = "0" && \
     echo "Error: missing ground truth for evaluation" && exit 1; \
   head -n "$train" data/foo/all-lstmf > "data/foo/list.train"; \
   tail -n "$eval" data/foo/all-lstmf > "data/foo/list.eval"
Error: missing ground truth for training
Makefile:106: recipe for target 'data/foo/list.train' failed
make: *** [data/foo/list.train] Error 1

wrznr commented 5 years ago

The Makefile did not find any files for training. Can you check whether the directory data/ground-truth contains files with the suffix .lstmf?
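
One way to run that check is sketched below. It assumes the ground truth lives in data/ground-truth (adjust `gt_dir` if yours is elsewhere). `count_lstmf` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: counts the .lstmf files below a directory.
count_lstmf() {
    find "$1" -type f -name '*.lstmf' | wc -l | tr -d ' '
}

gt_dir="data/ground-truth"   # assumption: your ground-truth directory
if [ -d "$gt_dir" ]; then
    echo "$(count_lstmf "$gt_dir") .lstmf file(s) in $gt_dir"
else
    echo "$gt_dir does not exist"
fi
```

A count of zero means the list of training files will be empty, which matches the "missing ground truth for training" message.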

royudev commented 4 years ago

@wrznr I have the same error as lokesh-stack.

I took the images and ground-truth text from ocrd-testset.zip and put them in data/foo-ground-truth. I then ran make training, and it failed with "Error: missing ground truth for training".

iknoorjobs commented 4 years ago

@wrznr @royudev Yes, even the sample dataset ocrd-testset.zip is failing and showing this error

/bin/bash: line 4: bc: command not found
+ head -n '' data/foo/all-lstmf
head: invalid number of lines: ''
+ tail -n '' data/foo/all-lstmf
tail: invalid number of lines: ''
Makefile:165: recipe for target 'data/foo/list.train' failed
make: *** [data/foo/list.train] Error 1

Could you please tell me how I can fix this? Thanks.

Shreeshrii commented 4 years ago

/bin/bash: line 4: bc: command not found

bc is needed to calculate how many lines go into the training and evaluation lists. Please install it.
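
A minimal sketch for checking this before running make is shown below. `have_tool` is a hypothetical helper name, and the install command in the comment assumes a Debian/Ubuntu system.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool bc; then
    echo "bc found"
else
    echo "bc is missing; install it with your package manager"
    # For example, on Debian/Ubuntu (assumption about your system):
    #   sudo apt-get install bc
fi
```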