tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

more make training errors with sample ground truth #58

Closed ameera3 closed 5 years ago

ameera3 commented 5 years ago

I am having an issue similar to https://github.com/OCR-D/ocrd-train/issues/47 except that I have already extracted ocrd-testset.zip to ./data/ground-truth.

I typed the commands:

root@CUDA1:/home/ocrd-train# export PYTHONIOENCODING=utf8
root@CUDA1:/home/ocrd-train# make training

Output:

tesseract data/ground-truth/alexis_ruhe01_1852_0018_022.tif data/ground-truth/alexis_ruhe01_1852_0018_022 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Failed to read boxes from data/ground-truth/alexis_ruhe01_1852_0018_022.tif
. . .
tesseract data/ground-truth/wienbarg_feldzuege_1834_0318_006.tif data/ground-truth/wienbarg_feldzuege_1834_0318_006 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Failed to read boxes from data/ground-truth/wienbarg_feldzuege_1834_0318_006.tif
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "$total * 0.90 / 1" | bc`; \
   head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "($total - $total * 0.90) / 1" | bc`; \
   tail -n "$no" data/all-lstmf > "data/list.eval"
wget -Odata/radical-stroke.txt 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt'
--2019-02-27 23:40:21-- https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt
Resolving github.com (github.com)... 192.30.255.112, 192.30.255.113
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt [following]
--2019-02-27 23:40:21-- https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.196.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.196.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330874 (323K) [text/plain]
Saving to: 'data/radical-stroke.txt'

data/radical-stroke.txt 100%[======================================================================>] 323.12K --.-KB/s in 0.03s

2019-02-27 23:40:22 (11.5 MB/s) - 'data/radical-stroke.txt' saved [330874/330874]

combine_lang_model \
  --input_unicharset data/unicharset \
  --script_dir data/ \
  --output_dir data/ \
  --lang foo
Loaded unicharset of size 15 from file data/unicharset
Setting unichar properties
Other case A of a is not in unicharset
Other case N of n is not in unicharset
Other case D of d is not in unicharset
Other case E of e is not in unicharset
Other case R of r is not in unicharset
Other case h of H is not in unicharset
Other case F of f is not in unicharset
Other case c of C is not in unicharset
Other case V of v is not in unicharset
Other case L of l is not in unicharset
Other case I of i is not in unicharset
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Warning: properties incomplete for index 3 = a
Warning: properties incomplete for index 4 = n
Warning: properties incomplete for index 5 = d
Warning: properties incomplete for index 6 = e
Warning: properties incomplete for index 7 = r
Warning: properties incomplete for index 8 = H
Warning: properties incomplete for index 9 = f
Warning: properties incomplete for index 10 = C
Warning: properties incomplete for index 11 = v
Warning: properties incomplete for index 12 = l
Warning: properties incomplete for index 13 = i
Warning: properties incomplete for index 14 = .
Config file is optional, continuing...
Failed to read data from: data//foo/foo.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
  --traineddata data/foo/foo.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
  --model_output data/checkpoints/foo \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000
Warning: given outputs 15 not equal to unicharset of 14.
Num outputs,weights in Series:
  1,36,0,1:1, 0
Num outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys48:48, 12480
  Lfx96:96, 55680
  Lrx96:96, 74112
  Lfx256:256, 361472
  Fc14:14, 3598
Total weights = 507502
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys48Lfx96Lrx96Lfx256Fc14] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c15]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=13
Loaded 1/1 pages (1-1) of document data/ground-truth/alexis_ruhe01_1852_0332_007.lstmf
Failed to load list of eval filenames from data/list.eval
Failed to load eval data from: data/list.eval
Makefile:144: recipe for target 'data/checkpoints/foo_checkpoint' failed
make: *** [data/checkpoints/foo_checkpoint] Error 1

ameera3 commented 5 years ago

I consulted https://github.com/OCR-D/ocrd-train/issues/40. My list.eval is empty, even though I have already extracted ocrd-testset.zip to ./data/ground-truth.
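For readers trying to see when list.eval can end up empty, the split arithmetic from the logged recipe can be reproduced outside of make. The following is a minimal sketch in Python, not part of the Makefile: the function name `split_counts` is hypothetical, and it assumes `bc` runs with its default scale of 0, so that dividing by 1 truncates the result to an integer.

```python
from decimal import Decimal, ROUND_DOWN

# Ratio that appears in the logged recipe ("$total * 0.90 / 1").
RATIO_TRAIN = Decimal("0.90")


def split_counts(total, ratio=RATIO_TRAIN):
    """Return (n_train, n_eval) the way the logged bc expressions compute them.

    Assumption: bc uses its default scale of 0, so "/ 1" truncates
    toward zero. `split_counts` is an illustrative name, not a tesstrain API.
    """
    total = Decimal(total)
    # head -n "$no": no = total * 0.90 / 1
    n_train = int((total * ratio).to_integral_value(rounding=ROUND_DOWN))
    # tail -n "$no": no = (total - total * 0.90) / 1
    n_eval = int((total - total * ratio).to_integral_value(rounding=ROUND_DOWN))
    return n_train, n_eval


if __name__ == "__main__":
    # Number of lines in data/all-lstmf -> lines written to list.train / list.eval
    for n in (0, 1, 5, 9, 10, 100):
        print(n, split_counts(n))
```

Under these assumptions, any data/all-lstmf with fewer than 10 entries yields an eval count of 0, which would leave data/list.eval empty. Whether that is what happened in this run cannot be confirmed from the log alone.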

ameera3 commented 5 years ago

Regarding the output: Failed to load script unicharset from:data//Latin.unicharset

I do not have a directory called Latin in my data directory. Why?

ameera3 commented 5 years ago

Regarding the output: Failed to read data from: data//foo/foo.config

I do have a foo directory in my data directory, but no foo.config within my foo directory.

ameera3 commented 5 years ago

make clean followed by make training seems to fix things for now...

root@CUDA1:/home/ocrd-train# make clean
find data/ground-truth -name '*.box' -delete
find data/ground-truth -name '*.lstmf' -delete
rm -rf data/all-*
rm -rf data/list.*
rm -rf data/foo
rm -rf data/unicharset
rm -rf data/checkpoints
root@CUDA1:/home/ocrd-train# make training
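
Since `make clean` deletes the generated .box and .lstmf files and the list files, it can help to look at what is on disk before and after rerunning `make training`. The following is a rough sketch, not part of the repository: the helper name `inventory` is hypothetical, and the only paths it assumes are the ones that appear in the log above (data/ground-truth, data/all-lstmf, data/list.train, data/list.eval).

```python
from pathlib import Path


def inventory(data_dir="data"):
    """Count ground-truth files by extension and lines in the list files.

    Hypothetical helper for inspection only; it does not modify anything.
    A list file that does not exist is reported as None.
    """
    data = Path(data_dir)
    gt = data / "ground-truth"
    counts = {}
    for ext in (".tif", ".box", ".lstmf"):
        # Search recursively, like `find data/ground-truth -name '*.ext'`.
        counts[ext] = sum(1 for _ in gt.rglob(f"*{ext}")) if gt.is_dir() else 0
    for name in ("all-lstmf", "list.train", "list.eval"):
        path = data / name
        counts[name] = len(path.read_text().splitlines()) if path.is_file() else None
    return counts


if __name__ == "__main__":
    for key, value in inventory().items():
        print(f"{key}: {value}")
```

If the .lstmf count is much lower than the .tif count, or list.eval reports 0 lines, the lists were probably built from an incomplete set of training files. The sketch only reports counts and does not explain why the first run failed.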