tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

foo: Error: missing ground truth for training #224

Closed. yoyos closed this issue 3 years ago

yoyos commented 3 years ago

Hello,

I get an "Error: missing ground truth for training" message with the demo model. Is this a bug, or did I miss something in the documentation?

I'm using https://github.com/tesseract-ocr/tessdata_best

Here are the logs from make training --trace:

tesseract "${image}" data/foo-ground-truth/spielhagen_problematische02_1861_0100_009 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/spielhagen_problematische02_1861_0100_009.tif data/foo-ground-truth/spielhagen_problematische02_1861_0100_009 --psm 13 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-20201224 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
Makefile:208: update target 'data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.box' due to: data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.tif data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.gt.txt
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.tif" -t "data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.gt.txt" > "data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.box"
Makefile:215: update target 'data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.lstmf' due to: data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.box
if test -f "data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.png"; then \
  image="data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.png"; \
elif test -f "data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.bin.png"; then \
  image="data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.bin.png"; \
elif test -f "data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.nrm.png"; then \
  image="data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.nrm.png"; \
else \
  image="data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.tif"; \
fi; \
set -x; \
tesseract "${image}" data/foo-ground-truth/raschdorff_hochbau_1880_0025_016 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.tif data/foo-ground-truth/raschdorff_hochbau_1880_0025_016 --psm 13 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-20201224 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
Makefile:208: update target 'data/foo-ground-truth/keller_heinrich01_1854_0078_013.box' due to: data/foo-ground-truth/keller_heinrich01_1854_0078_013.tif data/foo-ground-truth/keller_heinrich01_1854_0078_013.gt.txt
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/keller_heinrich01_1854_0078_013.tif" -t "data/foo-ground-truth/keller_heinrich01_1854_0078_013.gt.txt" > "data/foo-ground-truth/keller_heinrich01_1854_0078_013.box"
Makefile:215: update target 'data/foo-ground-truth/keller_heinrich01_1854_0078_013.lstmf' due to: data/foo-ground-truth/keller_heinrich01_1854_0078_013.box
if test -f "data/foo-ground-truth/keller_heinrich01_1854_0078_013.png"; then \
  image="data/foo-ground-truth/keller_heinrich01_1854_0078_013.png"; \
elif test -f "data/foo-ground-truth/keller_heinrich01_1854_0078_013.bin.png"; then \
  image="data/foo-ground-truth/keller_heinrich01_1854_0078_013.bin.png"; \
elif test -f "data/foo-ground-truth/keller_heinrich01_1854_0078_013.nrm.png"; then \
  image="data/foo-ground-truth/keller_heinrich01_1854_0078_013.nrm.png"; \
else \
  image="data/foo-ground-truth/keller_heinrich01_1854_0078_013.tif"; \
fi; \
set -x; \
tesseract "${image}" data/foo-ground-truth/keller_heinrich01_1854_0078_013 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/keller_heinrich01_1854_0078_013.tif data/foo-ground-truth/keller_heinrich01_1854_0078_013 --psm 13 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-20201224 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
Makefile:208: update target 'data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.box' due to: data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.tif data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.gt.txt
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.tif" -t "data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.gt.txt" > "data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.box"
Makefile:215: update target 'data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.lstmf' due to: data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.box
if test -f "data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.png"; then \
  image="data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.png"; \
elif test -f "data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.bin.png"; then \
  image="data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.bin.png"; \
elif test -f "data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.nrm.png"; then \
  image="data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.nrm.png"; \
else \
  image="data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.tif"; \
fi; \
set -x; \
tesseract "${image}" data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.tif data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034 --psm 13 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-20201224 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
Makefile:208: update target 'data/foo-ground-truth/spyri_heidi_1880_0062_005.box' due to: data/foo-ground-truth/spyri_heidi_1880_0062_005.tif data/foo-ground-truth/spyri_heidi_1880_0062_005.gt.txt
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/spyri_heidi_1880_0062_005.tif" -t "data/foo-ground-truth/spyri_heidi_1880_0062_005.gt.txt" > "data/foo-ground-truth/spyri_heidi_1880_0062_005.box"
Makefile:215: update target 'data/foo-ground-truth/spyri_heidi_1880_0062_005.lstmf' due to: data/foo-ground-truth/spyri_heidi_1880_0062_005.box
if test -f "data/foo-ground-truth/spyri_heidi_1880_0062_005.png"; then \
  image="data/foo-ground-truth/spyri_heidi_1880_0062_005.png"; \
elif test -f "data/foo-ground-truth/spyri_heidi_1880_0062_005.bin.png"; then \
  image="data/foo-ground-truth/spyri_heidi_1880_0062_005.bin.png"; \
elif test -f "data/foo-ground-truth/spyri_heidi_1880_0062_005.nrm.png"; then \
  image="data/foo-ground-truth/spyri_heidi_1880_0062_005.nrm.png"; \
else \
  image="data/foo-ground-truth/spyri_heidi_1880_0062_005.tif"; \
fi; \
set -x; \
tesseract "${image}" data/foo-ground-truth/spyri_heidi_1880_0062_005 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/spyri_heidi_1880_0062_005.tif data/foo-ground-truth/spyri_heidi_1880_0062_005 --psm 13 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-20201224 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
Makefile:208: update target 'data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.box' due to: data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.tif data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.gt.txt
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.tif" -t "data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.gt.txt" > "data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.box"
Makefile:215: update target 'data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.lstmf' due to: data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.box
if test -f "data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.png"; then \
  image="data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.png"; \
elif test -f "data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.bin.png"; then \
  image="data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.bin.png"; \
elif test -f "data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.nrm.png"; then \
  image="data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.nrm.png"; \
else \
  image="data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.tif"; \
fi; \
set -x; \
tesseract "${image}" data/foo-ground-truth/novalis_ofterdingen_1802_0090_011 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.tif data/foo-ground-truth/novalis_ofterdingen_1802_0090_011 --psm 13 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-20201224 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
Makefile:211: update target 'data/foo/all-lstmf' due to: data/foo-ground-truth/novalis_ofterdingen_1802_0032_001.lstmf data/foo-ground-truth/clauren_liebe_1827_0205_005.lstmf data/foo-ground-truth/keller_sinngedicht_1882_0301_010.lstmf data/foo-ground-truth/wienbarg_feldzuege_1834_0188_011.lstmf data/foo-ground-truth/fiedler_kuenstlerische_1887_0028_022.lstmf data/foo-ground-truth/frapan_bittersuess_1891_0275_003.lstmf data/foo-ground-truth/menzel_literatur01_1828_0060_015.lstmf data/foo-ground-truth/poersch_gewerkschaftsbewegung_1897_0037_015.lstmf (etc...) data/foo-ground-truth/spielhagen_problematische02_1861_0100_009.lstmf data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.lstmf data/foo-ground-truth/keller_heinrich01_1854_0078_013.lstmf data/foo-ground-truth/schleiden_menschengeschlecht_1863_0059_034.lstmf data/foo-ground-truth/spyri_heidi_1880_0062_005.lstmf data/foo-ground-truth/novalis_ofterdingen_1802_0090_011.lstmf
mkdir -p data/foo
find data/foo-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/foo/all-lstmf"
Makefile:166: update target 'data/foo/list.train' due to: data/foo/all-lstmf
mkdir -p data/foo
total=$(wc -l < data/foo/all-lstmf); \
  train=$(echo "$total * 0.90 / 1" | bc); \
  test "$train" = "0" && \
    echo "Error: missing ground truth for training" && exit 1; \
  eval=$(echo "$total - $train" | bc); \
  test "$eval" = "0" && \
    echo "Error: missing ground truth for evaluation" && exit 1; \
  set -x; \
  head -n "$train" data/foo/all-lstmf > "data/foo/list.train"; \
  tail -n "$eval" data/foo/all-lstmf > "data/foo/list.eval"
Error: missing ground truth for training
make: *** [Makefile:167: data/foo/list.train] Error 1

yoyos commented 3 years ago

Ok maybe this will help someone.

I hadn't noticed that there are some submodules to load, and that they contain the lstm.train file. Thanks, documentation!
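
For anyone tracing the log above: the final error seems to follow from the missing config. Without lstm.train, tesseract presumably writes no .lstmf files, so data/foo/all-lstmf ends up empty and the 90% training share comes out as zero. Below is a minimal sketch of that split. The function name split_counts is hypothetical, and it uses shell integer arithmetic in place of the bc call from the Makefile recipe.

```shell
# Hypothetical re-creation of the list.train / list.eval split from the log.
# $1 is the number of lines in all-lstmf (one line per .lstmf file).
split_counts() {
  total=$1
  # Integer stand-in for: echo "$total * 0.90 / 1" | bc
  train=$(( total * 90 / 100 ))
  if [ "$train" -eq 0 ]; then
    echo "Error: missing ground truth for training"
    return 1
  fi
  eval_count=$(( total - train ))
  if [ "$eval_count" -eq 0 ]; then
    echo "Error: missing ground truth for evaluation"
    return 1
  fi
  echo "train=$train eval=$eval_count"
}
```

With an empty list, split_counts 0 prints the training error seen in the log, while split_counts 10 prints train=9 eval=1.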

Here is the git clone command that also clones the included submodules:

git clone --recurse-submodules https://github.com/tesseract-ocr/tessdata_best
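
If the repository has already been cloned without its submodules, cloning again should not be necessary. The sketch below is not part of tesstrain, and the function name ensure_lstm_train is hypothetical. It assumes only that the config is a file named lstm.train somewhere inside the checkout, and otherwise falls back to git submodule update --init --recursive.

```shell
# Hypothetical helper: make sure a tessdata checkout contains lstm.train.
# $1 is the path to the cloned tessdata_best directory.
ensure_lstm_train() {
  repo_dir=$1
  # -L follows symlinks, in case the config directory is a symlink.
  if [ -n "$(find -L "$repo_dir" -name lstm.train -type f 2>/dev/null)" ]; then
    echo "lstm.train found in $repo_dir"
    return 0
  fi
  echo "lstm.train not found, fetching submodules in $repo_dir"
  # Initialise and fetch any submodules the clone skipped.
  git -C "$repo_dir" submodule update --init --recursive
}

# Usage (the path is only an example):
# ensure_lstm_train ./tessdata_best
```

Once the file is present, deleting the stale data/foo/all-lstmf and rerunning make training may be needed so the list is rebuilt.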