tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

make training Error: missing ground truth for training #147

Closed royudev closed 4 years ago

royudev commented 4 years ago

I've followed the instructions on how to train with images, but I keep getting this error:

PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/wackenroder_herzensergiessungen_1797_0051_001.tif" -t "data/foo-ground-truth/wackenroder_herzensergiessungen_1797_0051_001.gt.txt" > "data/foo-ground-truth/wackenroder_herzensergiessungen_1797_0051_001.box"
+ tesseract data/foo-ground-truth/wackenroder_herzensergiessungen_1797_0051_001.tif data/foo-ground-truth/wackenroder_herzensergiessungen_1797_0051_001 --psm 6 lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-635-g90405 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
find data/foo-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/foo/all-lstmf"
Error: missing ground truth for training
Makefile:147: recipe for target 'data/foo/list.train' failed
make: *** [data/foo/list.train] Error 1

It keeps showing this error: Error: missing ground truth for training

The command I used: make training

The image and ground truth text are from the same repo (ocrd-testset.zip).

What could be the solution to fix this?

Shreeshrii commented 4 years ago

How are you running the training?

I ran it after a fresh install and it worked fine. See attached log.

git clone https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain
mkdir data
unzip /home/ubuntu/tesstrain/ocrd-testset.zip -d data/foo-ground-truth
nohup make training & 

nohup.out.txt

royudev commented 4 years ago

I did the same steps you did:

git clone https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain
mkdir data

I downloaded the zip file and unzipped it into data/foo-ground-truth, then ran:

make training

wrznr commented 4 years ago

@royudev I cannot reproduce your error. Could you please run the exact steps which @Shreeshrii proposed and post the file nohup.out.txt here? In addition, please run ls -l on data and on data/foo-ground-truth.

royudev commented 4 years ago

Hi @wrznr, I was able to make it work. I've attached the nohup.out.txt.

Here's the ls -l of data:

total 256
drwxrwxrwx 1 vagrant vagrant 262144 Mar 13 18:01 foo-ground-truth

The weird thing is that when I use only one .tif file and its matching .gt.txt in foo-ground-truth (for example: alexis_ruhe01_1852_0018_022.gt.txt and alexis_ruhe01_1852_0018_022.tif), I always get the Error: missing ground truth for training

Shreeshrii commented 4 years ago

You need lines for training as well as for evaluation. The default ratio is 9:1 (I think). So, use at least 10 pairs of line image and ground truth text.
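To illustrate why a single line fails, here is a minimal sketch of the split arithmetic. It is not the actual Makefile code: the function name `split_counts` is hypothetical, and it assumes the training count is the line total times the training ratio, truncated to a whole number, with the rest going to evaluation.

```python
from fractions import Fraction


def split_counts(total_lines, ratio_train="0.9"):
    """Split a number of ground truth lines into (train, eval) counts.

    Assumption: the training share is truncated to a whole number and
    the remainder is used for evaluation. The ratio is passed as a
    string so Fraction can represent it exactly (no float rounding).
    """
    train = int(total_lines * Fraction(ratio_train))
    return train, total_lines - train


# One line with a 9:1 ratio leaves nothing for training.
print(split_counts(1))
# Ten lines with a 9:1 ratio give nine for training and one for evaluation.
print(split_counts(10))
# Two lines with a 1:1 ratio give one line each.
print(split_counts(2, "0.5"))
```

Under these assumptions, a single .tif/.gt.txt pair yields zero training lines, which matches the situation in which the error was reported.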

ghost commented 4 years ago

Hi @royudev, I have the same error. How did you fix this?

wrznr commented 4 years ago

@atuanbk58 Please see the answer by @Shreeshrii: it is not fixable. You will need at least two lines of GT when setting the ratio to 1:1, or ten lines with the default ratio of 9:1 between training and test set. However, neither scenario will result in a usable model. For training OCR, you will need several hundred GT lines.
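Before running make training, it may help to count how many usable pairs the ground truth directory actually contains. The following is a rough sketch, not part of tesstrain: the helper name `count_gt_pairs` is hypothetical, and it assumes each transcription is named `<name>.gt.txt` and sits next to an image called `<name>.tif` or `<name>.png`.

```python
from pathlib import Path


def count_gt_pairs(gt_dir, image_suffixes=(".tif", ".png")):
    """Count .gt.txt files that have a matching line image next to them.

    Assumption: a pair is <name>.gt.txt plus <name>.tif or <name>.png
    in the same directory. Other image formats are not considered here.
    """
    gt_dir = Path(gt_dir)
    pairs = 0
    for gt_file in gt_dir.glob("*.gt.txt"):
        # Strip the ".gt.txt" suffix to get the shared base name.
        stem = gt_file.name[: -len(".gt.txt")]
        if any((gt_dir / (stem + suffix)).is_file() for suffix in image_suffixes):
            pairs += 1
    return pairs
```

For example, `count_gt_pairs("data/foo-ground-truth")` returns the number of matched pairs in the directory used in this thread. A result of 1 corresponds to the single-pair case that triggered the error above.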