tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

error when training #356

Closed by CaiWenlie 7 months ago

CaiWenlie commented 8 months ago

I'm trying to train a model by following the steps, and training starts successfully.

# my command
make training MODEL_NAME=captcha

After about 10 minutes, it logs the following:

python shuffle.py 0 "data/captcha/all-lstmf"
+ head -n 936 data/captcha/all-lstmf
+ tail -n 104 data/captcha/all-lstmf
+ '[' Windows_NT = Windows_NT ']'
+ dos2unix data/captcha/all-lstmf
dos2unix: converting file data/captcha/all-lstmf to Unix format...
+ dos2unix data/captcha/list.train
dos2unix: converting file data/captcha/list.train to Unix format...
+ dos2unix data/captcha/list.eval
dos2unix: converting file data/captcha/list.eval to Unix format...
if [ "Windows_NT" = "Windows_NT" ]; then \
        dos2unix "data/captcha/captcha.numbers"; \
        dos2unix "data/captcha/captcha.punc"; \
        dos2unix "data/captcha/captcha.wordlist"; \
        dos2unix "data/langdata/captcha/captcha.config"; \
fi
dos2unix: data/captcha/captcha.numbers: No such file or directory
dos2unix: Skipping data/captcha/captcha.numbers, not a regular file.
dos2unix: data/captcha/captcha.punc: No such file or directory
dos2unix: Skipping data/captcha/captcha.punc, not a regular file.
dos2unix: data/captcha/captcha.wordlist: No such file or directory
dos2unix: Skipping data/captcha/captcha.wordlist, not a regular file.
dos2unix: data/langdata/captcha/captcha.config: No such file or directory
dos2unix: Skipping data/langdata/captcha/captcha.config, not a regular file.
make: *** [Makefile:308: data/captcha/captcha.traineddata] Error 2

Here are the files in the output folder:

all-gt
all-lstmf
list.eval
list.train
unicharset

The 'traineddata' file was not created. What's wrong?

nkrot commented 7 months ago

You can try basing your new model captcha on an existing model, for example English. To do this, set a starter model via START_MODEL=eng, something like this:

make training START_MODEL=eng TESSDATA=/path/to/tessdata/

where TESSDATA should be the path to the directory where you downloaded https://github.com/tesseract-ocr/tessdata. This directory contains many prebuilt model files, named LANG.traineddata.

With START_MODEL=eng, the training procedure will look for the file eng.traineddata at $(TESSDATA)/$(START_MODEL).traineddata (see the Makefile).
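
Before starting a long training run, it may help to confirm that the starter model file is where that path pattern points. The following is only a minimal sketch: check_start_model is a hypothetical helper, not part of tesstrain, and it assumes the lookup is simply the TESSDATA directory plus START_MODEL plus the .traineddata suffix, as described above.

```shell
# Hypothetical helper (not part of tesstrain): checks that the starter model
# exists at <TESSDATA>/<START_MODEL>.traineddata, the path pattern described above.
check_start_model() {
    tessdata_dir="${1%/}"   # drop one trailing slash, if present
    start_model="$2"
    model_file="$tessdata_dir/$start_model.traineddata"

    if [ -f "$model_file" ]; then
        echo "found: $model_file"
    else
        echo "missing: $model_file" >&2
        return 1
    fi
}
```

It could then be chained in front of the training command, so that make only runs when the file is present:

check_start_model /path/to/tessdata eng && make training MODEL_NAME=captcha START_MODEL=eng TESSDATA=/path/to/tessdata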

CaiWenlie commented 7 months ago

Thanks, it works fine!