tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Problem: unicharset_extractor freezes #141

Closed FabioLugli closed 4 years ago

FabioLugli commented 4 years ago

I'm using Ubuntu 16.04 under WSL on Windows. I have installed Tesseract and Leptonica correctly, but when I run the command sudo make training, the terminal freezes at this line: unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"

Nothing happens after that, and I have to stop the process. Other commands, such as sudo make lists, work fine. What could be the problem?

Shreeshrii commented 4 years ago

I have run into this problem too and do not know what causes it. Possibly there are too many .gt.txt files.

My workaround has been to copy my training text as "data/foo/all-gt" before running make training.

example:

cd data/$MODEL

for f in $SCRIPTPATH/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/*.gt.txt; do (cat "${f}"; echo) >> all-gt; done

cat /home/ubuntu/langdata_save_lstm/ara/ara.minusnew.training_text >> all-gt

cd ../..

nohup make training \
MODEL_NAME=$MODEL \
LANG_TYPE=RTL \
BUILD_TYPE=Minus \
TESSDATA=/home/ubuntu/tessdata_best \
GROUND_TRUTH_DIR=$SCRIPTPATH/OCR_GS_Data/ara \
START_MODEL=script/Arabic \
RATIO_TRAIN=0.99 \
DEBUG_INTERVAL=-1 \
MAX_ITERATIONS=200000 > $MODEL.log &

I also create the all-lstmf file outside of the Makefile process.

FabioLugli commented 4 years ago

Thanks for the quick response, I'll try it immediately.

artisvirat commented 4 years ago

@FabioLugli Did it work?

FabioLugli commented 4 years ago

I tried to follow Shreeshrii's procedure, without success. Looking at how I had created the all-gt file, I found that each line ended with CRLF (Windows format) instead of just LF (Linux format). After I changed the line endings to LF, the procedure ran correctly.
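
For anyone hitting the same freeze, here is a minimal sketch of one way to check for and remove Windows line endings in the ground-truth text files. The function names (has_crlf, strip_cr, fix_ground_truth) are hypothetical helpers written for this example and are not part of tesstrain. Note that it deletes every carriage return in a file, not only those directly before a line feed, which should be fine for plain ground-truth text.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of tesstrain.

# Succeed if the file contains at least one carriage return (CR).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Rewrite the file in place with all carriage returns removed.
strip_cr() {
    tmp="$1.tmp.$$"
    tr -d '\r' < "$1" > "$tmp" && mv "$tmp" "$1"
}

# Convert every *.gt.txt file under the given directory to LF endings.
fix_ground_truth() {
    find "$1" -type f -name '*.gt.txt' | while IFS= read -r f; do
        if has_crlf "$f"; then
            strip_cr "$f"
            echo "converted: $f"
        fi
    done
}
```

Usage would look like fix_ground_truth data/foo-ground-truth, assuming that is where the .gt.txt files live. If data/foo/all-gt was already generated from files with CRLF endings, it probably needs to be deleted (or passed through strip_cr as well) so that it is rebuilt with LF endings before running make training again. The dos2unix utility, where installed, does the same conversion.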