tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Feature Request: Add option to train from text and fonts #93

Open Shreeshrii opened 5 years ago

Shreeshrii commented 5 years ago

Adding a new target similar to the following may work.

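# bash is needed for (( )) and $${var// /_}; set it if the Makefile does not already.
SHELL := /bin/bash

FONTLIST = $(SCRIPT_DIR)/$(MODEL_NAME)/ok-fonts.txt
TRAINTEXT = $(SCRIPT_DIR)/$(MODEL_NAME)/$(MODEL_NAME).training_text
# Directory holding the fonts named in FONTLIST; override as needed.
FONTS_DIR ?= /home/ubuntu/.fonts

# Render every line of TRAINTEXT in every font of FONTLIST as a one-line image plus box file.
.PHONY: text2imageboxtiffunicharset
text2imageboxtiffunicharset:
	linenum=0; \
	while read -r fontname; do \
		while read -r trainline; do \
			((linenum += 1)); \
			printf '%s\n' "$$trainline" >tmp.txt; \
			OMP_THREAD_LIMIT=1 text2image --fonts_dir="$(FONTS_DIR)" \
				--strip_unrenderable_words --xsize=2500 --ysize=150 \
				--leading=12 --margin=12 --char_spacing=0.0 --exposure=0 \
				--max_pages=0 --font="$$fontname" --text=tmp.txt \
				--outputbase="$(GROUND_TRUTH_DIR)/$${fontname// /_}-$$linenum.exp0"; \
		done <"$(TRAINTEXT)"; \
	done <"$(FONTLIST)"; \
	cp "$(TRAINTEXT)" "$(ALL_GT)"

Shreeshrii commented 3 years ago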

See https://github.com/wincentbalin/pytesstrain/blob/master/pytesstrain/cli/create_ground_truth.py for a script that creates single-line ground truth files from a source file or directory.
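
For illustration only, here is a minimal bash sketch of the "one ground truth file per line" idea. It is not the pytesstrain script and does not reproduce its options. The function name `split_ground_truth`, its arguments, and the `PREFIX-N.gt.txt` naming are assumptions made for this example. It covers only the transcription side: it writes the `.gt.txt` files and renders no images.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of tesstrain or pytesstrain):
# write each non-empty line of SRC to its own OUTDIR/PREFIX-N.gt.txt file.
split_ground_truth() {
    local src="$1" outdir="$2" prefix="${3:-line}"
    local n=0 line
    mkdir -p "$outdir"
    # The "|| [ -n "$line" ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue          # skip empty lines
        n=$((n + 1))
        printf '%s\n' "$line" >"$outdir/$prefix-$n.gt.txt"
    done <"$src"
    echo "$n"                               # print the number of files written
}
```

A call such as `split_ground_truth my.training_text data/foo-ground-truth foo` (paths are placeholders) would write `foo-1.gt.txt`, `foo-2.gt.txt`, and so on. Each line image still has to be rendered separately under the same base name as its `.gt.txt` file.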