tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

DNM- Add support for different types of languages and finetune options - including option to train from text files #87

Closed Shreeshrii closed 3 years ago

Shreeshrii commented 5 years ago

Example usage for a Sanskrit language model, using script/Devanagari as the model to continue from:

 make training MODEL_NAME=san START_MODEL=script/Devanagari  LANG_TYPE=Indic FINETUNE_TYPE=Impact

 make training MODEL_NAME=san START_MODEL=script/Devanagari  LANG_TYPE=Indic FINETUNE_TYPE=Plus

 make training MODEL_NAME=san START_MODEL=script/Devanagari  LANG_TYPE=Indic FINETUNE_TYPE=Layer
Shreeshrii commented 5 years ago

Attached is a zip file with box/tif pairs and also tif and ground truth files that can be used for testing.

The files are meant for fine-tuning the case above (san from script/Devanagari). Fine-tuning san from san or hin can also be tried.

san-ground-truth.zip

wrznr commented 5 years ago

@Shreeshrii May we add the sample data you provide to the test data in this repo?

Shreeshrii commented 5 years ago

May we add the sample data you provide to the test data in this repo?

Sure.

Shreeshrii commented 5 years ago

I noticed a couple of problems with the current PR.

  1. WordStr box files seem to produce an incorrect unicharset, i.e. the characters W o r d S t r are getting added to the unicharset.

  2. all-boxes is a concatenation of ALL box files and is used for creating the unicharset. It is becoming very large; in my current test it is over 70 MB.

There may be a different way to create the unicharset: unicharset_extractor can also be used with ground truth text files.

ubuntu@tesseract-ocr:~/ocrd-train/data$ unicharset_extractor
Usage: unicharset_extractor [--output_unicharset filename] [--norm_mode mode] box_or_text_file [...]
Where mode means:
 1=combine graphemes (use for Latin and other simple scripts)
 2=split graphemes (use for Indic/Khmer/Myanmar)
 3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
Reads box or plain text files to extract the unicharset.
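
A rough sketch of the text-file route: the ground truth text could be collected into one plain text file and passed to unicharset_extractor in place of all-boxes. The helper name collect_gt_text, the *.gt.txt suffix, and the data/san paths are assumptions for illustration; only the --output_unicharset and --norm_mode options come from the usage text above. The extractor call is left as a comment so the helper can be tried without touching an existing unicharset.

```shell
#!/bin/sh
# Sketch: gather ground truth text into a single plain text file.
# Assumed (not part of this PR): helper name, *.gt.txt suffix, paths below.
collect_gt_text() {
  gt_dir=$1   # directory holding the ground truth text files
  out=$2      # combined plain text file to write
  # Stable order, one file at a time, so file names with spaces are safe.
  find "$gt_dir" -name '*.gt.txt' | sort | while IFS= read -r f; do
    cat "$f"
  done > "$out"
}

# Possible usage (norm mode 2 is the one listed for Indic scripts):
#   collect_gt_text data/san-ground-truth data/san/all-gt
#   unicharset_extractor --output_unicharset data/san/unicharset \
#     --norm_mode 2 data/san/all-gt
```

A text-only input would contain no box markup, so box-file keywords would not be read as characters.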

I have marked this PR as Do not merge.

Shreeshrii commented 5 years ago

It can further be modified to add functionality for training from text and fonts.

Please see https://github.com/tesseract-ocr/tesstrain/issues/93#issue-492026547

Shreeshrii commented 5 years ago

I am still having some problems getting the modified makefile to work with training_text and fonts. The first run stops with an error related to all-lstmf. When I give the same command again, it continues on from where it stopped without any error and completes successfully.

I would appreciate suggestions to fix it.

You can reproduce the problem by:

rm -rf data/eng-ground-truth
mkdir data/eng-ground-truth

cd data/eng-ground-truth
# Get config and dawg source files from langdata_lstm
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/eng/eng.numbers
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/eng/eng.punc
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/eng/eng.wordlist

# Get/Create fontlist and training text 
wget https://github.com/tesseract-ocr/langdata/raw/master/eng/eng.training_text
echo  "Impact Condensed" > eng.fonts_list

cd ../..

make clean MODEL_NAME=eng

make -r training \
MODEL_NAME=eng  \
BUILD_TYPE=Impact  \
START_MODEL=eng \
MAX_ITERATIONS=400 \
GROUND_TRUTH_DIR=data/eng-ground-truth \
FONTS_DIR=/home/ubuntu/.fonts \
FONTS_LIST=data/eng-ground-truth/eng.fonts_list \
TRAINING_TEXT=data/eng-ground-truth/eng.training_text 

The error I get is at the bottom - Makefile:180: recipe for target 'data/eng/list.train' failed

Rendered page 0 to file data/eng-ground-truth/Impact_Condensed-69.exp0.tif
Rendered page 0 to file data/eng-ground-truth/Impact_Condensed-70.exp0.tif
Rendered page 0 to file data/eng-ground-truth/Impact_Condensed-71.exp0.tif
Rendered page 0 to file data/eng-ground-truth/Impact_Condensed-72.exp0.tif
mkdir -p data/eng
unicharset_extractor --output_unicharset "data/eng/unicharset" --norm_mode 1 "data/eng/all-gt"
Bad box coordinates in boxfile string! different New Articles page 23 a To Service ~~ a details DC that don't as 7 «« Date: #1 : AZ
Extracting unicharset from plain text file data/eng/all-gt
Other case É of é is not in unicharset
Wrote unicharset file data/eng/unicharset
mkdir -p data/eng
find data/eng-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/eng/all-lstmf"
mkdir -p data/eng
total=$(wc -l < data/eng/all-lstmf); \
  train=$(echo "$total * 0.99 / 1" | bc); \
  test "$train" = "0" && \
    echo "Error: missing ground truth for training" && exit 1; \
  eval=$(echo "$total - $train" | bc); \
  test "$eval" = "0" && \
    echo "Error: missing ground truth for evaluation" && exit 1; \
  head -n "$train" data/eng/all-lstmf > "data/eng/list.train"; \
  tail -n "$eval" data/eng/all-lstmf > "data/eng/list.eval"
Error: missing ground truth for training
Makefile:180: recipe for target 'data/eng/list.train' failed
make: *** [data/eng/list.train] Error 1
ubuntu@tesseract-ocr:~/tesstrain$
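
The failing recipe can be traced by hand. In the log above, all-lstmf is written by find before any tesseract ... lstm.train command appears, which suggests the list is empty on the first run; the training share then comes out as 0 and the recipe exits. The sketch below reproduces the split arithmetic with integer shell arithmetic in place of bc. split_counts is a hypothetical helper name; the 0.99 ratio and the error messages are taken from the log.

```shell
#!/bin/sh
# Sketch of the train/eval split from the recipe above.
# split_counts is a hypothetical name; total * 99 / 100 truncates to the same
# integer that "$total * 0.99 / 1" yields in bc with its default scale.
split_counts() {
  total=$1
  train=$(( total * 99 / 100 ))
  eval_n=$(( total - train ))
  if [ "$train" -eq 0 ]; then
    echo "Error: missing ground truth for training"
    return 1
  fi
  if [ "$eval_n" -eq 0 ]; then
    echo "Error: missing ground truth for evaluation"
    return 1
  fi
  echo "train=$train eval=$eval_n"
}

split_counts 0 || true   # empty all-lstmf: prints the training error
split_counts 200         # prints train=198 eval=2
```

If that reading is right, comparing the number of .tif and .lstmf files in the ground truth directory after the first run would confirm it.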

Running the same make command again works:

Error: missing ground truth for training
Makefile:180: recipe for target 'data/eng/list.train' failed
make: *** [data/eng/list.train] Error 1
ubuntu@tesseract-ocr:~/tesstrain$ make -r training MODEL_NAME=eng  BUILD_TYPE=Impact  START_MODEL=eng MAX_ITERATIONS=400 GROUND_TRUTH_DIR=data/eng-ground-truth FONTS_DIR=/home/ubuntu/.fonts FONTS_LIST=data/eng-ground-truth/eng.fonts_list TRAINING_TEXT=data/eng-ground-truth/eng.training_text
Makefile:373: warning: overriding recipe for target '/home/ubuntu/tessdata_best/eng.traineddata'
Makefile:198: warning: ignoring old recipe for target '/home/ubuntu/tessdata_best/eng.traineddata'
tesseract data/eng-ground-truth/Impact_Condensed-65.exp0.tif data/eng-ground-truth/Impact_Condensed-65.exp0 --psm 6 --dpi 300 lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-386-gbb4c6 with Leptonica
Page 1
tesseract data/eng-ground-truth/Impact_Condensed-40.exp0.tif data/eng-ground-truth/Impact_Condensed-40.exp0 --psm 6 --dpi 300 lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-386-gbb4c6 with Leptonica
Page 1
tesseract data/eng-ground-truth/Impact_Condensed-66.exp0.tif data/eng-ground-truth/Impact_Condensed-66.exp0 --psm 6 --dpi 300 lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-386-gbb4c6 with Leptonica
Page 1
tesseract data/eng-ground-truth/Impact_Condensed-61.exp0.tif data/eng-ground-truth/Impact_Condensed-61.exp0 --psm 6 --dpi 300 lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-386-gbb4c6 with Leptonica
stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

wrznr commented 4 years ago

@Shreeshrii Could you pls. rebase to the current master and check whether your problem persists?

Shreeshrii commented 4 years ago

@wrznr This PR has become too unwieldy for review because of the addition of many options.

I suggest leaving this as reference for the time being and reviewing https://github.com/tesseract-ocr/tesstrain/pull/118 instead. Once that is OKed, then the rest can be added.

Shreeshrii commented 3 years ago

See https://github.com/wincentbalin/pytesstrain/blob/master/pytesstrain/cli/create_ground_truth.py for a script that creates single-line ground truth files from a source file or directory.

Shreeshrii commented 3 years ago

See https://github.com/cmroughan/kraken_generated-data/blob/master/tools/count_chars.py to get a character count from the training_text.
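
For a quick look without the linked script, a character frequency count can also be sketched with standard tools. count_chars is a hypothetical helper name, and this is not a reimplementation of count_chars.py, whose output format may differ. It assumes a UTF-8 locale so that grep treats a multi-byte character as one character.

```shell
#!/bin/sh
# Sketch: per-character frequency of a training text, most frequent first.
# count_chars is a hypothetical helper, not the linked Python script.
count_chars() {
  # grep -o . emits one character per line; whitespace-only lines are dropped.
  grep -o . "$1" | grep -v '^[[:space:]]*$' | sort | uniq -c | sort -rn
}

# Possible usage:
#   count_chars data/eng-ground-truth/eng.training_text | head -n 20
```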