Closed: Shreeshrii closed this 3 years ago
Attached is a zip file with box/tif pairs and also tif and ground truth files that can be used for testing.
Finetuning in the above case: san from script/Devanagari.
Finetuning can also be tried with san from san or hin.
@Shreeshrii May we add the sample data you provide to the test data in this repo?
Sure.
Noticed a couple of problems with the current PR.
WordStr box files seem to be creating an incorrect unicharset, i.e. the characters W o r d S t r are getting added to the unicharset.
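To illustrate the WordStr problem, here is a minimal Python sketch that takes characters only from the transcription part of WordStr lines, so the literal keyword never contributes to the character set. It assumes the line layout is `WordStr <left> <bottom> <right> <top> <page> #<text>`, which is not spelled out in this thread. The function names are hypothetical and this is not how unicharset_extractor itself works.

```python
from typing import Iterable, Optional, Set


def wordstr_text(box_line: str) -> Optional[str]:
    """Return the transcription of a WordStr box line, or None for other lines.

    Assumed layout (an assumption, not confirmed in this thread):
        WordStr <left> <bottom> <right> <top> <page> #<text>
    """
    if not box_line.startswith("WordStr "):
        return None
    # Everything after the first '#' is treated as the line transcription.
    _, sep, text = box_line.rstrip("\n").partition("#")
    return text if sep else None


def chars_from_wordstr(lines: Iterable[str]) -> Set[str]:
    """Collect characters from transcriptions only, skipping the keyword itself."""
    chars: Set[str] = set()
    for line in lines:
        text = wordstr_text(line)
        if text is not None:
            chars.update(text)
    return chars


# Hypothetical sample: one WordStr line followed by an end-of-line marker line.
sample = [
    "WordStr 0 0 120 40 0 #ab ba",
    "\t 121 0 125 40 0",
]
print(sorted(chars_from_wordstr(sample)))
```

With this kind of filtering, W, o, r, d, S and t would only appear in the result if they occur in the transcriptions themselves.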
all-boxes is a concatenation of ALL box files that is used for creating the unicharset. all-boxes is becoming very large. In my current test it is over 70 MB.
There may be a different way to create the unicharset. unicharset_extractor can also be used with ground truth text files.
ubuntu@tesseract-ocr:~/ocrd-train/data$ unicharset_extractor
Usage: unicharset_extractor [--output_unicharset filename] [--norm_mode mode] box_or_text_file [...]
Where mode means:
1=combine graphemes (use for Latin and other simple scripts)
2=split graphemes (use for Indic/Khmer/Myanmar)
3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
Reads box or plain text files to extract the unicharset.
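Since the usage text above accepts several box or text files, one option is to pass the ground truth text files directly instead of concatenating everything first. The Python sketch below only assembles such a command line. The helper name `unicharset_command`, the default of mode 2 and the example paths are assumptions for illustration; the flags are the ones printed in the usage text.

```python
from typing import List, Sequence


def unicharset_command(text_files: Sequence[str], output: str,
                       norm_mode: int = 2) -> List[str]:
    """Build an argument list for unicharset_extractor.

    Uses only the flags shown in the usage text: --output_unicharset and
    --norm_mode. Mode 2 (split graphemes) is the one listed for Indic scripts.
    """
    if norm_mode not in (1, 2, 3):
        raise ValueError("norm_mode must be 1, 2 or 3")
    if not text_files:
        raise ValueError("at least one box or text file is required")
    return [
        "unicharset_extractor",
        "--output_unicharset", output,
        "--norm_mode", str(norm_mode),
        *sorted(text_files),  # sorted so the command is reproducible
    ]


# Hypothetical file names, for illustration only.
cmd = unicharset_command(["b.gt.txt", "a.gt.txt"], "data/san/unicharset")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`. For very large numbers of files the operating system's command-line length limit may still make a concatenated input file necessary.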
I have marked this PR as Do not merge.
It can be modified further to add the functionality of training from text and fonts.
Please see https://github.com/tesseract-ocr/tesstrain/issues/93#issue-492026547
I am still having some problems getting the modified makefile to work with training_text and fonts. The first run stops with an error related to all-lstmf. When I give the same command again, it continues on from where it stopped without any error and completes successfully.
I would appreciate suggestions to fix it.
You can reproduce the problem by:
rm -rf data/eng-ground-truth
mkdir data/eng-ground-truth
cd data/eng-ground-truth
# Get config and dawg source files from langdata_lstm
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/eng/eng.numbers
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/eng/eng.punc
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/eng/eng.wordlist
# Get/Create fontlist and training text
wget https://github.com/tesseract-ocr/langdata/raw/master/eng/eng.training_text
echo "Impact Condensed" > eng.fonts_list
cd ../..
make clean MODEL_NAME=eng
make -r training \
MODEL_NAME=eng \
BUILD_TYPE=Impact \
START_MODEL=eng \
MAX_ITERATIONS=400 \
GROUND_TRUTH_DIR=data/eng-ground-truth \
FONTS_DIR=/home/ubuntu/.fonts \
FONTS_LIST=data/eng-ground-truth/eng.fonts_list \
TRAINING_TEXT=data/eng-ground-truth/eng.training_text
The error I get is at the bottom - Makefile:180: recipe for target 'data/eng/list.train' failed
Rendered page 0 to file data/eng-ground-truth/Impact_Condensed-69.exp0.tif
Rendered page 0 to file data/eng-ground-truth/Impact_Condensed-70.exp0.tif
Rendered page 0 to file data/eng-ground-truth/Impact_Condensed-71.exp0.tif
Rendered page 0 to file data/eng-ground-truth/Impact_Condensed-72.exp0.tif
mkdir -p data/eng
unicharset_extractor --output_unicharset "data/eng/unicharset" --norm_mode 1 "data/eng/all-gt"
Bad box coordinates in boxfile string! different New Articles page 23 a To Service ~~ a details DC that don't as 7 «« Date: #1 : AZ
Extracting unicharset from plain text file data/eng/all-gt
Other case É of é is not in unicharset
Wrote unicharset file data/eng/unicharset
mkdir -p data/eng
find data/eng-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/eng/all-lstmf"
mkdir -p data/eng
total=$(wc -l < data/eng/all-lstmf); \
train=$(echo "$total * 0.99 / 1" | bc); \
test "$train" = "0" && \
echo "Error: missing ground truth for training" && exit 1; \
eval=$(echo "$total - $train" | bc); \
test "$eval" = "0" && \
echo "Error: missing ground truth for evaluation" && exit 1; \
head -n "$train" data/eng/all-lstmf > "data/eng/list.train"; \
tail -n "$eval" data/eng/all-lstmf > "data/eng/list.eval"
Error: missing ground truth for training
Makefile:180: recipe for target 'data/eng/list.train' failed
make: *** [data/eng/list.train] Error 1
ubuntu@tesseract-ocr:~/tesstrain$
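For reference, the split arithmetic in the recipe echoed above can be mirrored in Python. This is only a sketch of the logic visible in the log, with a hypothetical function name. It shows that the "missing ground truth for training" branch is reached when all-lstmf has zero lines (or just one), which would be consistent with the list having been written before any .lstmf files existed on the first run.

```python
from typing import Tuple


def split_counts(total: int, ratio_percent: int = 99) -> Tuple[int, int]:
    """Mirror the recipe: train = trunc(total * 0.99), eval = total - train.

    Integer arithmetic is used so the truncation matches what bc does
    with `/ 1` at its default scale.
    """
    train = total * ratio_percent // 100
    if train == 0:
        raise ValueError("missing ground truth for training")
    eval_count = total - train
    if eval_count == 0:
        raise ValueError("missing ground truth for evaluation")
    return train, eval_count


print(split_counts(72))  # 72 entries, e.g. one per rendered page

try:
    split_counts(0)  # an empty all-lstmf
except ValueError as err:
    print("Error:", err)
```

If the empty list is indeed the cause, the fix would be on the Makefile side, for example making all-lstmf depend on the generated .lstmf files; that is a guess rather than something confirmed in this thread.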
Restarting the make file works ok.
Error: missing ground truth for training
Makefile:180: recipe for target 'data/eng/list.train' failed
make: *** [data/eng/list.train] Error 1
ubuntu@tesseract-ocr:~/tesstrain$ make -r training MODEL_NAME=eng BUILD_TYPE=Impact START_MODEL=eng MAX_ITERATIONS=400 GROUND_TRUTH_DIR=data/eng-ground-truth FONTS_DIR=/home/ubuntu/.fonts FONTS_LIST=data/eng-ground-truth/eng.fonts_list TRAINING_TEXT=data/eng-ground-truth/eng.training_text
Makefile:373: warning: overriding recipe for target '/home/ubuntu/tessdata_best/eng.traineddata'
Makefile:198: warning: ignoring old recipe for target '/home/ubuntu/tessdata_best/eng.traineddata'
tesseract data/eng-ground-truth/Impact_Condensed-65.exp0.tif data/eng-ground-truth/Impact_Condensed-65.exp0 --psm 6 --dpi 300 lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-386-gbb4c6 with Leptonica
Page 1
tesseract data/eng-ground-truth/Impact_Condensed-40.exp0.tif data/eng-ground-truth/Impact_Condensed-40.exp0 --psm 6 --dpi 300 lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-386-gbb4c6 with Leptonica
Page 1
tesseract data/eng-ground-truth/Impact_Condensed-66.exp0.tif data/eng-ground-truth/Impact_Condensed-66.exp0 --psm 6 --dpi 300 lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-386-gbb4c6 with Leptonica
Page 1
tesseract data/eng-ground-truth/Impact_Condensed-61.exp0.tif data/eng-ground-truth/Impact_Condensed-61.exp0 --psm 6 --dpi 300 lstm.train
Tesseract Open Source OCR Engine v5.0.0-alpha-386-gbb4c6 with Leptonica
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@Shreeshrii Could you pls. rebase to the current master and check whether your problem persists?
@wrznr This PR has become too unwieldy for review because of the addition of many options.
I suggest leaving this as reference for the time being and reviewing https://github.com/tesseract-ocr/tesstrain/pull/118 instead. Once that is OKed, then the rest can be added.
See https://github.com/wincentbalin/pytesstrain/blob/master/pytesstrain/cli/create_ground_truth.py for a script to create single-line ground truth files from a source file or directory.
See https://github.com/cmroughan/kraken_generated-data/blob/master/tools/count_chars.py to get a character count from the training_text.
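As a rough idea of what a character count over the training text looks like, here is a minimal Python sketch. It is not the linked script; the function name and the choice to skip whitespace are assumptions.

```python
from collections import Counter


def count_chars(text: str) -> Counter:
    """Count every non-whitespace character in the given text."""
    return Counter(ch for ch in text if not ch.isspace())


# Hypothetical sample text; in practice this would be read from the
# training_text file with open(path, encoding="utf-8").read().
counts = count_chars("aab\nb c")
for char, n in counts.most_common():
    print(char, n)
```

Characters that occur only a handful of times in the training text are the ones most likely to be under-trained, so a count like this helps decide whether more text is needed.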
Example usage for a Sanskrit language model, using script/Devanagari as the model to continue from.