tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

High error rate on training with Impact for RTL language (Kurdish), Perso-Arabic script #151 #157

Closed: sam-kurdi closed this issue 4 years ago

sam-kurdi commented 4 years ago

I am getting a very high error rate (85) after training with Impact. I started the training with the following configuration:

tesseract version:

tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

1. The workflow from https://github.com/tesseract-ocr/tesstrain is used. The command line to run the makefile is: make training MODEL_NAME=krd START_MODEL=ara LANG_TYPE=RTL FINETUNETYPE=Impact
2. I added inherited.unicharset, ara.config, kur langdata, Arabic.unicharset, and Latin.unicharset provided by https://github.com/tesseract-ocr/langdata_lstm
3. I used ara.traineddata as a start model from https://github.com/tesseract-ocr/tessdata_best
4. (1304 / 2) image lines + ground-truth transcriptions

Could you please tell me whether there is any misconfiguration? How can I improve the accuracy?

Training log:

pc1@pc:~/Desktop/tesstrain-master$ make training MODEL_NAME=krd START_MODEL=ara LANG_TYPE=RTL FINETUNETYPE=Impact

find data/krd-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/krd/all-gt"
combine_tessdata -u /home/pc1/Desktop/tesstrain-master/usr/share/tessdata/ara.traineddata data/ara/krd
Extracting tessdata components from /home/pc1/Desktop/tesstrain-master/usr/share/tessdata/ara.traineddata
Wrote data/ara/krd.config
Wrote data/ara/krd.lstm
Wrote data/ara/krd.lstm-punc-dawg
Wrote data/ara/krd.lstm-word-dawg
Wrote data/ara/krd.lstm-number-dawg
Wrote data/ara/krd.lstm-unicharset
Wrote data/ara/krd.lstm-recoder
Wrote data/ara/krd.version
Version string:4.00.00alpha:ara:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=545, offset=192
17:lstm:size=11582395, offset=737
18:lstm-punc-dawg:size=1986, offset=11583132
19:lstm-word-dawg:size=999442, offset=11585118
20:lstm-number-dawg:size=13250, offset=12584560
21:lstm-unicharset:size=5061, offset=12597810
22:lstm-recoder:size=769, offset=12602871
23:version:size=80, offset=12603640
unicharset_extractor --output_unicharset "data/krd/my.unicharset" --norm_mode 3 "data/krd/all-gt"
Bad box coordinates in boxfile string!
Extracting unicharset from plain text file data/krd/all-gt
Wrote unicharset file data/krd/my.unicharset
merge_unicharsets data/ara/krd.lstm-unicharset data/krd/my.unicharset "data/krd/unicharset"
Loaded unicharset of size 85 from file data/ara/krd.lstm-unicharset
Loaded unicharset of size 73 from file data/krd/my.unicharset
Wrote unicharset file data/krd/unicharset.
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/krd-ground-truth/17.7.png" -t "data/krd-ground-truth/17.7.gt.txt" > "data/krd-ground-truth/17.7.box"

2 Percent improvement time=100, best error was 100 @ 0
At iteration 100/100/100, Mean rms=6.343%, delta=52.311%, char train=85.617%, word train=98.507%, skip ratio=0%, New best char error = 85.617 wrote checkpoint.

Loaded 1/1 pages (1-1) of document data/krd-ground-truth/28.4.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/39.1.lstmf
(LOG REMOVED....)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/45.12.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/85.3.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/12.6.lstmf

At iteration 200/200/200, Mean rms=6.631%, delta=58.281%, char train=92.803%, word train=99.254%, skip ratio=0%, New worst char error = 92.803 wrote checkpoint.

Loaded 1/1 pages (1-1) of document data/krd-ground-truth/72.3.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/38.4.lstmf
(LOG REMOVED...)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/92.5.lstmf

At iteration 300/300/300, Mean rms=6.674%, delta=60.127%, char train=95.198%, word train=99.502%, skip ratio=0%, New worst char error = 95.198 wrote checkpoint.

Loaded 1/1 pages (1-1) of document data/krd-ground-truth/89.1.lstmf
(LOG REMOVED...)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/25.3.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/88.2.lstmf

At iteration 400/400/400, Mean rms=6.706%, delta=61.546%, char train=96.399%, word train=99.627%, skip ratio=0%, New worst char error = 96.399 wrote checkpoint.

Loaded 1/1 pages (1-1) of document data/krd-ground-truth/48.8.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/29.11.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/43.5.lstmf
(LOG REMOVED...)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/46.4.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/28.13.lstmf
At iteration 500/500/500, Mean rms=6.73%, delta=62.598%, char train=97.117%, word train=99.701%, skip ratio=0%, New worst char error = 97.117 wrote checkpoint.

Loaded 1/1 pages (1-1) of document data/krd-ground-truth/26.3.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/87.5.lstmf
(LOG REMOVED)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/94.1.lstmf

At iteration 600/600/600, Mean rms=6.746%, delta=63.418%, char train=97.598%, word train=99.751%, skip ratio=0%, New worst char error = 97.598 wrote checkpoint.

At iteration 700/700/700, Mean rms=6.757%, delta=63.932%, char train=97.941%, word train=99.787%, skip ratio=0%, New worst char error = 97.941 wrote checkpoint.

At iteration 800/800/800, Mean rms=6.758%, delta=64.207%, char train=98.198%, word train=99.813%, skip ratio=0%, New worst char error = 98.198 wrote checkpoint.

At iteration 900/900/900, Mean rms=6.753%, delta=64.287%, char train=98.398%, word train=99.834%, skip ratio=0%, New worst char error = 98.398 wrote checkpoint.

At iteration 1000/1000/1000, Mean rms=6.748%, delta=64.338%, char train=98.559%, word train=99.851%, skip ratio=0%, New worst char error = 98.559 wrote checkpoint.

At iteration 1100/1100/1100, Mean rms=6.78%, delta=65.498%, char train=99.997%, word train=100%, skip ratio=0%, New worst char error = 99.997 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/85.4.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/97.7.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/35.8.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/42.8.lstmf
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/77.1.lstmf
At iteration 1200/1200/1200, Mean rms=6.772%, delta=65.877%, char train=99.998%, word train=100%, skip ratio=0%, New worst char error = 99.998 wrote checkpoint.

Loaded 1/1 pages (1-1) of document data/krd-ground-truth/24.15.lstmf
(LOG REMOVED...)
Loaded 1/1 pages (1-1) of document data/krd-ground-truth/53.1.lstmf

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1300/1300/1300, Mean rms=6.764%, delta=65.956%, char train=99.999%, word train=100%, skip ratio=0%, New worst char error = 99.999
At iteration 1100, stage 0, Eval Char error rate=100, Word error rate=100 wrote checkpoint.

At iteration 1400/1400/1400, Mean rms=6.753%, delta=65.951%, char train=99.999%, word train=100%, skip ratio=0%, wrote checkpoint.

(LOG REMOVED....)

At iteration 19817/19900/19900, Mean rms=5.893%, delta=41.14%, char train=93.046%, word train=99.661%, skip ratio=0%, wrote checkpoint.

At iteration 19917/20000/20000, Mean rms=5.891%, delta=41.096%, char train=92.917%, word train=99.605%, skip ratio=0%, wrote checkpoint.

Finished! Error rate = 85.617
lstmtraining \
  --stop_training \
  --continue_from data/krd/checkpoints/krd_checkpoint \
  --traineddata data/krd/krd.traineddata \
  --model_output data/krd.traineddata
Loaded file data/krd/checkpoints/krd_checkpoint, unpacking...
pc1@pc:~/Desktop/tesstrain-master$

Shreeshrii commented 4 years ago

Start from script/Arabic rather than ara.

Currently your unicharset is increasing from 85 to over 100. This is not suitable for fine-tuning.
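
A quick way to see where the growth comes from is to compare the characters in the ground truth against the start model's extracted unicharset. A minimal Python sketch (the paths follow the training log above; it assumes the usual unicharset format of a count line followed by one entry per line, with the glyph in the first field):

```python
# Sketch: list ground-truth characters missing from the start model's unicharset.
gt_chars = set()
with open("data/krd/all-gt", encoding="utf-8") as f:
    for line in f:
        gt_chars.update(line.rstrip("\n"))

model_chars = set()
with open("data/ara/krd.lstm-unicharset", encoding="utf-8") as f:
    next(f)  # the first line holds the number of entries
    for line in f:
        fields = line.split()
        if fields:
            model_chars.add(fields[0])  # glyph is the first field; the space entry may need special care

missing = sorted(c for c in gt_chars - model_chars if not c.isspace())
print("characters in ground truth but not in start model:", missing)
```

Every missing character forces the merged unicharset to grow, which is what makes plain fine-tuning unsuitable here.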

sam-kurdi commented 4 years ago

> Start from script/Arabic rather than ara. Currently your unicharset is increasing from 85 to over 100. This is not suitable for fine-tuning.

Thank you for your support.

Is the link below the correct script for LSTM training? https://github.com/tesseract-ocr/tessdata_best/raw/master/script/Arabic.traineddata

I am using PSM 6. Which PSM do you recommend?

How can I solve this issue (Normalization failed for string)?

Shreeshrii commented 4 years ago

Yes, tessdata_best/script/Arabic is preferable.

For single lines, I suggest using --psm 13.

Please make sure that correct RTL processing is happening in reversal of text for box files.
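
(For context: --psm 13 is Tesseract's "raw line" mode, which treats the image as a single text line and bypasses layout analysis, so it matches single-line training images; a typical invocation would be `tesseract line.png out --psm 13`, with hypothetical file names.)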

Shreeshrii commented 4 years ago

See https://github.com/tesseract-ocr/tesstrain/pull/137

sam-kurdi commented 4 years ago

> Yes, tessdata_best/script/Arabic is preferable.
>
> For single lines, I suggest using --psm 13.
>
> Please make sure that correct RTL processing is happening in reversal of text for box files.

Thank you, I will do that.

Changes to generate_wordstr_box.py are as follows:

```python
# create WordStr line boxes for Indic & RTL
for line in lines:
    line = unicodedata.normalize('NFC', line.strip())
    if args.rtl:
        # FIXME: This should not be necessary. Compare with e.g. kraken
        line = line.translate(str.maketrans("()[]{}»«><", ")(][}{«»<>"))
    if line:
        print("WordStr 0 0 %d %d 0 #%s" % (width, height, line))
        print("\t 0 0 %d %d 0" % (width, height))
```

Is this the correct modification?

Shreeshrii commented 4 years ago

@sam-kurdi The text in the box file needs to be reversed using the bidi algorithm. Regarding the reversed punctuation marks, ( ) [ ] etc., please check whether that is needed or not.
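
For reference, a minimal sketch of such bidi reordering with the python-bidi package (the same call stweil mentions later in this thread). Note that get_display also applies the bidi mirroring step, so it may make a manual bracket-swapping table redundant, which is worth verifying on real data:

```python
import unicodedata

from bidi.algorithm import get_display  # pip install python-bidi

def visual_order(line):
    """Normalize a ground-truth line and reorder it to visual order for the box file."""
    line = unicodedata.normalize('NFC', line.strip())
    # get_display applies the Unicode bidi algorithm, so RTL text (and
    # mirrored punctuation such as parentheses) comes out in the order
    # that matches the rendered image.
    return get_display(line)
```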

I will upload a new training for ckb that I have done, and you can check whether the results are as expected on real-life images. It gives over 95% accuracy with lstmeval on single-line images similar to those used for training.

sam-kurdi commented 4 years ago

@Shreeshrii The above modification reverses only punctuation, parentheses, etc., and I got the same error with it. You are correct, the text in the box file needs to be reversed. Thank you; please let me know as soon as you have updated.

Shreeshrii commented 4 years ago

Please see new PR https://github.com/tesseract-ocr/tesstrain/pull/159/commits

stweil commented 4 years ago

My first experience with training Arabic handwriting is documented here. The training is still running. I used the old generate_wordstr_box.py with an added `line = bidi.algorithm.get_display(line)`.
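
Applied to the loop quoted earlier, that addition would presumably sit right after the normalization step. A sketch (the exact placement in stweil's local copy is not shown in this thread; `lines`, `width`, and `height` come from the surrounding script):

```python
import unicodedata

import bidi.algorithm

for line in lines:
    line = unicodedata.normalize('NFC', line.strip())
    line = bidi.algorithm.get_display(line)  # reorder to visual order for the box file
    if line:
        print("WordStr 0 0 %d %d 0 #%s" % (width, height, line))
        print("\t 0 0 %d %d 0" % (width, height))
```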

Shreeshrii commented 4 years ago

> After one epoch, the CER is at about 46 %. With sufficient training (200 epochs, about 32 hours), the CER falls below 5 %.

@stweil How is the EPOCH defined? Are you using a custom version of Makefile?

stweil commented 4 years ago

1 epoch = 1 iteration over all training data. It is commonly used for training of neural networks, but up to now not for Tesseract training.

Yes, this is currently a local custom version of Makefile which calculates MAX_ITERATIONS from EPOCHS:

@@ -49,8 +51,16 @@ TESSDATA_REPO = _best
 # Ground truth directory. Default: $(GROUND_TRUTH_DIR)
 GROUND_TRUTH_DIR := $(OUTPUT_DIR)-ground-truth

+# Epochs. Default: $(EPOCHS)
+EPOCHS :=
+
 # Max iterations. Default: $(MAX_ITERATIONS)
+ifeq ($(EPOCHS),)
 MAX_ITERATIONS := 10000
+else
+MAX_ITERATIONS := $(shell echo $$(($(EPOCHS) * $$(wc -l < $(OUTPUT_DIR)/list.train))))
+endif

 # Debug Interval. Default:  $(DEBUG_INTERVAL)
 DEBUG_INTERVAL := 0
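
With that change, an invocation like `make training MODEL_NAME=krd EPOCHS=200` (values chosen only for illustration) derives MAX_ITERATIONS as 200 times the number of lines in list.train. If the (1304 / 2) image lines mentioned at the top of this issue correspond to 652 line/transcription pairs, that would give 200 × 652 = 130400 iterations.
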
stweil commented 4 years ago

I updated https://github.com/tesseract-ocr/tesstrain/wiki/Arabic-Handwriting#training to explain what "epoch" means in the context of that training.

Shreeshrii commented 4 years ago

@stweil Thanks. Calculating MAX_ITERATIONS from EPOCHS is a good addition.

Since you are testing RTL, it will be interesting to see Tesseract results for https://github.com/OpenITI/OCR_GS_Data - maybe you can do a run for those too. I had tried a test earlier, but I changed too many things for it to be a valid comparison to their results.

Shreeshrii commented 4 years ago

> I used the old generate_wordstr_box.py with an added `line = bidi.algorithm.get_display(line)`.

@stweil Please check that your custom Makefile is using generate_wordstr_box.py. The Makefile currently in tesstrain master is using generate_line_box.py for RTL.

PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/krd-ground-truth/17.7.png" -t "data/krd-ground-truth/17.7.gt.txt" > "data/krd-ground-truth/17.7.box"

stweil commented 4 years ago

I had called generate_wordstr_box.py manually before running make.

sam-kurdi commented 4 years ago

> Please see new PR https://github.com/tesseract-ocr/tesstrain/pull/159/commits

Thank you very much, that helps a lot.

I have tested with the modified Makefile version; it works fine. The final error rate with my own dataset is 2.33.

@Shreeshrii @stweil The main problem in the recognized images is that the zero-width non-joiner (ZWNJ) and Western Arabic numerals are not handled properly, whereas Eastern Arabic numerals are handled properly. Any suggestions for solving this issue with the new model?

Is it mandatory that the image lines and the corresponding ground truth use the same font? I am also wondering if you could mention how you generated the provided dataset.

Shreeshrii commented 4 years ago

> Is it mandatory that the image lines and the corresponding ground truth use the same font?

Ground truth should be in Unicode text format and can be rendered in any Unicode font. So the font does not really matter for ground truth, as long as it is not a legacy non-Unicode font.

The test dataset was extracted from synthetic training data generated using Unicode text and fonts. I think rtltest.tgz has images in Unikurd-Jino font.

Shreeshrii commented 4 years ago

Is ZWNJ being used in certain character combinations?

sam-kurdi commented 4 years ago

> Is ZWNJ being used in certain character combinations?

Yes, it is used in many GT files. Also, Western Arabic numerals are not handled properly, whereas Eastern Arabic numerals are handled properly.
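
To gauge how often ZWNJ actually occurs in the training text, a short count over the ground-truth files can help; a minimal sketch (the directory matches the training log earlier in this thread):

```python
import glob

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

total = 0
for path in glob.glob("data/krd-ground-truth/*.gt.txt"):
    with open(path, encoding="utf-8") as f:
        n = f.read().count(ZWNJ)
    if n:
        print(f"{path}: {n} ZWNJ")
        total += n
print("total ZWNJ occurrences:", total)
```

If ZWNJ appears often, it needs to be present in the unicharset and well represented in the training lines; the same applies to Western Arabic numerals.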

Shreeshrii commented 4 years ago

> The final error rate with my own dataset is 2.33.

Your training data has a very limited number of lines. Try with more training data and include more samples of the characters that are in error.

sam-kurdi commented 4 years ago

> The final error rate with my own dataset is 2.33.

> Your training data has a very limited number of lines. Try with more training data and include more samples of the characters that are in error.

I will prepare more training data. What about ZWNJ and WAN?

sam-kurdi commented 4 years ago

@Shreeshrii Error rate = 4.007 after training with rtltest-ground-truth.

Shreeshrii commented 4 years ago

> I have tested with the modified Makefile version; it works fine. The final error rate with my own dataset is 2.33.

> Error rate = 4.007 after training with rtltest-ground-truth.

Yes, the error rate will depend on the number of iterations as well as on the number of lines of training data.

How many lines of text are there in your training set?

sam-kurdi commented 4 years ago

> I have tested with the modified Makefile version; it works fine. The final error rate with my own dataset is 2.33.

> Error rate = 4.007 after training with rtltest-ground-truth.

> Yes, the error rate will depend on the number of iterations as well as on the number of lines of training data.
>
> How many lines of text are there in your training set?

550 image lines.

sam-kurdi commented 4 years ago

@Shreeshrii @stweil @theraysmith Any suggestions for training/fine-tuning Tesseract OCR LSTM for new fonts with the makefile, utilizing the tesstrain improvements for RTL?

Shreeshrii commented 4 years ago

> Western Arabic numerals are not handled properly, whereas Eastern Arabic numerals are handled properly.

Please clarify what you mean by WAN: is it 0-9 or Farsi numerals?

I assume EAN means numerals in Arabic script?

sam-kurdi commented 4 years ago

> Western Arabic numerals are not handled properly, whereas Eastern Arabic numerals are handled properly.

> Please clarify what you mean by WAN: is it 0-9 or Farsi numerals?

> I assume EAN means numerals in Arabic script?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.