Open yixinuestc opened 5 years ago
You need to remove your training directory before a new run otherwise it will continue from previously existing checkpoints and therefore not do any new training.
I have removed ~/tesstutorial/chi_sim_train and ~/tesstutorial/chi_sim_tuned_from_chi_sim folders before I retrain a new model.
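A minimal sketch of a guard for the point above, assuming checkpoints are files whose names end in "checkpoint" (the suffix used by the chi_sim_layer_checkpoint file in the commands of this thread). The function name has_checkpoints and the directory are illustrative, not part of tesseract.

```shell
# Return success if the directory holds any file ending in "checkpoint".
# Assumption: leftover checkpoints follow the <model_output>_checkpoint naming.
has_checkpoints() (
  for f in "$1"/*checkpoint; do
    [ -e "$f" ] && exit 0
  done
  exit 1
)

# Hypothetical output directory; adjust to wherever --model_output points.
out_dir="$HOME/tesstutorial/chi_sim_layer"
if has_checkpoints "$out_dir"; then
  echo "Checkpoints found in $out_dir: a new run would resume from them." >&2
fi
```

Running this before lstmtraining makes it obvious when an old run would be continued instead of a fresh one being started.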
The letter that you have shown ("Ø") is not a Greek letter. It is U+00D8, LATIN CAPITAL LETTER O WITH STROKE: https://www.compart.com/en/unicode/U+00D8
see https://en.wikipedia.org/wiki/%C3%98
This article is about the Scandinavian letter. For other uses, see Ø (disambiguation). Not to be confused with slashed 0, ∅, ф, or Φ.
Are you sure that your image is not conveying the slashed zero?
the slashed zero is sometimes approximated by overlaying zero and slash characters, producing the character "0̸"
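Because Ø, ∅, ф, Φ and the slashed zero look alike, it can help to print the code points of the characters that are actually in the training text. The following is a sketch only, assuming iconv and od are available and the input is UTF-8; the function name codepoints is made up for illustration.

```shell
# Print one "U+XXXX" line per character of a UTF-8 string.
codepoints() (
  # After conversion to UTF-32BE every character is four big-endian bytes.
  bytes=$(printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -v -tx1)
  # Split the hex dump into one positional parameter per byte.
  set -- $bytes
  while [ "$#" -ge 4 ]; do
    printf 'U+%04X\n' "0x$1$2$3$4"
    shift 4
  done
)

# The letter from the image, assuming this script is saved as UTF-8.
codepoints 'Ø'   # prints U+00D8
```

The same call on ∅, ф and Φ prints U+2205, U+0444 and U+03A6, and a zero overlaid with a combining slash prints two lines, U+0030 and U+0338.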
My image is not conveying the slashed zero. If the "Ø" in the image is a Latin character, why can't it be recognized? Sometimes it is recognized as "4" and sometimes as "0". I checked the generated .tif files, and they contain the character "Ø".
There must be some error in your training process. I am getting correct result:
tesseract 2382.png - -l chi_sim_layer_int --dpi 300
Q345Ø426X12
Integer/Fast traineddata is attached - chi_sim_layer_int.zip
Thanks. I have tried your traineddata file and it works. By the way, why do you call it "Integer/Fast" traineddata? Did you fine-tune from tessdata_fast/chi_sim.traineddata?
Did you fine-tune from tessdata_fast/chi_sim.traineddata?
NO. _fast models can't be used for fine tuning.
But trained models can be converted to integer format; the result is smaller and faster, and maybe slightly less accurate.
Use the --convert_to_int flag together with --stop_training.
Thank you very much. Last question: Did you use the same font list as mine during training?
#!/bin/bash
rm -rf ~/tesstutorial/chi_sim_train
~/tesseract/src/training/tesstrain.sh \
--fonts_dir ~/.fonts \
--training_text ~/langdata_lstm/chi_sim/chi_sim.finetune.training_text \
--langdata_dir ~/langdata_lstm \
--tessdata_dir ~/tesseract/tessdata \
--lang chi_sim --linedata_only \
--noextract_font_properties \
--exposures "0" \
--maxpages 0 \
--workspace_dir ~/tmp \
--save_box_tiff \
--fontlist \
"NSimSun" \
"Arial Unicode MS" \
"SimSun" \
"Noto Sans CJK SC" \
"Noto Sans Mono CJK SC" \
--output_dir ~/tesstutorial/chi_sim_train
rm -rf ~/tesstutorial/chi_sim_layer
mkdir ~/tesstutorial/chi_sim_layer
combine_tessdata -e ~/tessdata_best/chi_sim.traineddata ~/tesstutorial/chi_sim_layer/chi_sim.lstm
lstmtraining \
--model_output ~/tesstutorial/chi_sim_layer/chi_sim_layer \
--continue_from ~/tesstutorial/chi_sim_layer/chi_sim.lstm \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--append_index 5 --net_spec '[Lfx128 O1c1]' \
--train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
--debug_interval -1 \
--max_image_MB 6000 \
--max_iterations 6000
~/tesseract/bin/src/training/lstmtraining \
--stop_training \
--continue_from ~/tesstutorial/chi_sim_layer/chi_sim_layer_checkpoint \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--model_output ~/tesstutorial/chi_sim_layer/chi_sim_layer.traineddata
~/tesseract/bin/src/training/lstmtraining \
--stop_training \
--convert_to_int \
--continue_from ~/tesstutorial/chi_sim_layer/chi_sim_layer_checkpoint \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--model_output ~/tesstutorial/chi_sim_layer/chi_sim_layer_int.traineddata
cp ~/tesstutorial/chi_sim_layer/*.traineddata ~/tessdata_best/
ls -l ~/tesstutorial/chi_sim_layer/*.traineddata
Thank you. I see.
@Shreeshrii I also want to add the symbols ∅, ф, Φ to the pretrained eng.traineddata. The steps I followed: I cloned your repository https://github.com/Shreeshrii/tess4training, added lines containing ∅, ф, Φ to eng.training_text, and then ran 8-makedata_layernew.sh and 9-layernew.sh.
The model now recognizes ∅, ф, Φ in some places, but recognition of simple English words has become worse.
change eng.training_text add ∅, ф, Φ this symbol into eng.training_text
How big is your training text? For Replace layer training you need a large te
Your model is probably getting overfitted to training data.
@Shreeshrii I only added 15 lines containing the symbols ∅, ф, Φ to eng.training_text, nothing more.
As per Ray that's enough if you're doing finetuning plus-minus type training.
To replace the top layer in network you need a larger training text.
My example shows the method for adding a lot of different characters.
You have to use an appropriately large text from langdata_lstm repo and add your characters to it. Also make sure that the fonts you are using can render them.
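One way to check whether installed fonts can render a given character is to ask fontconfig which families cover its code point. This is a sketch under the assumption that fc-list is installed and that the fonts used for training are visible to fontconfig; the helper names charset_pattern and fonts_covering are invented for this example.

```shell
# Turn a code point such as "U+00D8" into a fontconfig charset pattern.
charset_pattern() (
  hex=$(printf '%s' "$1" | sed 's/^[Uu]+//' | tr 'A-F' 'a-f')
  printf ':charset=%s\n' "$hex"
)

# List the font families that fontconfig reports as covering the code point.
# Requires fc-list from fontconfig.
fonts_covering() {
  fc-list "$(charset_pattern "$1")" family | sort -u
}
```

For example, `fonts_covering U+00D8` lists candidate families for Ø; a font from the --fontlist that does not appear in the output is a likely source of missing glyphs in the generated .tif files.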
On Fri, May 15, 2020, 10:50 Kumar Rajwani notifications@github.com wrote:
@Shreeshrii https://github.com/Shreeshrii i only add 15 lines which contains ∅, ф, Φ this symbol into eng.training_text nothing much
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/2382#issuecomment-629033471, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABG37I4C3I3DUQOUVULXENLRRTGLVANCNFSM4HEPCAJA .
@Shreeshrii I am trying plus-minus fine-tuning with your repository https://github.com/Shreeshrii/tess4training.
I saw that you add text in 5-makedata_plusminus.sh, so I added my text in that file like below:
cat <<EOM >>../langdata/eng/eng.plusminus.training_text
Ø TRADEMARKS §120.871 Gilmore, FREE More Number Low trying AWARD, (‘Beaver
Ø History, 60 ¥ CONSPIRACY Jack ......... influence 15 of From It more in few LLC be is 24, find with 3 ” University you __ now! good
You Ø | Silva, OPEN FRESNO groups. integrated 14. Map Metals
EOM
I am getting this error when I run 6-plusminus.sh:
6-plusminus.sh: line 23: 8686 Segmentation fault (core dumped) lstmtraining --model_output ../tesstutorial/trainplusminus/plusminus --continue_from ../tesstutorial/trainplusminus/eng.lstm --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata --old_traineddata tessdata/best/eng.traineddata --train_listfile ../tesstutorial/trainplusminus/eng.training_files.txt --max_iterations 3600 Past Result Range
Please reply.
Are all the paths to files correct?
Are you using eng.traineddata from tessdata_best?
Is eng.lstm extracted from the tessdata_best file?
@Shreeshrii It works fine if I run your 5-makedata_plusminus.sh and 6-plusminus.sh without changes. But when I change 5-makedata_plusminus.sh to add my text, it fails with a segmentation fault.
You can modify the text outside of the script. I did it inside the script to automate the process and highlight the changes.
Check that the earlier process completes without error and creates the required files for running lstmtraining in next step.
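The check described above can be sketched as a small function that confirms every file named in the training list exists and is non-empty before lstmtraining is started. It assumes the list file has one path per line and that relative paths are resolved from the current directory; check_listfile is a made-up name.

```shell
# Verify that a --train_listfile and every file it names exist and are non-empty.
check_listfile() (
  listfile=$1
  if [ ! -s "$listfile" ]; then
    echo "missing or empty list file: $listfile" >&2
    exit 1
  fi
  status=0
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while IFS= read -r f || [ -n "$f" ]; do
    [ -z "$f" ] && continue
    if [ ! -s "$f" ]; then
      echo "missing or empty training file: $f" >&2
      status=1
    fi
  done < "$listfile"
  exit "$status"
)
```

In this thread it would be called as `check_listfile ../tesstutorial/trainplusminus/eng.training_files.txt` from the directory where 6-plusminus.sh is run.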
@Shreeshrii "You can modify the text outside of the script." Does that mean I should change eng.training_text?
Please look at this notebook to see what I did:
https://github.com/kbrajwani/learn/blob/master/Untitled8.ipynb
What is the version of tesseract that you are using? Post output of tesseract -v
Try running gdb to get more details of the segmentation fault?
@stweil - Any other suggestions? lstmtraining is segfaulting after reading the first image.
@Shreeshrii I am running this notebook in Colab and I don't know how to use gdb.
Output of tesseract -v:
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
Please install latest version of tesseract, either build from source from GitHub master or use Alex's ppa if using Ubuntu.
Please share the new text you are adding. Let me see if I can replicate the error.
@Shreeshrii I made this notebook: https://github.com/kbrajwani/learn/blob/master/Untitled8.ipynb
It contains the new text and everything else I did. Thanks.
I cannot replicate it on tesseract version 5.00 alpha. See attached log.txt.
Relevant portion below:
Extracting tessdata components from tessdata/best/eng.traineddata
Wrote ../tesstutorial/trainplusminus/eng.lstm
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
ubuntu@tesseract-ocr:~/tess4training$ bash 6-plusminus.sh
***** Run lstmtraining with debug output for first 100 iterations.
Loaded file ../tesstutorial/trainplusminus/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 114!
Num (Extended) outputs,weights in Series:
1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx96:96, 61824
Lrx96:96, 74112
Lfx512:512, 1247232
Fc114:114, 58482
Total weights = 1462546
Previous null char=110 mapped to 113
Continuing from ../tesstutorial/trainplusminus/eng.lstm
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Arial_Bold.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Courier_New_Bold.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Arial.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Courier_New.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Arial_Bold_Italic.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Arial_Italic.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Courier_New_Bold_Italic.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.DejaVu_Sans_Ultra-Light.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Courier_New_Italic.exp0.lstmf
Iteration 0: GROUND TRUTH : PhD-presenting MERGE REGULATION Irish Ø *P<0.05. REACHED Tampa HOME Feedback
Iteration 0: BEST OCR TEXT : PhD-presenting MERGE REGULATION Irish @ *P<0.05. REACHED Tampa HOME Feedback
File /tmp/eng-2020-05-19.OFI/eng.Arial_Bold.exp0.lstmf line 7 :
Mean rms=0.671%, delta=0.476%, train=2.632%(10%), skip ratio=0%
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Georgia_Bold.exp0.lstmf
Iteration 1: GROUND TRUTH : netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
Iteration 1: BEST OCR TEXT : netting Bookmark of WE MORE) STRENGTH IDENTICAL 12? activity PROPERTY MAINTAINED
File /tmp/eng-2020-05-19.OFI/eng.Arial_Bold_Italic.exp0.lstmf line 27 :
Mean rms=0.591%, delta=0.351%, train=2.566%(9.545%), skip ratio=0%
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Georgia_Bold_Italic.exp0.lstmf
Iteration 2: GROUND TRUTH : and first << article XML in NFL €] following 6 then and a know system Free 08 £20 years see
File ../tesstutorial/trainplusminus/eng.Arial.exp0.lstmf line 0 (Perfect):
Mean rms=0.449%, delta=0.234%, train=1.711%(6.364%), skip ratio=0%
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Georgia.exp0.lstmf
@Shreeshrii Did you add my text to eng.training_text or to 5-makedata_plusminus.sh?
And please help me install tesseract version 5.00 alpha too.
@Shreeshrii I am now one step further.
I am able to start training, but the results have become much worse. See the training output in the notebook https://github.com/kbrajwani/learn/blob/master/Untitled8.ipynb
This is how my training is going, while your log.txt looks fine. Why?
Iteration 19: GROUND TRUTH : PENALTY. HAKATA (QUOTATIONS) Ø WeatherAlarmTM THOROUGHLY. EzineArticles
Iteration 19: ALIGNED TRUTH : tititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititit
Iteration 19: BEST OCR TEXT : tititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititit
Please check whether I have installed tesseract version 5.00 alpha correctly.
The log looks like training is starting from scratch. Let it run for about 3000 iterations and see what error rate you get.
Not sure, why this is happening. Maybe accessing some old file ..
I had
***** Run lstmtraining with debug output for first 100 iterations.
Loaded file ../tesstutorial/trainplusminus/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 114!
Num (Extended) outputs,weights in Series:
1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx96:96, 61824
Lrx96:96, 74112
Lfx512:512, 1247232
Fc114:114, 58482
Total weights = 1462546
Previous null char=110 mapped to 113
You have different values. Looks like somewhere a different traineddata file is being used.
@Shreeshrii
Thanks for replying so quickly. I am new to training tesseract, so I don't understand where these errors come from. If you are able to train on my data, could you please make a Colab notebook? I am training tesseract using Colab.
Please open my notebook
https://github.com/kbrajwani/learn/blob/master/Untitled8.ipynb
in
https://colab.research.google.com/
In the notebook I have listed the GitHub links that all the files come from, so you can understand the problem. If you could make a new notebook, that would help me a lot.
The problem is in this step: "Text data add into 5-makedata_plusminus.sh"
How are you adding this text? The text needs to be saved as UTF-8 with UNIX EOL. Alternatively, you can create a new file with your training_text and use it.
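The encoding and line-ending requirement can be checked before the training data is generated. A minimal sketch, assuming iconv and grep are available; check_training_text is a hypothetical helper, not a tesseract tool.

```shell
# Succeed only if the file is valid UTF-8 and has no carriage returns.
check_training_text() (
  f=$1
  # iconv exits with an error when the input is not valid UTF-8.
  if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
    echo "$f: not valid UTF-8" >&2
    exit 1
  fi
  # A carriage return anywhere indicates Windows (CRLF) line endings.
  cr=$(printf '\r')
  if grep -q "$cr" "$f"; then
    echo "$f: contains carriage returns (not UNIX EOL)" >&2
    exit 1
  fi
)
```

It would be run as `check_training_text ../langdata/eng/eng.plusminus.training_text`; if it reports carriage returns, a tool such as dos2unix can convert the file.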
@kbrajwani Looks like you have identified a bug. While I can run the training on my Ubuntu machine, it is failing in the Colab environment.
@stweil Please take a look when possible. See the following link, running tesseract on colab installed using Alex's ppa (AVX, AVX2 etc). Training starts as if from scratch, ignoring the startmodel. Same commands on ppc64le work as in plusminus training.
https://colab.research.google.com/drive/11NLa-52H-ofQHTN8ZVKvb9zUsdOpkDDX?usp=sharing
@stweil @amitdo
I have checked the locale and also tried setting --sequential_training, but training starts with different lines and there is a vast difference in the error rates: on the 3rd iteration, 0.676% in my environment and 191.749% on colab.
On my environment,
ubuntu@tesseract-ocr:~/tess4training$ uname -a
Linux tesseract-ocr 5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:15 UTC 2020 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@tesseract-ocr:~/tess4training$ tesseract -v
tesseract 5.0.0-alpha-595-gccb9
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found OpenMP 201511
Total weights = 1462033
Previous null char=110 mapped to 112
Continuing from ../tesstutorial/trainplusminustheta/eng.lstm
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold_Italic.exp0.lstmf
Iteration 0: GROUND TRUTH : Ø TRADEMARKS §120.871 Gilmore, FREE More Number Low trying AWARD, ('Beaver
Iteration 0: BEST OCR TEXT : @ TRADEMARKS §120.871 Gilmore, FREE More Number Low trying AWARD, ('Beaver
File /tmp/eng-2020-05-25.fhP/eng.Arial_Bold.exp0.lstmf line 33 :
Mean rms=0.734%, delta=0.811%, train=2.703%(9.091%), skip ratio=0%
Iteration 1: GROUND TRUTH : or SC used By October Technology City And Business could Services (1) in Services 12 for
File ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf line 1 (Perfect):
Mean rms=0.434%, delta=0.405%, train=1.351%(4.545%), skip ratio=0%
Iteration 2: GROUND TRUTH : does YOU OH 30 them its 1 comments are November URL Reply of a San'a' I've some The to:
File ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf line 2 :
Mean rms=0.424%, delta=0.366%, train=0.901%(3.03%), skip ratio=0%
Iteration 3: GROUND TRUTH : 2003 password? new News [+] will through their Your of both find Sign first In article .
File /tmp/eng-2020-05-25.fhP/eng.Arial_Bold.exp0.lstmf line 3 (Perfect):
Mean rms=0.363%, delta=0.274%, train=0.676%(2.273%), skip ratio=0%
On colab
Linux 3a6dd4b34ac1 4.19.104+ #1 SMP Wed Feb 19 05:26:34 PST 2020 x86_64 x86_64 x86_64 GNU/Linux
tesseract 5.0.0-alpha-671-g27d51
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found OpenMP 201511
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
Total weights = 1462033
Previous null char=110 mapped to 112
Continuing from ../tesstutorial/trainplusminustheta/eng.lstm
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold_Italic.exp0.lstmf
Iteration 0: GROUND TRUTH : You Ø | Silva, OPEN FRESNO groups. integrated 14. Map Metals
Iteration 0: BEST OCR TEXT : You @ | Silva, OPEN FRESNO groups. integrated 14. Map Metals
File /tmp/eng-2020-05-25.hxE/eng.Arial_Bold.exp0.lstmf line 49 :
Mean rms=0.821%, delta=0.678%, train=3.333%(9.091%), skip ratio=0%
Iteration 1: GROUND TRUTH : Avoidance Moosejaw pm* Ø18 note: PROBE Jailbroken RAISE Fountains Write Goods (Ø6)
Iteration 1: ALIGNED TRUTH : B
Iteration 1: BEST OCR TEXT : B
File /tmp/eng-2020-05-25.hxE/eng.Arial_Bold.exp0.lstmf line 49 (Perfect):
Mean rms=-2.14748e+06%, delta=0.339%, train=51.057%(54.545%), skip ratio=0%
Iteration 2: GROUND TRUTH : Heartbreakers (1976). {Lukevics:June Page Vandread Beauty @ ¥ Ø away ON
Iteration 2: ALIGNED TRUTH : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBu
Iteration 2: BEST OCR TEXT : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBu
File /tmp/eng-2020-05-25.hxE/eng.Arial_Bold.exp0.lstmf line 7 (Perfect):
Mean rms=-2.14748e+06%, delta=0.226%, train=145.775%(69.697%), skip ratio=0%
Iteration 3: GROUND TRUTH : and first << article XML in NFL €] following 6 then and a know system Free 08 £20 years see
Iteration 3: ALIGNED TRUTH : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBu
Iteration 3: BEST OCR TEXT : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBu
File /tmp/eng-2020-05-25.hxE/eng.Arial_Bold.exp0.lstmf line 30 (Perfect):
Mean rms=-2.14748e+06%, delta=0.169%, train=191.749%(77.273%), skip ratio=0%
You can try and replicate with
git clone https://github.com/Shreeshrii/tess4training.git
cd tess4training
bash 5-makedata_plusminustheta.sh
bash 6-plusminustheta.sh
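When comparing runs across machines, the first iteration at which the colab log goes wrong can be found mechanically. The sketch below assumes the log format shown in this thread ("Iteration N: ..." lines followed by "Mean rms=...%" lines) and treats a negative rms value as the sign of a broken run; that criterion is a heuristic chosen for this example, and first_bad_iteration is a made-up name.

```shell
# Print the first iteration whose "Mean rms" value is negative.
# Exit status is 0 if such an iteration was found, 1 otherwise.
first_bad_iteration() {
  awk '
    /^Iteration [0-9]+:/ { sub(":", "", $2); iter = $2 }
    /^Mean rms=/ {
      rms = $2
      sub(/^rms=/, "", rms)   # drop the "rms=" prefix
      sub(/%.*/, "", rms)     # drop the trailing "%,"
      if (rms + 0 < 0) { print iter; found = 1; exit }
    }
    END { exit(found ? 0 : 1) }
  ' "$1"
}
```

Saving the lstmtraining output to a file and running `first_bad_iteration train.log` on the colab log above would print 1, while the ppc64le log produces no output.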
@amitdo Why are you closing this without even a response? Training should work similarly on all platforms and there is vast difference in results here using the same data.
Shree,
Sorry for closing the issue (without a response), my mistake.
I checked on a different machine just now.
shree@sanskrit:~/tess4training$ uname -a
Linux sanskrit 4.4.0-148-generic #174~14.04.1-Ubuntu SMP Thu May 9 08:17:37 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
shree@sanskrit:~/tess4training$ tesseract -v
tesseract 4.1.1-rc2-21-gf4ef
leptonica-1.76.0
libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0 : libopenjp2 2.3.0
Found libarchive 3.1.2
...
Previous null char=110 mapped to 112
Continuing from ../tesstutorial/trainplusminustheta/eng.lstm
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold_Italic.exp0.lstmf
Iteration 0: GROUND TRUTH : such (4) 3 view Business R25 Click other {/if} This PATENTSCOPE® your £ Profile different iPod
File /tmp/eng-2020-05-27.Fqo/eng.Arial_Bold.exp0.lstmf line 141 (Perfect):
Mean rms=0.167%, delta=0%, train=0%(0%), skip ratio=0%
Iteration 1: GROUND TRUTH : 31 which COPYRIGHT DVDs out group April including just place 18 service Articles as could these
File /tmp/eng-2020-05-27.Fqo/eng.Arial_Bold.exp0.lstmf line 1 (Perfect):
Mean rms=0.156%, delta=0%, train=0%(0%), skip ratio=0%
Iteration 2: GROUND TRUTH : FITTING Tape company. Featured BOOK has PSYCHOTIC Ø CONTENT permeable LATVIA
Iteration 2: BEST OCR TEXT : FITTING Tape company. Featured BOOK has PSYCHOTIC @ CONTENT permeable LATVIA
File ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf line 2 :
Mean rms=0.439%, delta=0.407%, train=0.877%(3.03%), skip ratio=0%
Iteration 3: GROUND TRUTH : top days Login this 2004 & - said first 27 then 2. $100 they FIG. [1] (GeneRIF) World and ABOUT
File /tmp/eng-2020-05-27.Fqo/eng.Arial_Bold.exp0.lstmf line 45 (Perfect):
Mean rms=0.365%, delta=0.305%, train=0.658%(2.273%), skip ratio=0%
But it is getting very different results on colab, which progressively get worse.
Linux 9bc150e0637f 4.19.104+ #1 SMP Wed Feb 19 05:26:34 PST 2020 x86_64 x86_64 x86_64 GNU/Linux
tesseract 4.1.1-rc2-21-gf4ef
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
Previous null char=110 mapped to 112
Continuing from ../tesstutorial/trainplusminustheta/eng.lstm
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold_Italic.exp0.lstmf
Iteration 0: GROUND TRUTH : of Go jobs describe ø Landry *80Ø/min (+10 CLASS what # him SONS, ON href=""
Iteration 0: BEST OCR TEXT : of Go jobs describe o Landry *80@/min (+10 CLASS what # him SONS, ON href=""
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 88 :
Mean rms=1.091%, delta=1.563%, train=5.263%(13.333%), skip ratio=0%
Iteration 1: GROUND TRUTH : have Perhaps Big Windows I've Ø CHURCH'S FEMINIST Hate Mon-Sat PARKER OF
Iteration 1: ALIGNED TRUTH : B
Iteration 1: BEST OCR TEXT : B
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 34 (Perfect):
Mean rms=-2.14748e+06%, delta=0.781%, train=51.937%(56.667%), skip ratio=0%
Iteration 2: GROUND TRUTH : (CUR) & Amazon.com (Book) Conflict Papers Ø for GERMANY). Victor
Iteration 2: ALIGNED TRUTH : B
Iteration 2: BEST OCR TEXT : B
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 28 (Perfect):
Mean rms=-2.14748e+06%, delta=0.521%, train=67.437%(71.111%), skip ratio=0%
Iteration 3: GROUND TRUTH : different New Articles page 23 a To Service ~~ a details DC that don't as 7 «« Date: #1 : AZ
Iteration 3: ALIGNED TRUTH : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuB
Iteration 3: BEST OCR TEXT : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuB
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 12 (Perfect):
Mean rms=-2.14748e+06%, delta=0.391%, train=136.991%(78.333%), skip ratio=0%
Iteration 4: GROUND TRUTH : CONTEST for thinking? 24-YEAR-OLD LAW_OFFENSE_CODE what Kyle x HEARD For - ø
Iteration 4: ALIGNED TRUTH : B
Iteration 4: BEST OCR TEXT : B
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 9 (Perfect):
Mean rms=-2.14748e+06%, delta=0.313%, train=129.856%(82.667%), skip ratio=0%
Iteration 5: GROUND TRUTH : privilege. Vineyards Center Ø LAYERS. Gernot White ONANISM
Iteration 5: ALIGNED TRUTH : B
Iteration 5: BEST OCR TEXT : B
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 33 (Perfect):
Mean rms=-2.14748e+06%, delta=0.26%, train=125.167%(85.556%), skip ratio=0%
The main difference I see is with
Found AVX512BW Found AVX512F Found AVX2 Found AVX Found FMA Found SSE
@stweil Would this make such a big difference in training?
Environment
Current Behavior:
I tried several ways to recognize the characters in the attached file, just to verify whether tesseract can recognize the Greek character "Ø".
tesseract 11.png lll -l grc
tesseract 11.png lll -l ell
3. Retrain a new model:
1) src/training/tesstrain.sh --fonts_dir /usr/share/fonts --training_text ../training_data/chi_sim_tuned.txt \
--langdata_dir ../langdata_lstm --tessdata_dir ./tessdata --lang chi_sim --linedata_only --noextract_font_properties --exposures "0" \
--workspace_dir ~/share/workspace/tmp \
--save_box_tiff \
--fontlist "NSimSun" \
"Times New Roman" \
"Arial Unicode MS" \
"SimSun" \
"Noto Sans CJK SC" \
"Noto Sans Mono CJK SC" \
--output_dir ~/tesstutorial/chi_sim_train \
--overwrite
2) mkdir -p ~/tesstutorial/chi_sim_tuned_from_chi_sim
3) combine_tessdata -e ../tessdata_best/chi_sim.traineddata ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm
4) lstmtraining --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned \
--continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--old_traineddata ../tessdata_best/chi_sim.traineddata \
--train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
--max_iterations 3000
5) lstmtraining --stop_training --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned_checkpoint \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned.traineddata
tesseract 11.png lll -l chi_sim_tuned
But none of the methods above can recognize the Greek character "Ø". The image file, training text, and unicharset file are in the attachment.
Expected Behavior:
Tesseract can recognize the Greek character "Ø".
Suggested Fix:
chi_sim_tuned.txt
chi_sim.zip