tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

different results on different platforms while PLUS training Greek character "Ø" #2382

Open yixinuestc opened 5 years ago

yixinuestc commented 5 years ago

Environment

Current Behavior:

I tried several ways to recognize the characters in the attached file, just to check whether Tesseract can recognize the Greek character "Ø".

  1. tesseract 11.png lll -l grc

  2. tesseract 11.png lll -l ell

  3. Retrain a new model:

1) src/training/tesstrain.sh --fonts_dir /usr/share/fonts --training_text ../training_data/chi_sim_tuned.txt \
   --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata --lang chi_sim --linedata_only --noextract_font_properties --exposures "0" \
   --workspace_dir ~/share/workspace/tmp \
   --save_box_tiff \
   --fontlist "NSimSun" \
   "Times New Roman" \
   "Arial Unicode MS" \
   "SimSun" \
   "Noto Sans CJK SC" \
   "Noto Sans Mono CJK SC" \
   --output_dir ~/tesstutorial/chi_sim_train \
   --overwrite

2) mkdir -p ~/tesstutorial/chi_sim_tuned_from_chi_sim

3) combine_tessdata -e ../tessdata_best/chi_sim.traineddata ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm

4) lstmtraining --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned \
   --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
   --old_traineddata ../tessdata_best/chi_sim.traineddata \
   --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
   --max_iterations 3000

5) lstmtraining --stop_training --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned_checkpoint \
   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned.traineddata

tesseract 11.png lll -l chi_sim_tuned

But none of the methods above can recognize the Greek character "Ø". The image file, training text, and unicharset file are attached.

Expected Behavior:

Tesseract should recognize the Greek character "Ø".

Suggested Fix:

11 chi_sim_tuned.txt

chi_sim.zip

Shreeshrii commented 5 years ago

You need to remove your training directory before a new run; otherwise training continues from the existing checkpoints and no new training is done.
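A quick way to see whether stale checkpoints are still lying around is to list them before starting a run. The sketch below is only an illustration: the helper name `list_checkpoints` is hypothetical, and it assumes that checkpoint files carry "checkpoint" in their name, as `chi_sim_tuned_checkpoint` does in the commands above.

```shell
# Hypothetical helper: list leftover checkpoint files in a training output directory.
# Prints one path per line and returns 0 if any were found, 1 otherwise.
# Assumption: checkpoint files have "checkpoint" somewhere in their file name.
list_checkpoints() {
    dir=$1
    found=1
    for f in "$dir"/*checkpoint*; do
        # If the glob matches nothing, the unexpanded pattern is returned; skip it.
        [ -e "$f" ] || continue
        printf '%s\n' "$f"
        found=0
    done
    return "$found"
}

# Example (directory taken from the commands above):
# list_checkpoints ~/tesstutorial/chi_sim_tuned_from_chi_sim && echo "stale checkpoints present"
```

If the function prints anything, removing the directory (or at least those files) before retraining avoids silently continuing from the old state.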

yixinuestc commented 5 years ago

I have removed ~/tesstutorial/chi_sim_train and ~/tesstutorial/chi_sim_tuned_from_chi_sim folders before I retrain a new model.

Shreeshrii commented 5 years ago

The letter that you have shown (Greek character "Ø") is not a Greek letter; it is U+00D8, LATIN CAPITAL LETTER O WITH STROKE: https://www.compart.com/en/unicode/U+00D8

see https://en.wikipedia.org/wiki/%C3%98

This article is about the Scandinavian letter. For other uses, see Ø (disambiguation). Not to be confused with slashed 0, ∅, ф, or Φ.

Are you sure that your image is not conveying the slashed zero?

the slashed zero is sometimes approximated by overlaying zero and slash characters, producing the character "0̸"
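Because these look-alike characters are easy to mix up in a training text, it can help to look at the bytes that are actually stored. The following is a minimal sketch, assuming the text is UTF-8; `utf8_hex` is a hypothetical helper name, not part of Tesseract.

```shell
# Hypothetical helper: print the UTF-8 bytes of a string as lowercase hex, no separators.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Look-alike characters mentioned above, each followed by a newline for readability.
utf8_hex 'Ø'; echo    # U+00D8 LATIN CAPITAL LETTER O WITH STROKE
utf8_hex '∅'; echo    # U+2205 EMPTY SET
utf8_hex 'Φ'; echo    # U+03A6 GREEK CAPITAL LETTER PHI
utf8_hex 'ф'; echo    # U+0444 CYRILLIC SMALL LETTER EF
```

When the script is saved as UTF-8, the four calls print c398, e28885, cea6 and d184 respectively, so a character copied from an image caption or a web page can be checked against the one used in the training text.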

yixinuestc commented 5 years ago

My image is not conveying the slashed zero. If the "Ø" in the image is a Latin character, why can't it be recognized? Sometimes it is recognized as "4" and sometimes as "0". I checked the generated .tif files, and they contain the character "Ø".

Shreeshrii commented 5 years ago

There must be some error in your training process. I am getting correct result:

2382

tesseract 2382.png -  -l chi_sim_layer_int --dpi 300
Q345Ø426X12

Integer/Fast traineddata is attached - chi_sim_layer_int.zip

yixinuestc commented 5 years ago

Thanks. I have tried your traineddata file and it works. By the way, why do you call it "Integer/Fast" traineddata? Did you fine-tune using tessdata_fast/chi_sim.traineddata?

Shreeshrii commented 5 years ago

Did you fine tuning using tessdata_fast/chi_sim.traineddata?

No. _fast models can't be used for fine-tuning.

But trained models can be converted to integer format; they are smaller and faster, though maybe slightly less accurate.

Use the --convert_to_int flag with --stop_training.

Shreeshrii commented 5 years ago

Here is the best/float model.

chi_sim_layer.zip

This issue can be closed.

yixinuestc commented 5 years ago

Thank you very much. Last question: Did you use the same font list as mine during training?

Shreeshrii commented 5 years ago

#!/bin/bash

 rm -rf ~/tesstutorial/chi_sim_train

 ~/tesseract/src/training/tesstrain.sh \
 --fonts_dir ~/.fonts \
 --training_text ~/langdata_lstm/chi_sim/chi_sim.finetune.training_text \
 --langdata_dir ~/langdata_lstm \
 --tessdata_dir ~/tesseract/tessdata \
 --lang chi_sim --linedata_only \
 --noextract_font_properties  \
 --exposures "0" \
 --maxpages 0 \
 --workspace_dir ~/tmp \
 --save_box_tiff \
 --fontlist  \
 "NSimSun" \
 "Arial Unicode MS" \
 "SimSun" \
 "Noto Sans CJK SC" \
 "Noto Sans Mono CJK SC" \
 --output_dir ~/tesstutorial/chi_sim_train

rm -rf ~/tesstutorial/chi_sim_layer
mkdir ~/tesstutorial/chi_sim_layer

combine_tessdata -e ~/tessdata_best/chi_sim.traineddata ~/tesstutorial/chi_sim_layer/chi_sim.lstm

lstmtraining \
--model_output ~/tesstutorial/chi_sim_layer/chi_sim_layer \
--continue_from ~/tesstutorial/chi_sim_layer/chi_sim.lstm \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--append_index 5 --net_spec '[Lfx128 O1c1]' \
--train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
--debug_interval -1 \
--max_image_MB 6000 \
--max_iterations 6000

~/tesseract/bin/src/training/lstmtraining \
--stop_training \
--continue_from ~/tesstutorial/chi_sim_layer/chi_sim_layer_checkpoint  \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--model_output ~/tesstutorial/chi_sim_layer/chi_sim_layer.traineddata

~/tesseract/bin/src/training/lstmtraining \
--stop_training \
--convert_to_int \
--continue_from ~/tesstutorial/chi_sim_layer/chi_sim_layer_checkpoint  \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--model_output ~/tesstutorial/chi_sim_layer/chi_sim_layer_int.traineddata

cp  ~/tesstutorial/chi_sim_layer/*.traineddata ~/tessdata_best/

ls -l ~/tesstutorial/chi_sim_layer/*.traineddata

yixinuestc commented 5 years ago

Thank you. I see.

kbrajwani commented 4 years ago

@Shreeshrii I also want to add the symbols ∅, ф, Φ to the pretrained eng.traineddata. The steps I followed:

  1. Copied your repository https://github.com/Shreeshrii/tess4training

  2. Added the symbols ∅, ф, Φ to eng.training_text

  3. Ran 8-makedata_layernew.sh and 9-layernew.sh

It now recognizes ∅, ф, Φ in some places, but recognition of simple English words has gotten worse.

Shreeshrii commented 4 years ago

change eng.training_text add ∅, ф, Φ this symbol into eng.training_text

How big is your training text? For Replace layer training you need a large te

Your model is probably getting overfitted to training data.

kbrajwani commented 4 years ago

@Shreeshrii I only added 15 lines containing the symbols ∅, ф, Φ to eng.training_text, nothing more.

Shreeshrii commented 4 years ago

As per Ray that's enough if you're doing finetuning plus-minus type training.

To replace the top layer in network you need a larger training text.

My example shows the method for adding a lot of different characters.

You have to use an appropriately large text from langdata_lstm repo and add your characters to it. Also make sure that the fonts you are using can render them.
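Before generating training data, it is easy to check how many lines of the training text actually contain each new character compared with the total size of the text. This is a small sketch; the helper names `count_lines_with` and `count_lines` are hypothetical, and the file name in the commented example is just the one discussed in this thread.

```shell
# Hypothetical helper: count the lines of a file that contain a literal string.
count_lines_with() {
    # -F: fixed string, not a regex; -c: print the number of matching lines.
    grep -c -F -e "$1" "$2"
}

# Hypothetical helper: count all lines in a file.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Example:
# for c in ∅ ф Φ; do
#     printf '%s: %s of %s lines\n' "$c" \
#         "$(count_lines_with "$c" eng.training_text)" "$(count_lines eng.training_text)"
# done
```

This only counts occurrences in the text. Whether the chosen fonts can render the characters still has to be checked separately, for example by inspecting the generated .tif files.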

kbrajwani commented 4 years ago

@Shreeshrii I am trying to do plus-minus fine-tuning using your repository https://github.com/Shreeshrii/tess4training.

I saw that you add text in 5-makedata_plusminus.sh, so I added my text to that file as shown below:

cat <<EOM >>../langdata/eng/eng.plusminus.training_text
Ø TRADEMARKS §120.871 Gilmore, FREE More Number Low trying AWARD, (‘Beaver
Ø History, 60 ¥ CONSPIRACY Jack ......... influence 15 of From It more in few LLC be is 24, find with 3 ” University you __ now! good
You Ø | Silva, OPEN FRESNO groups. integrated 14. Map Metals
EOM

I am getting this error when I run 6-plusminus.sh:

6-plusminus.sh: line 23: 8686 Segmentation fault (core dumped) lstmtraining --model_output ../tesstutorial/trainplusminus/plusminus --continue_from ../tesstutorial/trainplusminus/eng.lstm --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata --old_traineddata tessdata/best/eng.traineddata --train_listfile ../tesstutorial/trainplusminus/eng.training_files.txt --max_iterations 3600 Past Result Range

Please reply.

Shreeshrii commented 4 years ago

Are all the paths to files correct?

Are you using eng.traineddata from tessdata_best?

Is eng.lstm extracted from the tessdata_best file?

kbrajwani commented 4 years ago

@Shreeshrii It works fine if I run your 5-makedata_plusminus.sh and 6-plusminus.sh unchanged, but when I change 5-makedata_plusminus.sh to add my text, it fails with a segmentation fault.

Shreeshrii commented 4 years ago

You can modify the text outside of the script. I had done it that way to automate the process and highlight the changes.

Check that the earlier process completes without error and creates the required files for running lstmtraining in next step.

kbrajwani commented 4 years ago

@Shreeshrii "You can modify the text outside of the script." Does that mean I should change eng.training_text?

Please look at this notebook to see what I did:

https://github.com/kbrajwani/learn/blob/master/Untitled8.ipynb

Shreeshrii commented 4 years ago

What is the version of tesseract that you are using? Post output of tesseract -v

Try running gdb to get more details of the segmentation fault?

@stweil - Any other suggestions? lstmtraining is segfaulting after reading the first image.

kbrajwani commented 4 years ago

@Shreeshrii I am running this notebook in Colab and I am not familiar with gdb.

Output of tesseract -v:

tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE

Shreeshrii commented 4 years ago

Please install latest version of tesseract, either build from source from GitHub master or use Alex's ppa if using Ubuntu.

Shreeshrii commented 4 years ago

Please share the new text you are adding. Let me see if I can replicate the error.

kbrajwani commented 4 years ago

@Shreeshrii I made this notebook: https://github.com/kbrajwani/learn/blob/master/Untitled8.ipynb

It contains the new text and everything else I did. Thanks.

Shreeshrii commented 4 years ago

I cannot replicate it on tesseract version 5.00 alpha. See the attached log.txt.

Relevant portion below:


Extracting tessdata components from tessdata/best/eng.traineddata
Wrote ../tesstutorial/trainplusminus/eng.lstm
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
ubuntu@tesseract-ocr:~/tess4training$ bash 6-plusminus.sh

***** Run lstmtraining with debug output for first 100 iterations.

Loaded file ../tesstutorial/trainplusminus/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 114!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc114:114, 58482
Total weights = 1462546
Previous null char=110 mapped to 113
Continuing from ../tesstutorial/trainplusminus/eng.lstm
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Arial_Bold.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Courier_New_Bold.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Arial.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Courier_New.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Arial_Bold_Italic.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Arial_Italic.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Courier_New_Bold_Italic.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.DejaVu_Sans_Ultra-Light.exp0.lstmf
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Courier_New_Italic.exp0.lstmf
Iteration 0: GROUND  TRUTH : PhD-presenting MERGE REGULATION Irish Ø *P<0.05. REACHED Tampa HOME Feedback
Iteration 0: BEST OCR TEXT : PhD-presenting MERGE REGULATION Irish @ *P<0.05. REACHED Tampa HOME Feedback
File /tmp/eng-2020-05-19.OFI/eng.Arial_Bold.exp0.lstmf line 7 :
Mean rms=0.671%, delta=0.476%, train=2.632%(10%), skip ratio=0%
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Georgia_Bold.exp0.lstmf
Iteration 1: GROUND  TRUTH : netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
Iteration 1: BEST OCR TEXT : netting Bookmark of WE MORE) STRENGTH IDENTICAL 12? activity PROPERTY MAINTAINED
File /tmp/eng-2020-05-19.OFI/eng.Arial_Bold_Italic.exp0.lstmf line 27 :
Mean rms=0.591%, delta=0.351%, train=2.566%(9.545%), skip ratio=0%
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Georgia_Bold_Italic.exp0.lstmf
Iteration 2: GROUND  TRUTH : and first << article XML in NFL €] following 6 then and a know system Free 08 £20 years see
File ../tesstutorial/trainplusminus/eng.Arial.exp0.lstmf line 0 (Perfect):
Mean rms=0.449%, delta=0.234%, train=1.711%(6.364%), skip ratio=0%
Loaded 254/254 lines (1-254) of document ../tesstutorial/trainplusminus/eng.Georgia.exp0.lstmf

log.txt

kbrajwani commented 4 years ago

@Shreeshrii Did you add my text to eng.training_text or to 5-makedata_plusminus.sh?

Also, please help me to install tesseract version 5.00 alpha.

kbrajwani commented 4 years ago

@Shreeshrii I have made some progress.

I am now able to start training, but the results are getting worse; see the training output in the notebook https://github.com/kbrajwani/learn/blob/master/Untitled8.ipynb

This is what my training output looks like. Your log.txt shows training working well; how?

Iteration 19: GROUND TRUTH : PENALTY. HAKATA (QUOTATIONS) Ø WeatherAlarmTM THOROUGHLY. EzineArticles
Iteration 19: ALIGNED TRUTH : tititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititit
Iteration 19: BEST OCR TEXT : tititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititititit

Please check whether I have installed tesseract version 5.00 alpha correctly.

Shreeshrii commented 4 years ago

The log looks as if training is starting from scratch. Let it run for about 3000 iterations and see what error rate you get.

I am not sure why this is happening. Maybe it is accessing some old file.
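To follow the error rate over many iterations without reading the whole debug log, the `train=` figure can be pulled out of the "Mean rms=..." lines. The sketch below assumes the log format shown earlier in this thread ("Iteration N: ..." lines followed by a "Mean rms=..., train=X%(...)" line); `train_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: read an lstmtraining debug log on stdin and print
# "iteration train_error_percent" pairs, one per line.
# Assumption: the log uses the "Iteration N: ..." / "Mean rms=..., train=X%(...)" layout.
train_errors() {
    awk '
        /^Iteration [0-9]+:/ { split($2, a, ":"); iter = a[1] }   # remember the current iteration
        /^Mean rms=.*train=/ {
            s = $0
            sub(/.*train=/, "", s)   # drop everything up to the value
            sub(/%.*/, "", s)        # drop the percent sign and the rest of the line
            print iter, s
        }
    '
}

# Example:
# bash 6-plusminus.sh 2>&1 | tee train.log
# train_errors < train.log | tail -n 5
```

A run that fine-tunes from an existing model would be expected to start with small values, while a run that behaves as if training from scratch starts high.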

Shreeshrii commented 4 years ago

I had

***** Run lstmtraining with debug output for first 100 iterations.

Loaded file ../tesstutorial/trainplusminus/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 114!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc114:114, 58482
Total weights = 1462546
Previous null char=110 mapped to 113

You have different values. Looks like somewhere a different traineddata file is being used.

kbrajwani commented 4 years ago

@Shreeshrii

Thanks for replying so quickly. I am new to training tesseract, so I don't understand the errors I am getting. If you are able to train on my data, could you please make a Colab notebook, since I am training tesseract using Colab?

Please open my notebook

https://github.com/kbrajwani/learn/blob/master/Untitled8.ipynb

in

https://colab.research.google.com/

In the notebook I have included all the GitHub links from which I get the files, so you can understand the problem. If you can make a new notebook, that would help me a lot.

Shreeshrii commented 4 years ago

Problem is in this step - Text data add into 5-makedata_plusminus.sh

How are you adding this text? The text needs to be saved as UTF-8 with UNIX line endings (LF). Alternatively, you can create a new file with your training_text and use it.
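Both conditions can be checked from the shell before generating training data. This is a rough sketch, assuming `iconv` and `grep` are available; `check_training_text` is a hypothetical helper, and the path in the commented example is the file used earlier in this thread.

```shell
# Hypothetical helper: return 0 if a file is valid UTF-8 and contains no
# carriage returns (i.e. it uses UNIX line endings), non-zero otherwise.
check_training_text() {
    file=$1
    cr=$(printf '\r')
    # iconv fails on byte sequences that are not valid UTF-8.
    if ! iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
        echo "$file: not valid UTF-8" >&2
        return 1
    fi
    # Any CR byte suggests DOS/Windows line endings.
    if grep -q "$cr" "$file"; then
        echo "$file: contains CR characters (DOS line endings?)" >&2
        return 1
    fi
    return 0
}

# Example:
# check_training_text ../langdata/eng/eng.plusminus.training_text && echo OK
```

The check says nothing about the content of the text; it only rules out the two encoding problems named above.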

Shreeshrii commented 4 years ago

@kbrajwani Looks like you have identified a bug. I can run the training on my Ubuntu machine, but it fails in the Colab environment.

@stweil Please take a look when possible. See the following link, running tesseract on colab installed using Alex's ppa (AVX, AVX2 etc). Training starts as if from scratch, ignoring the startmodel. Same commands on ppc64le work as in plusminus training.

https://colab.research.google.com/drive/11NLa-52H-ofQHTN8ZVKvb9zUsdOpkDDX?usp=sharing

Shreeshrii commented 4 years ago

@stweil @amitdo I have checked the locale and also tried setting --sequential_training, but training starts with different lines and there is a vast difference in the error rates: on the 3rd iteration, 0.676% in my environment and 191.749% on colab.

On my environment,

ubuntu@tesseract-ocr:~/tess4training$ uname -a
Linux tesseract-ocr 5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:15 UTC 2020 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@tesseract-ocr:~/tess4training$ tesseract -v
tesseract 5.0.0-alpha-595-gccb9
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found OpenMP 201511

Total weights = 1462033
Previous null char=110 mapped to 112
Continuing from ../tesstutorial/trainplusminustheta/eng.lstm
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold_Italic.exp0.lstmf
Iteration 0: GROUND  TRUTH : Ø TRADEMARKS §120.871 Gilmore, FREE More Number Low trying AWARD, ('Beaver
Iteration 0: BEST OCR TEXT : @ TRADEMARKS §120.871 Gilmore, FREE More Number Low trying AWARD, ('Beaver
File /tmp/eng-2020-05-25.fhP/eng.Arial_Bold.exp0.lstmf line 33 :
Mean rms=0.734%, delta=0.811%, train=2.703%(9.091%), skip ratio=0%
Iteration 1: GROUND  TRUTH : or SC used By October Technology City And Business could Services (1) in Services 12 for
File ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf line 1 (Perfect):
Mean rms=0.434%, delta=0.405%, train=1.351%(4.545%), skip ratio=0%
Iteration 2: GROUND  TRUTH : does YOU OH 30 them its 1 comments are November URL Reply of a San'a' I've some The to:
File ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf line 2 :
Mean rms=0.424%, delta=0.366%, train=0.901%(3.03%), skip ratio=0%
Iteration 3: GROUND  TRUTH : 2003 password? new News [+] will through their Your of both find Sign first In article .
File /tmp/eng-2020-05-25.fhP/eng.Arial_Bold.exp0.lstmf line 3 (Perfect):
Mean rms=0.363%, delta=0.274%, train=0.676%(2.273%), skip ratio=0%

On colab

Linux 3a6dd4b34ac1 4.19.104+ #1 SMP Wed Feb 19 05:26:34 PST 2020 x86_64 x86_64 x86_64 GNU/Linux
tesseract 5.0.0-alpha-671-g27d51
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found OpenMP 201511
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

Total weights = 1462033
Previous null char=110 mapped to 112
Continuing from ../tesstutorial/trainplusminustheta/eng.lstm
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold_Italic.exp0.lstmf
Iteration 0: GROUND  TRUTH : You Ø | Silva, OPEN FRESNO groups. integrated 14. Map Metals
Iteration 0: BEST OCR TEXT : You @ | Silva, OPEN FRESNO groups. integrated 14. Map Metals
File /tmp/eng-2020-05-25.hxE/eng.Arial_Bold.exp0.lstmf line 49 :
Mean rms=0.821%, delta=0.678%, train=3.333%(9.091%), skip ratio=0%
Iteration 1: GROUND  TRUTH : Avoidance Moosejaw pm* Ø18 note: PROBE Jailbroken RAISE Fountains Write Goods (Ø6)
Iteration 1: ALIGNED TRUTH : B
Iteration 1: BEST OCR TEXT : B
File /tmp/eng-2020-05-25.hxE/eng.Arial_Bold.exp0.lstmf line 49 (Perfect):
Mean rms=-2.14748e+06%, delta=0.339%, train=51.057%(54.545%), skip ratio=0%
Iteration 2: GROUND  TRUTH : Heartbreakers (1976). {Lukevics:June Page Vandread Beauty @ ¥ Ø away ON
Iteration 2: ALIGNED TRUTH : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBu
Iteration 2: BEST OCR TEXT : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBu
File /tmp/eng-2020-05-25.hxE/eng.Arial_Bold.exp0.lstmf line 7 (Perfect):
Mean rms=-2.14748e+06%, delta=0.226%, train=145.775%(69.697%), skip ratio=0%
Iteration 3: GROUND  TRUTH : and first << article XML in NFL €] following 6 then and a know system Free 08 £20 years see
Iteration 3: ALIGNED TRUTH : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBu
Iteration 3: BEST OCR TEXT : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBu
File /tmp/eng-2020-05-25.hxE/eng.Arial_Bold.exp0.lstmf line 30 (Perfect):
Mean rms=-2.14748e+06%, delta=0.169%, train=191.749%(77.273%), skip ratio=0%

Shreeshrii commented 4 years ago

You can try and replicate with

git clone https://github.com/Shreeshrii/tess4training.git
cd tess4training
bash 5-makedata_plusminustheta.sh
bash 6-plusminustheta.sh

Shreeshrii commented 4 years ago

@amitdo Why are you closing this without even a response? Training should work similarly on all platforms and there is vast difference in results here using the same data.

amitdo commented 4 years ago

Shree,

Sorry for closing the issue (without a response), my mistake.

Shreeshrii commented 4 years ago

I checked on a different machine just now.

shree@sanskrit:~/tess4training$ uname -a
Linux sanskrit 4.4.0-148-generic #174~14.04.1-Ubuntu SMP Thu May 9 08:17:37 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
shree@sanskrit:~/tess4training$ tesseract -v
tesseract 4.1.1-rc2-21-gf4ef
 leptonica-1.76.0
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0 : libopenjp2 2.3.0
 Found libarchive 3.1.2

...

Previous null char=110 mapped to 112
Continuing from ../tesstutorial/trainplusminustheta/eng.lstm
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold_Italic.exp0.lstmf
Iteration 0: GROUND  TRUTH : such (4) 3 view Business R25 Click other {/if} This PATENTSCOPE® your £ Profile different iPod
File /tmp/eng-2020-05-27.Fqo/eng.Arial_Bold.exp0.lstmf line 141 (Perfect):
Mean rms=0.167%, delta=0%, train=0%(0%), skip ratio=0%
Iteration 1: GROUND  TRUTH : 31 which COPYRIGHT DVDs out group April including just place 18 service Articles as could these
File /tmp/eng-2020-05-27.Fqo/eng.Arial_Bold.exp0.lstmf line 1 (Perfect):
Mean rms=0.156%, delta=0%, train=0%(0%), skip ratio=0%
Iteration 2: GROUND  TRUTH : FITTING Tape company. Featured BOOK has PSYCHOTIC Ø CONTENT permeable LATVIA
Iteration 2: BEST OCR TEXT : FITTING Tape company. Featured BOOK has PSYCHOTIC @ CONTENT permeable LATVIA
File ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf line 2 :
Mean rms=0.439%, delta=0.407%, train=0.877%(3.03%), skip ratio=0%
Iteration 3: GROUND  TRUTH : top days Login this 2004 & - said first 27 then 2. $100 they FIG. [1] (GeneRIF) World and ABOUT
File /tmp/eng-2020-05-27.Fqo/eng.Arial_Bold.exp0.lstmf line 45 (Perfect):
Mean rms=0.365%, delta=0.305%, train=0.658%(2.273%), skip ratio=0%

Shreeshrii commented 4 years ago

But it is getting very different results on colab, which progressively get worse.

Linux 9bc150e0637f 4.19.104+ #1 SMP Wed Feb 19 05:26:34 PST 2020 x86_64 x86_64 x86_64 GNU/Linux
tesseract 4.1.1-rc2-21-gf4ef
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

Previous null char=110 mapped to 112
Continuing from ../tesstutorial/trainplusminustheta/eng.lstm
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold.exp0.lstmf
Loaded 169/169 lines (1-169) of document ../tesstutorial/trainplusminustheta/eng.Arial_Bold_Italic.exp0.lstmf
Iteration 0: GROUND  TRUTH : of Go jobs describe ø Landry *80Ø/min (+10 CLASS what # him SONS, ON href=""
Iteration 0: BEST OCR TEXT : of Go jobs describe o Landry *80@/min (+10 CLASS what # him SONS, ON href=""
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 88 :
Mean rms=1.091%, delta=1.563%, train=5.263%(13.333%), skip ratio=0%
Iteration 1: GROUND  TRUTH : have Perhaps Big Windows I've Ø CHURCH'S FEMINIST Hate Mon-Sat PARKER OF
Iteration 1: ALIGNED TRUTH : B
Iteration 1: BEST OCR TEXT : B
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 34 (Perfect):
Mean rms=-2.14748e+06%, delta=0.781%, train=51.937%(56.667%), skip ratio=0%
Iteration 2: GROUND  TRUTH : (CUR) & Amazon.com (Book) Conflict Papers Ø for GERMANY). Victor
Iteration 2: ALIGNED TRUTH : B
Iteration 2: BEST OCR TEXT : B
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 28 (Perfect):
Mean rms=-2.14748e+06%, delta=0.521%, train=67.437%(71.111%), skip ratio=0%
Iteration 3: GROUND  TRUTH : different New Articles page 23 a To Service ~~ a details DC that don't as 7 «« Date: #1 : AZ
Iteration 3: ALIGNED TRUTH : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuB
Iteration 3: BEST OCR TEXT : uBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuBuB
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 12 (Perfect):
Mean rms=-2.14748e+06%, delta=0.391%, train=136.991%(78.333%), skip ratio=0%
Iteration 4: GROUND  TRUTH : CONTEST for thinking? 24-YEAR-OLD LAW_OFFENSE_CODE what Kyle x HEARD For - ø
Iteration 4: ALIGNED TRUTH : B
Iteration 4: BEST OCR TEXT : B
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 9 (Perfect):
Mean rms=-2.14748e+06%, delta=0.313%, train=129.856%(82.667%), skip ratio=0%
Iteration 5: GROUND  TRUTH : privilege. Vineyards Center Ø LAYERS. Gernot White ONANISM
Iteration 5: ALIGNED TRUTH : B
Iteration 5: BEST OCR TEXT : B
File /tmp/eng-2020-05-27.kCk/eng.Arial_Bold.exp0.lstmf line 33 (Perfect):
Mean rms=-2.14748e+06%, delta=0.26%, train=125.167%(85.556%), skip ratio=0%

Shreeshrii commented 4 years ago

The main difference I see is in the CPU features detected on colab:

 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

@stweil Would this make such a big difference in training?
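To make this kind of comparison less error-prone, the "Found ..." lines from `tesseract -v` on two machines can be diffed mechanically. The following is only a sketch: `found_features` and `diff_features` are hypothetical helpers, and they assume the version output has been saved to one text file per machine.

```shell
# Hypothetical helper: extract the "Found ..." lines from `tesseract -v`
# output on stdin, strip the "Found " prefix, and sort them.
found_features() {
    sed -n 's/^ *Found  *//p' | sort
}

# Hypothetical helper: show features that appear in only one of two saved outputs.
# Column 1: only in the first file; column 2 (tab-indented): only in the second.
diff_features() {
    a=$(mktemp)
    b=$(mktemp)
    found_features < "$1" > "$a"
    found_features < "$2" > "$b"
    # comm -3 suppresses the lines common to both sorted inputs.
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}

# Example:
# tesseract -v > local.txt 2>&1      # run on each machine, then copy the files together
# diff_features local.txt colab.txt
```

Library version lines such as libarchive also start with "Found" and will show up in the diff, which is usually harmless but worth keeping in mind when reading the result.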