Closed talha1503 closed 4 years ago
echo -e "\n***** Extract LSTM model from best traineddata for $STARTMODEL. \n"
combine_tessdata -e ../tessdata/best/$STARTMODEL.traineddata ../training/$TRAINDIR/$STARTMODEL.lstm
Please change the paths as needed.
https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc
Specify option -e if you would like to extract individual components from a combined traineddata file. For example, to extract language config file and the unicharset from tessdata/eng.traineddata run:
combine_tessdata -e tessdata/eng.traineddata /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset
So I tried extracting the unicharset from tessdata_best/ara.traineddata, but I'm not able to extract the ara.unicharset file from it. I'm getting the error shown in this screenshot: https://user-images.githubusercontent.com/42352729/85017654-72a0d400-b189-11ea-86a7-a26dccba4a46.png
The ara.traineddata file is present in the same directory. Can you please help me out? @Shreeshrii
ara.unicharset is for the legacy engine.
You can extract ara.lstm-unicharset instead.
Alternatively, just unpack the whole traineddata to get all files.
combine_tessdata -u ara.traineddata ara.
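For anyone who only needs the LSTM parts rather than a full unpack, the extraction can be wrapped in a small helper. This is only a sketch: the function name `extract_lstm_parts` and the output directory layout are made up for illustration. Only `combine_tessdata -e` and the `.lstm` / `.lstm-unicharset` component names come from this thread.

```shell
# Hypothetical helper (not part of tesstrain): pull the LSTM model and its
# unicharset out of a traineddata file with combine_tessdata -e.
extract_lstm_parts() {
  traineddata=$1                       # e.g. tessdata_best/ara.traineddata
  outdir=$2                            # directory for the extracted components
  lang=$(basename "$traineddata" .traineddata)
  mkdir -p "$outdir"
  # -e decides which component to extract from each output file's suffix.
  combine_tessdata -e "$traineddata" \
    "$outdir/$lang.lstm" \
    "$outdir/$lang.lstm-unicharset"
}

# Example call (paths are assumptions, adjust as needed):
# extract_lstm_parts tessdata_best/ara.traineddata data/ara
```

The resulting `ara.lstm` is the file that `lstmtraining --continue_from` expects.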
Got the config, lstm and unicharset from ara.traineddata (tessdata_best). But I am having trouble fine-tuning the ara.traineddata from 'tessdata_best'. I am getting the following error:
Can you please help, @Shreeshrii?
I am using the following code for fine-tuning:
finetune: lists
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from ara.lstm \
	  --traineddata ara.traineddata \
	  --train_listfile data/ara/list.train \
	  --eval_listfile data/ara/list.eval \
	  --model_output ./data/checkpoints/ara \
	  --max_iterations 10000
The list.train and list.eval files are generated properly and consist of the *.lstmf files from my data. But I am still getting the encoding error during training.
Does your training data have characters which are not there in ara.lstm-unicharset?
If so, you will need to do a different kind of training.
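One way to answer that question is to compare the characters in the ground truth files with the entries of the extracted unicharset. The sketch below is not a Tesseract tool. The function name `list_missing_chars` is hypothetical, and it assumes that the unicharset file has a count on its first line followed by one entry per line, with the symbol as the first space-separated field. It also assumes a UTF-8 locale so that `grep` splits the text into characters rather than bytes, and it ignores multi-codepoint unicharset entries.

```shell
# Hypothetical check: print every non-whitespace character that occurs in the
# ground truth files but has no single-character entry in the unicharset.
list_missing_chars() {
  unicharset=$1; shift                 # e.g. ara.lstm-unicharset
  known=$(mktemp); used=$(mktemp)
  # Skip the count line, keep the first field (the symbol) of each entry.
  tail -n +2 "$unicharset" | cut -d' ' -f1 | LC_ALL=C sort -u > "$known"
  # One character per line from all ground truth files, whitespace dropped.
  cat "$@" | grep -o '[^[:space:]]' | LC_ALL=C sort -u > "$used"
  # Lines that appear only in the second file are the missing characters.
  LC_ALL=C comm -13 "$known" "$used"
  rm -f "$known" "$used"
}

# Example call (file names are assumptions):
# list_missing_chars ara.lstm-unicharset data/ara-ground-truth/*.gt.txt
```

An empty result suggests the ground truth can be encoded with the existing unicharset. Any character that is printed would have to be added through a different kind of training.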
Nope. My training data consists only of numbers along with '(' and ','. Even for ground truth consisting only of numbers, it gives the missing-encoding error.
I looked at the unicharset file. It did not have any Arabic numerals. Is there any way I can add the required characters to the lstm unicharset file? @Shreeshrii
@Shreeshrii My training data has characters which are not in the extracted lstm-unicharset file. Is there any way I can update the lstm-unicharset file?
RTL training needs special considerations. Please see
https://github.com/tesseract-ocr/tesstrain/wiki/Arabic-Handwriting
Also see this pending PR.
https://github.com/tesseract-ocr/tesstrain/pull/159
Hi @Shreeshrii, I had another doubt. When I put the ground truth of my Arabic image in English, I do not get any error, but when my ground truth is in Arabic numerals, I get the encoding error (both while fine-tuning). The English numerals are also present in the ara.lstm-unicharset file. Does this mean that my unicharset maps to the ground truth of my training data? Can you please help me out? And is it possible?
There are 4 different ways of running lstmtraining depending on your need, ranging from 'impact', which can be used for adding a new font to existing traineddata, to 'scratch', which is for a totally new one.
The commands and requirements are different for each.
If you are adding the Arabic numerals then you cannot do impact style training.
The unicharset in your starter traineddata needs to be based on your training text.
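When characters are being added, lstmtraining needs both the original traineddata and a starter traineddata built around the new unicharset. The sketch below shows that shape of call using the `--old_traineddata` option of lstmtraining. The function name `finetune_with_new_chars` is hypothetical, the list file paths are copied from the makefile target earlier in this thread, and it assumes the new starter traineddata has already been built from the training text.

```shell
# Hypothetical wrapper: continue from the extracted LSTM model while switching
# to a starter traineddata whose unicharset covers the new ground truth.
finetune_with_new_chars() {
  old_traineddata=$1   # original model, e.g. ara.traineddata from tessdata_best
  old_lstm=$2          # ara.lstm extracted from it with combine_tessdata -e
  new_traineddata=$3   # starter traineddata with the new unicharset (assumed to exist)
  output_prefix=$4     # checkpoint prefix, e.g. data/checkpoints/ara
  lstmtraining \
    --continue_from "$old_lstm" \
    --old_traineddata "$old_traineddata" \
    --traineddata "$new_traineddata" \
    --train_listfile data/ara/list.train \
    --eval_listfile data/ara/list.eval \
    --model_output "$output_prefix" \
    --max_iterations 10000
}

# Example call (paths are assumptions):
# finetune_with_new_chars ara.traineddata ara.lstm data/ara/ara.traineddata data/checkpoints/ara
```

Passing the old traineddata lets lstmtraining map the existing unicharset onto the new one instead of failing on characters it cannot encode.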
@Shreeshrii Oh! So does it mean that the unicharset depends on the text in the ground truth? For example, this is my image
And this is my corresponding ground truth
Also , by training text do you mean ground truth text?
Will it be the right way to do it?
@Shreeshrii I am able to fine-tune the model. Thanks a lot for your help :). However, I am facing one problem. My digits are getting reversed, but my text is detected properly (both for Arabic). Can you please tell me how I should go about solving this problem?
@theraysmith I was trying to fine-tune Arabic. My digits are getting reversed, but my text is detected properly (both for Arabic). Can you please tell me how I should go about solving this problem?
Please see this earlier message
Also see this pending PR.
https://github.com/tesseract-ocr/tesstrain/pull/159
The bidi algorithm needs to be used to reverse the Arabic text. The numerals are not treated as RTL.
@stweil Did you do any further training for Arabic? Any suggestions?
Oh! @Shreeshrii, I had one more doubt. What does target error mean in the makefile? Is it validation error or training error?
See https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html
target_error_rate double 0.01 Stop training if the mean percent error rate gets below this value.
I was trying to fine tune the ara.traineddata model. I was wondering how I could generate the ara.lstm file from ara.traineddata file. @Shreeshrii Can you please help me out?