tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

How to generate *.lstm file from *.traineddata file #169

Closed talha1503 closed 4 years ago

talha1503 commented 4 years ago

I was trying to fine-tune the ara.traineddata model and was wondering how I could generate the ara.lstm file from the ara.traineddata file. @Shreeshrii, can you please help me out?

Shreeshrii commented 4 years ago
echo -e "\n***** Extract LSTM model from best traineddata for $STARTMODEL. \n"
combine_tessdata -e ../tessdata/best/$STARTMODEL.traineddata ../training/$TRAINDIR/$STARTMODEL.lstm

Please change the paths as needed.

Shreeshrii commented 4 years ago

https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc

Specify option -e if you would like to extract individual components from a combined traineddata file. For example, to extract language config file and the unicharset from tessdata/eng.traineddata run:

combine_tessdata -e tessdata/eng.traineddata \
  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

talha1503 commented 4 years ago

So I tried extracting the unicharset from tessdata_best/ara.traineddata, but I'm not able to extract the ara.unicharset file from it. I'm getting the error shown in the image. [image: Screenshot from 2020-06-18 17-30-11] https://user-images.githubusercontent.com/42352729/85017654-72a0d400-b189-11ea-86a7-a26dccba4a46.png

The ara.traineddata file is present in the same directory. Can you please help me out, @Shreeshrii?

Shreeshrii commented 4 years ago

ara.unicharset is for the legacy engine.

You can extract ara.lstm-unicharset instead.

Alternatively, just unpack the whole traineddata to get all the files.

combine_tessdata -u ara.traineddata ara.
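
For anyone scripting these steps, here is a minimal Python sketch that builds the two `combine_tessdata` invocations quoted in this thread (`-e` to extract named components, `-u` to unpack everything under a prefix). The helper names `extract_command`, `unpack_command`, and `run_if_available` are hypothetical, and the file names are only examples; adjust paths as needed.

```python
import shutil
import subprocess


def extract_command(traineddata, *component_paths):
    """Build a `combine_tessdata -e` command line.

    As in the examples in this thread, the extension of each output path
    (e.g. `.lstm`, `.lstm-unicharset`) names the component to extract.
    """
    return ["combine_tessdata", "-e", traineddata, *component_paths]


def unpack_command(traineddata, prefix):
    """Build a `combine_tessdata -u` command line (unpack all components)."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def run_if_available(argv):
    """Run the command only if the executable is on PATH; otherwise return None."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=True)
```

For example, `run_if_available(extract_command("ara.traineddata", "ara.lstm", "ara.lstm-unicharset"))` would extract the LSTM model and its unicharset when Tesseract's training tools are installed, and do nothing otherwise.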

--


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

talha1503 commented 4 years ago

Got the config, lstm, and unicharset from ara.traineddata (tessdata_best). But I am having trouble fine-tuning the ara.traineddata from tessdata_best. I am getting the following error: [image: Screenshot from 2020-06-18 17-58-04]

Can you please help, @Shreeshrii?

I am using the following code for fine-tuning:

finetune: lists
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from ara.lstm \
	  --traineddata ara.traineddata \
	  --train_listfile data/ara/list.train \
	  --eval_listfile data/ara/list.eval \
	  --model_output ./data/checkpoints/ara \
	  --max_iterations 10000

The list.train and list.eval files are generated properly and consist of *.lstmf files from my data. But I am still getting the encoding error.

Shreeshrii commented 4 years ago

Does your training data have characters which are not in ara.lstm-unicharset?

If so, you will need to do a different kind of training.
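
One way to answer that question is to compare the characters in the ground-truth text against the extracted unicharset. The Python sketch below is only an approximation under stated assumptions: it assumes the usual text layout of a unicharset file (first line is the entry count, every later line starts with the character followed by a space), it treats `NULL`, `Joined`, and `|Broken|0|1` as placeholder entries rather than characters, and it compares single code points even though a unicharset entry can span several. The function names are hypothetical.

```python
from pathlib import Path

# Assumed placeholder entries that do not stand for a literal character.
SPECIAL_ENTRIES = {"NULL", "Joined", "|Broken|0|1"}


def load_unicharset(path):
    """Return the set of entries listed in a unicharset text file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    entries = set()
    for line in lines[1:]:  # the first line holds the entry count
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]  # the entry is the first field
        if token not in SPECIAL_ENTRIES:
            entries.add(token)
    return entries


def missing_characters(ground_truth_lines, unicharset_entries):
    """Return the non-space characters of the ground truth absent from the unicharset."""
    missing = set()
    for line in ground_truth_lines:
        for ch in line:
            if not ch.isspace() and ch not in unicharset_entries:
                missing.add(ch)
    return missing
```

If `missing_characters` returns a non-empty set for your ground-truth lines, the training data contains characters the starting model cannot encode, which matches the situation described in this thread.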

talha1503 commented 4 years ago

Nope. My training data consists only of numbers along with '(' and ','. Even for ground truth consisting only of numbers, it still gives the missing-encoding error.

talha1503 commented 4 years ago

Looked at the unicharset file. It did not have any Arabic numerals. Is there any way I can add the required characters to the lstm-unicharset file? @Shreeshrii

talha1503 commented 4 years ago

@Shreeshrii My training data has characters which are not in the extracted lstm-unicharset file. Is there any way I can update the lstm-unicharset file?

Shreeshrii commented 4 years ago

RTL training needs special considerations. Please see

https://github.com/tesseract-ocr/tesstrain/wiki/Arabic-Handwriting

Also see this pending PR.

https://github.com/tesseract-ocr/tesstrain/pull/159

talha1503 commented 4 years ago

Hi @Shreeshrii, I had another doubt. When I put the ground truth of my Arabic image in English, I do not get any error, but when my ground truth is in Arabic numerals, I get the encoding error (both while fine-tuning). These English numerals are also present in the ara.lstm-unicharset file. Does this mean that my unicharset maps to the ground truth of my training data? Can you please help me out? And is it possible?

Shreeshrii commented 4 years ago

There are 4 different ways of running lstmtraining depending on your need, ranging from 'impact', which can be used for adding a new font to an existing traineddata, to 'scratch', which is for a totally new one.

The commands and requirements are different for each.

If you are adding the Arabic numerals then you cannot do impact style training.

The unicharset in your starter traineddata needs to be based on your training text.

talha1503 commented 4 years ago

@Shreeshrii Oh! So does it mean that the unicharset depends on the text in the ground truth? For example, this is my image: [image: 55]

And this is my corresponding ground truth: [image: Screenshot from 2020-06-22 18-47-36]

Also, by training text do you mean ground truth text?

Will it be the right way to do it?

talha1503 commented 4 years ago

@Shreeshrii I am able to fine-tune the model. Thanks a lot for your help :). However, I am facing one problem. My digits are getting reversed, but my text is getting detected properly (both for Arabic). Can you please tell me how I should go about solving this problem?

talha1503 commented 4 years ago

@theraysmith I was trying to fine-tune Arabic. My digits are getting reversed, but my text is getting detected properly (both for Arabic). Can you please tell me how I should go about solving this problem?

Shreeshrii commented 4 years ago

Please see this earlier message

Also see this pending PR.

https://github.com/tesseract-ocr/tesstrain/pull/159

The bidi algorithm needs to be used to reverse the Arabic text. The numerals are not treated as RTL.
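
To illustrate why digits end up reversed, here is a deliberately crude Python sketch. It is not the Unicode Bidirectional Algorithm and not what tesstrain or the linked PR does; `naive_visual_order` is a hypothetical helper that reverses a line character by character while keeping every run of digits in its original left-to-right order. Digits are detected through `unicodedata.bidirectional`, which reports `EN` for European digits and `AN` for Arabic-Indic digits.

```python
import unicodedata


def is_number_char(ch):
    """True for characters whose bidi class is European or Arabic number."""
    return unicodedata.bidirectional(ch) in ("EN", "AN")


def naive_visual_order(text):
    """Reverse the text but keep each run of digits in its original order."""
    reversed_chars = list(reversed(text))
    out = []
    i = 0
    while i < len(reversed_chars):
        if is_number_char(reversed_chars[i]):
            # Find the end of this digit run and undo its reversal.
            j = i
            while j < len(reversed_chars) and is_number_char(reversed_chars[j]):
                j += 1
            out.extend(reversed(reversed_chars[i:j]))
            i = j
        else:
            out.append(reversed_chars[i])
            i += 1
    return "".join(out)
```

A plain character reversal would turn `"ab 123"` into `"321 ba"`, whereas this sketch keeps the number readable as `"123 ba"`. Real text with mixed scripts, punctuation, or nested directions needs a full bidi implementation.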

@stweil Did you do any further training for Arabic? Any suggestions?

Shreeshrii commented 4 years ago

See https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html

target_error_rate (double, default 0.01): Stop training if the mean percent error rate gets below this value.

On Fri, Jun 26, 2020 at 5:11 PM Talha Chafekar wrote:

Oh! @Shreeshrii, I had one more doubt. What does target error mean in the Makefile? Is it validation error or training error?
