tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Make training model #31

Closed atharmarajah closed 5 years ago

atharmarajah commented 5 years ago

Hi! I am trying to train a model with the test dataset given in the repo.

I am trying the following command after the installation of the dependencies:

 make training MODEL_NAME=name-of-the-resulting-model

Unfortunately when I try this command I get the following errors:

tesseract data/train/image.tif data/train/image --psm 6 lstm.train
Error in pixReadMemTiff: function not present
Error in pixReadMem: tiff: no pix returned
Error in pixaGenerateFontFromString: pix not made
Error in bmfCreate: font pixa not made
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Error in findTiffCompression: function not present
Error in pixReadFromMultipageTiff: function not present
...
...
Failed to load list of training filenames from data/list.train
Makefile:129 : recipe for target « data/checkpoints/name-of-the-resulting-model_checkpoint » failed
make: *** [data/checkpoints/name-of-the-resulting-model_checkpoint] Error 1

Thanks!

kba commented 5 years ago

You need leptonica with TIFF support or rather tesseract does. What version of leptonica are you using and how did you compile it?
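
One way to check this, sketched under assumptions: `tesseract --version` prints the leptonica version followed by the image libraries it was linked against, so a banner without a libtiff entry would be consistent with the pixReadMemTiff errors above. The helper names and the sample banner are illustrative and not taken from this thread.

```python
import subprocess


def has_tiff_support(version_banner: str) -> bool:
    """Return True if the banner lists libtiff among the linked image libraries."""
    return "libtiff" in version_banner.lower()


def tesseract_version_banner() -> str:
    """Run `tesseract --version` and return its output.

    Some builds print the banner to stderr, so both streams are merged.
    Requires tesseract on PATH; not called automatically here.
    """
    result = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return result.stdout


# Illustrative banner text (made up for this example, not from the thread).
SAMPLE_BANNER = """tesseract 4.0.0-beta.3
 leptonica-1.76.0
  libjpeg 8d : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
"""

print(has_tiff_support(SAMPLE_BANNER))
```

On a real installation, `has_tiff_support(tesseract_version_banner())` would perform the same check against the installed binary.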

atharmarajah commented 5 years ago

Thank you. I was able to install the libraries libpng, libtiff and libjpeg, and the command worked. I was able to train a model from your test dataset. I then wanted to test the resulting model on an image with the following command:

 $HOME/local/bin/tesseract ./data/train/alexis_ruhe01_1852_0219_004.tif output_image --oem 1 -l foo

I get the following error:

Error: LSTM requested, but not present!! Loading tesseract.
Failed loading language 'foo'
Tesseract couldn't load any languages!
Could not initialize tesseract.

The command works when I put eng as the language but not with this model. Also I have put the file in the tessdata folder. Could you help me please? Also could you tell me more about the foo.traineddata file which is created?

wrznr commented 5 years ago

A .traineddata file is basically an archive containing all the files created during training. The error you pasted indicates that the model could not be found by tesseract. Could you please try to use tesseract's --tessdata-dir to point to the directory where you store foo.traineddata?
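
A minimal sketch of that suggestion, assuming the model file is named `<lang>.traineddata` and lives directly inside the directory passed to `--tessdata-dir`. The function name and the example paths are hypothetical; the command-line options (`--tessdata-dir`, `--oem`, `-l`) are the ones already used in this thread.

```python
from pathlib import Path
from typing import List


def build_tesseract_command(
    image: str, output_base: str, lang: str, tessdata_dir: str
) -> List[str]:
    """Build a tesseract call that points explicitly at the model directory.

    Raises FileNotFoundError early if <lang>.traineddata is not in tessdata_dir,
    which is the situation where tesseract reports that it cannot load the language.
    """
    model = Path(tessdata_dir) / f"{lang}.traineddata"
    if not model.is_file():
        raise FileNotFoundError(f"{model} not found, so '{lang}' cannot be loaded")
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", str(tessdata_dir),
        "--oem", "1",
        "-l", lang,
    ]
```

The returned list can be passed to `subprocess.run`; for example, `build_tesseract_command("line.tif", "output_image", "foo", "data")` would look for `data/foo.traineddata` (a hypothetical location).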

wrznr commented 5 years ago

Please also note that foo is an arbitrary identifier (perhaps the archetypal one). You can choose whatever name you like instead.

wrznr commented 5 years ago

@atharmarajah Can we be of further assistance here?

wrznr commented 5 years ago

Okay, as mentioned above, the first aspect is not an error. It is just a missing feature: we do not yet support LSTM-specific dictionaries. This does not hinder successful training (rather the contrary; here at OCR-D we do not believe in dictionaries)! The second point is more severe. To me, it looks like a mismatch between the files used for building the unicharset and those used for training. The third thing you mention somewhat supports this: why would you try to recognize a German sentence with a model apparently trained on French text? (Are you sure that you are working with our example data?)

Please also note that name-of-the-resulting-model is a placeholder.

Here is my proposal: I will set up a step-by-step recipe for building a model with our example data and include hints on how to do training with your own data.

atharmarajah commented 5 years ago

Hi, I finally managed to train a model with your data and it's working fine; it was my mistake. Thanks a lot for the recipe. Could you please include a way of dealing with the unicharset issue in it? I tried to find some documentation on the internet, but nothing helped. Thanks again for your help!

wrznr commented 5 years ago

Sorry for the late reply. I have not managed to set up the recipe yet. What are you referring to with "the unicharset issue"?

0xSalim commented 5 years ago

Hi,

I'm working with @atharmarajah on the same project.

He was talking about this kind of error:

Can't encode transcription: 'boins tesmoins de toi au ꝯmũ de ta vile et' in language ''
Encoding of string failed! Failure bytes: ffffffcc ffffff83 6d 65 6e 74 20 6e 61 74 75 72 65

I have also attached the full log file of the training process, in case you want further details: test_ark_7_written_labels.log

We also have this error, but we don't know where it comes from:

Compute CTC targets failed!

Thanks a lot for your help!
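
The "Failure bytes" dump quoted above can be decoded to see which character could not be encoded. This is a sketch under one assumption: the `ffffff` prefix is taken to be sign extension of a signed byte, so only the low eight bits of each token are kept. With that reading, the dump is valid UTF-8 and starts with a combining character.

```python
import unicodedata


def decode_failure_bytes(dump: str) -> bytes:
    """Turn a hex dump such as 'ffffffcc ffffff83 6d' into raw bytes.

    Assumption: tokens like 'ffffffcc' are sign-extended bytes, so only the
    low 8 bits carry information.
    """
    return bytes(int(token, 16) & 0xFF for token in dump.split())


raw = decode_failure_bytes("ffffffcc ffffff83 6d 65 6e 74 20 6e 61 74 75 72 65")
text = raw.decode("utf-8")

# Name the first code point and show the remainder of the string.
print(unicodedata.name(text[0]), repr(text[1:]))

# A base letter plus a combining mark can be composed into one code point
# with NFC normalization, where a precomposed form exists.
composed = unicodedata.normalize("NFC", "u" + text[0])
print(len(composed), unicodedata.name(composed))
```

If the unicharset was built from text in a different normalization form than the transcriptions, normalizing both consistently is one thing to try; whether that is the cause here is not confirmed by the thread.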

wrznr commented 5 years ago

@salimtalout The log is not complete. Could you please run make clean, start the training again, and provide the complete log?

0xSalim commented 5 years ago

@wrznr Sure, here is the latest one: test_ark_corr_18.log. I ran make clean before starting the training. Thanks again.

wrznr commented 5 years ago

Okay, the log output looks good in terms of ocrd-train. Everything runs as expected.

Concerning Compute CTC targets failed!: that is an error message from Tesseract itself, and I do not know what it means, sorry. There is some discussion in Tesseract's GitHub repo which indicates that the training process does not converge (this would explain your poor results). If you post one of your training images, I could compare it to ours.

The Can't encode transcription: error is not present in your log. Can we consider that solved?

0xSalim commented 5 years ago

Hi,

I finally found where the Compute CTC targets failed! error comes from. As you said, it is caused by my images, which contain too much noise. After cleaning them, I no longer get this error.

Concerning the Can't encode transcription error, it is not solved yet. It's not present in the log file because I used labels that don't contain any unusual characters. I think it comes from our input data not being encoded in UTF-8. I'll try to re-encode it properly. The first log file I sent you contains this kind of error, if you want to check it.

Thanks a lot for your help!
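
A hedged sketch of how the encoding hypothesis could be checked before re-encoding anything. It assumes the transcriptions are stored as `*.gt.txt` files in one directory and, for the re-encoding step, that the original encoding is known (Latin-1 is used in the usage note purely as an example). Both function names are hypothetical.

```python
from pathlib import Path
from typing import List


def find_non_utf8(gt_dir: str, pattern: str = "*.gt.txt") -> List[Path]:
    """Return the transcription files whose bytes are not valid UTF-8."""
    bad = []
    for path in sorted(Path(gt_dir).glob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad


def reencode_to_utf8(path: Path, source_encoding: str) -> None:
    """Rewrite one file as UTF-8, given its (assumed) original encoding."""
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
```

For example, `for p in find_non_utf8("data/train"): reencode_to_utf8(p, "latin-1")` would convert only the files that fail the check. If the check finds nothing, the files are already valid UTF-8 and the cause lies elsewhere.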

wrznr commented 5 years ago

Fine! Feel free to share your insights on training (e.g. Ground Truth set size, optimal number of iterations etc.).

atharmarajah commented 5 years ago

Hi @wrznr, thank you for all the help. We are still working on the project and would like some more information about the error rate that is displayed during each iteration of the training and at the end of the training. Is it the error rate on the training set? We would like to see the validation error rate so that we can tune hyperparameters to get a better model.

wrznr commented 5 years ago

Sorry @atharmarajah, I have to refer you to the tesseract user forum for questions concerning the output of the tesseract training process. Please remember that the tesseract training tool is not part of this project, which simply aims to be an easy-to-use wrapper around the rather opaque shell scripts shipped with tesseract.

The makefile uses the eval_listfile parameter to provide validation files for the training (https://github.com/OCR-D/ocrd-train/blob/47ed14f08f68b60cd094273d3157eff2ac53f572/Makefile#L151).
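
To illustrate the idea of a separate validation list, here is a rough sketch of splitting training files into a training list and an evaluation list. It is not the Makefile's actual logic: the function name, the 90/10 ratio, and the fixed seed are assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple


def split_train_eval(
    files: Sequence[str], ratio_train: float = 0.9, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the file names reproducibly and split them into two disjoint lists."""
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio_train)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names, for illustration only.
names = [f"data/train/line_{i:03d}.lstmf" for i in range(20)]
train, evaluation = split_train_eval(names, ratio_train=0.5)
print(len(train), len(evaluation))
```

Each list would then be written out one file name per line; the evaluation list is the one handed to the training tool through eval_listfile, so the error measured on it comes from lines the model was not trained on.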

atharmarajah commented 5 years ago

Okay, thank you for your help and responsiveness