tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Training stops before all iterations have passed? #145

Closed: Asa-Nisi-Masa closed this issue 4 years ago

Asa-Nisi-Masa commented 4 years ago

I am training Tesseract with a large number of iterations, but the training stops before all of them have run. Is this normal? For example, I run the training with a command similar to the following:

make -r training START_MODEL=start_name TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata MAX_ITERATIONS=200000 MODEL_NAME=newmodel RATIO_TRAIN=0.99 PSM=8

But training finishes (with no errors) at something like 50k iterations. What could be causing it?

Asa-Nisi-Masa commented 4 years ago

Sorry for closing this earlier, but it seems the problem (at least on my side) still remains. I've just tried training Tesseract for 500k iterations, but it terminated prematurely after ~120k with no error message.

Shreeshrii commented 4 years ago

Training usually runs for the specified number of iterations or until the error rate falls below a certain threshold, whichever comes first.

wrznr commented 4 years ago

@Asa-Nisi-Masa Why do you use PSM=8? It is a rather odd choice for training LSTMs.

Asa-Nisi-Masa commented 4 years ago

@Shreeshrii I had a feeling that this might be the case.

@wrznr That is because my pipeline detects lines of text in images and parses these line images using Tesseract. Also, the text in these lines contains no spaces, hence PSM=8. Empirically, it also gives the best performance on my data.

wrznr commented 4 years ago

@Asa-Nisi-Masa Please elaborate: what did you do to test the performance on your data?

  1. Training with PSM=8 and recognition with PSM=8
  2. Training with PSM=13 and recognition with PSM=8
  3. Training with PSM=13 and recognition with PSM=13

What are the differences in performance? (I am surprised that there even are differences between 1 and 2.)

Asa-Nisi-Masa commented 4 years ago

@wrznr I mainly compared PSM=8 and PSM=6 (default one). I compared

  1. Training with PSM=6 and recognition with PSM=6
  2. Training with PSM=6 and recognition with PSM=8
  3. Training with PSM=8 and recognition with PSM=8

As far as I remember, there is little difference between 2 and 3, but both outperform the default mode.

Shreeshrii commented 4 years ago

Flag               Type    Default  Explanation
max_iterations     int     0        Stop training after this many iterations.
target_error_rate  double  0.01     Stop training if the mean percent error rate gets below this value.

from https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html
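
To illustrate how these two flags interact, here is a minimal Python sketch of the stopping rule. It is not Tesseract's actual code: the function name `stop_point` and the error trace are made up for illustration, and treating `max_iterations=0` as "no limit" is an assumption of the sketch.

```python
from typing import Iterable, Tuple


def stop_point(
    error_rates: Iterable[float],
    max_iterations: int = 0,
    target_error_rate: float = 0.01,
) -> Tuple[int, str]:
    """Return (iterations_run, reason) for a hypothetical trace of error rates.

    error_rates holds one mean percent error rate per iteration.
    Assumption: max_iterations == 0 means "no iteration limit".
    """
    iterations_run = 0
    for iterations_run, error in enumerate(error_rates, start=1):
        # Condition 1: the error rate dropped below the target.
        if error < target_error_rate:
            return iterations_run, "target_error_rate reached"
        # Condition 2: the iteration budget is used up.
        if max_iterations > 0 and iterations_run >= max_iterations:
            return iterations_run, "max_iterations reached"
    return iterations_run, "error trace exhausted"


# Made-up error trace: the target is hit long before 200000 iterations.
trace = [5.0, 1.0, 0.2, 0.005, 0.001]
print(stop_point(trace, max_iterations=200000))
```

With a trace like this, training ends as soon as the error rate falls below the target, no matter how large MAX_ITERATIONS is, and no error is reported.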

wrznr commented 4 years ago

@Asa-Nisi-Masa Well, the possible difference between 2 and 3 is the interesting point here. Please also note that the default for recognition with Tesseract is PSM=3. The default for tesstrain is PSM=6, but this is actually a very bad choice. You should use either 7 or 13. But as I wrote, it remains to be shown that it really makes a difference.

Asa-Nisi-Masa commented 4 years ago

@wrznr Thank you, I will attempt to do a systematic comparison between 7, 8 and 13.
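
One way such a comparison could be scored is by character error rate (CER) per PSM value. The following is only a sketch: `recognize` is a hypothetical callable standing in for whatever runs Tesseract on a line image with a given PSM, and the sample data is invented.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(pairs: List[Tuple[str, str]]) -> float:
    """Character error rate over (ground_truth, ocr_output) pairs."""
    errors = sum(edit_distance(gt, out) for gt, out in pairs)
    chars = sum(len(gt) for gt, _ in pairs)
    return errors / chars if chars else 0.0


def compare_psm(
    samples: List[Tuple[str, str]],        # (image_path, ground_truth)
    recognize: Callable[[str, int], str],  # hypothetical: (image_path, psm) -> text
    psms: Sequence[int] = (7, 8, 13),
) -> Dict[int, float]:
    """Return the CER for each PSM value on the same set of line images."""
    return {
        psm: cer([(gt, recognize(image, psm)) for image, gt in samples])
        for psm in psms
    }


# Invented stand-in recognizer, only to show the call shape.
fake = lambda image, psm: "abcd" if psm == 8 else "abcf"
print(compare_psm([("line1.png", "abcd")], fake))
```

Running every PSM on the same held-out line images keeps the comparison fair. The same function can be called once per trained model to cover the train/recognize combinations listed earlier in the thread.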

As for the original issue, thank you @Shreeshrii, I think this can be closed now.