Asa-Nisi-Masa closed this issue 4 years ago
Sorry for closing earlier, but it seems the problem (at least on my part) still remains. I've just tried training Tesseract for 500k epochs, but it terminated prematurely after ~120k with no error message.
Usually training runs for a set number of iterations or until the error rate falls below a certain threshold.
@Asa-Nisi-Masa Why do you use PSM=8? It is a rather odd choice for training LSTMs.
@Shreeshrii I had a feeling that this might be the case.
@wrznr That is because my pipeline recognizes lines of text in images and parses these line images using Tesseract. Also, the text in these lines contains no spaces, hence PSM=8. Empirically, it also gives the best performance for my data.
@Asa-Nisi-Masa Please elaborate: what did you do to test the performance on your data?
1. Training with PSM=8 and recognition with PSM=8
2. Training with PSM=13 and recognition with PSM=8
3. Training with PSM=13 and recognition with PSM=13

What are the differences in performance? (I am surprised that there even are differences between 1 and 2.)

@wrznr I mainly compared PSM=8 and PSM=6 (the default one). I compared
1. Training with PSM=6 and recognition with PSM=6
2. Training with PSM=6 and recognition with PSM=8
3. Training with PSM=8 and recognition with PSM=8

As far as I remember, there is little difference between 2 and 3, but both outperform the default mode.
| Flag | Type | Default | Explanation |
|---|---|---|---|
| max_iterations | int | 0 | Stop training after this many iterations. |
| target_error_rate | double | 0.01 | Stop training if the mean percent error rate gets below this value. |
from https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html
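To make the interaction of these two flags concrete, here is a minimal sketch of a stop rule built only from the two descriptions in the table. It is not Tesseract's actual implementation. The function name `should_stop` is hypothetical, and treating `max_iterations = 0` as "no iteration limit" is an assumption.

```python
def should_stop(iteration: int,
                error_rate_percent: float,
                max_iterations: int = 0,
                target_error_rate: float = 0.01) -> bool:
    """Simplified stop rule based on the two documented flags.

    Hypothetical helper for illustration; not Tesseract's real code.
    """
    # Assumption: max_iterations == 0 means "no iteration limit".
    hit_iteration_limit = max_iterations > 0 and iteration >= max_iterations
    # The flag description says training stops once the mean percent
    # error rate gets below the target value.
    hit_error_target = error_rate_percent < target_error_rate
    return hit_iteration_limit or hit_error_target


# With MAX_ITERATIONS=200000, a run can still end at 50k iterations
# if the error rate has already dropped below the target.
print(should_stop(50_000, 0.009, max_iterations=200_000))
print(should_stop(50_000, 0.5, max_iterations=200_000))
```

Under this reading, a run that ends well before the iteration limit without an error message may simply have reached the target error rate. The training log would have to confirm that.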
@Asa-Nisi-Masa Well, the possible difference between 2 and 3 is the interesting point here. Please also note that the default for recognition with Tesseract is PSM=3. The default for tesstrain is PSM=6, but this is actually a very bad choice. You should use either 7 or 13. But as I wrote, it remains to be shown that it really makes a difference.
@wrznr Thank you, I will attempt to do a systematic comparison between 7, 8 and 13.
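One way such a comparison could be set up is sketched below: build the recognition command for each page segmentation mode, then score the output against ground truth with a character error rate. The helper names, the image path, and the model name `newmodel` are illustrative assumptions. Only the `tesseract <image> <outputbase> --psm N -l LANG` command-line form is taken from Tesseract itself.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def tesseract_command(image_path: str, output_base: str, psm: int,
                      lang: str = "newmodel") -> list:
    """Build (but do not run) a recognition command for one PSM.

    'newmodel' is a placeholder model name, assumed for illustration.
    """
    return ["tesseract", image_path, output_base,
            "--psm", str(psm), "-l", lang]


# Commands for the three modes under discussion.
for psm in (7, 8, 13):
    print(" ".join(tesseract_command("line.png", f"out_psm{psm}", psm)))
```

Each command could be run with `subprocess.run`, after which the resulting `.txt` file is compared with the ground-truth transcription using `character_error_rate`. Averaging over a held-out set of line images for every training/recognition PSM pair would give the comparison table discussed above.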
As for the original issue, thank you @Shreeshrii, I think this can be closed now.
I am training Tesseract and setting a large number of epochs, but the training stops before all epochs have passed. Is this normal? For example, I run the training with a command similar to the following:
make -r training START_MODEL=start_name TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata MAX_ITERATIONS=200000 MODEL_NAME=newmodel RATIO_TRAIN=0.99 PSM=8
But training finishes (with no errors) at something like 50k iterations. What could be causing it?