Open bertsky opened 3 years ago
Thanks for your advice! We'll do some evaluation regarding this issue, since we plan to use this model (or a model based on this workflow / training data) for the ongoing digitization of historical newspapers ("Zeitungsprojekt HP II").
In light of https://github.com/tesseract-ocr/tesseract/issues/3560 (which describes not only how tesstrain's own CER estimation is completely off, but also why its checkpoint selection uses the wrong criterion), I would recommend not just using the "best" model picked by `make training`, but implementing your own checkpoint selection based on `make traineddata` and a subsequent (external, not lstmeval-based) CER measurement (on the validation subset) of each checkpoint.
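To illustrate what such an external selection could look like, here is a minimal sketch in Python. It assumes you have already converted each checkpoint to a model, run Tesseract with it on the validation lines, and collected the recognized text next to the ground truth. The function names (`edit_distance`, `cer`, `select_checkpoint`) and the `results` layout are hypothetical and not part of tesstrain; the CER here is simply the total character-level Levenshtein distance divided by the total number of ground-truth characters.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(pairs: List[Tuple[str, str]]) -> float:
    """Aggregate CER over (ocr_text, ground_truth) pairs."""
    edits = sum(edit_distance(ocr, gt) for ocr, gt in pairs)
    chars = sum(len(gt) for _, gt in pairs)
    return edits / chars if chars else 0.0


def select_checkpoint(
    results: Dict[str, List[Tuple[str, str]]]
) -> Tuple[str, float]:
    """Return the checkpoint name with the lowest validation CER."""
    scores = {name: cer(pairs) for name, pairs in results.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical validation results: checkpoint name -> (OCR output, ground truth)
results = {
    "ckpt_a": [("Zeitnng", "Zeitung"), ("Berlim", "Berlin")],
    "ckpt_b": [("Zeitung", "Zeitung"), ("Berlim", "Berlin")],
}
best_name, best_cer = select_checkpoint(results)
print(best_name, round(best_cer, 4))
```

Aggregating edits over all lines before dividing (rather than averaging per-line rates) keeps short lines from dominating the score; whether that is the right choice for your data is a judgment call.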