ulb-sachsen-anhalt / ulb-zeitungsprojekt-hp1

Training data from "Hauptphase I" of project "Digitalisierung historischer deutscher Zeitungen"

do external model/checkpoint selection #1

Open bertsky opened 3 years ago

bertsky commented 3 years ago

In light of https://github.com/tesseract-ocr/tesseract/issues/3560 (which describes both how tesstrain's own CER estimate is completely off and why its checkpoint selection uses the wrong criterion), I would recommend not simply using the "best" model picked by make training. Instead, implement your own checkpoint selection: run make traineddata, then measure the CER of each checkpoint on the validation subset with an external tool (not lstmeval).
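
A minimal sketch of what such external selection could look like, assuming each checkpoint has already been converted to a traineddata file and run over the validation lines, so that (ground truth, OCR output) text pairs are available per checkpoint. All function names and the checkpoint names in the usage example are hypothetical, and CER is taken here as total character edit distance divided by total ground-truth length, which is one common definition and not necessarily the one a given evaluation tool uses.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(pairs: List[Tuple[str, str]]) -> float:
    """CER over (ground_truth, ocr_output) pairs: total edits / total GT characters."""
    total_chars = sum(len(gt) for gt, _ in pairs)
    if total_chars == 0:
        raise ValueError("ground truth is empty")
    total_edits = sum(levenshtein(gt, ocr) for gt, ocr in pairs)
    return total_edits / total_chars


def select_checkpoint(results: Dict[str, List[Tuple[str, str]]]) -> Tuple[str, float]:
    """Return the checkpoint name with the lowest validation CER, and that CER."""
    scores = {name: corpus_cer(pairs) for name, pairs in results.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Hypothetical checkpoint names and toy validation lines, for illustration only.
    validation = {
        "checkpoint_a": [("Zeitung", "Zeitnng"), ("Halle", "Halle")],
        "checkpoint_b": [("Zeitung", "Zeitung"), ("Halle", "Hallc")],
        "checkpoint_c": [("Zeitung", "Zeitung"), ("Halle", "Halle")],
    }
    print(select_checkpoint(validation))
```

Aggregating edits over the whole validation subset, instead of averaging per-line CER, keeps short lines from dominating the score. Whichever definition is used, it should be the same for every checkpoint being compared.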

M3ssman commented 3 years ago

Thanks for your advice! We'll do some evaluation regarding this issue, since we plan to use this model (or a model based on this workflow / training data) for the ongoing digitization of historical newspapers in "Zeitungsprojekt HP II".

bertsky commented 1 year ago

Now that your report is published, may I ask again about model selection for ulbhdz1.traineddata? Was the checkpoint selected by Tesseract already the best one according to a true evaluator? How much did the CER results differ?