mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

[feature request] Reproducible trainings #580

Open stweil opened 3 months ago

stweil commented 3 months ago

Ideally, a training process should be reproducible, as this is required by good scientific practice.

Currently, kraken training is not reproducible. Two recognition trainings with the same ground truth and the same base model give different results (number of epochs, accuracies of the intermediate models).

eScriptorium shuffles the ground truth randomly, but always with the same seed, so the resulting training and validation sets are reproducible. However, the training itself appears to shuffle the training set once more, and that step does not seem to be reproducible.
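For illustration only (not eScriptorium or kraken code): a split made with a fixed seed is reproducible on its own, but any later shuffle that draws from an unseeded global RNG changes between runs. The file names and split ratio below are made up.

```python
import random

# Stand-in ground truth; names and split ratio are purely illustrative.
lines = [f"line_{i:04d}.png" for i in range(1000)]

# Reproducible part: a fixed seed always yields the same partition.
rng = random.Random(42)
shuffled = list(lines)
rng.shuffle(shuffled)
train, val = shuffled[:900], shuffled[900:]

# Non-reproducible part: the global RNG is seeded differently on every
# process start, so this per-epoch ordering differs between trainings.
random.shuffle(train)
```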

stweil commented 3 months ago

I just found my previous issue #302 for that. Is this an eScriptorium issue caused by not using the kraken API correctly?

mittagessen commented 3 months ago

On 24/03/17 02:16PM, Stefan Weil wrote:

I just found my previous issue #302 for that. Is this an eScriptorium issue caused by not using the kraken API correctly?

We never really tried to make eScriptorium reproducible, and it is currently not possible to make training 100% reproducible because of CUDA/cuDNN limitations. You can try the deterministic training switch on ketos, but you'll still see differences between machines, library versions, and the phase of the moon.
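For reference, the generic PyTorch-level switches that such a deterministic mode typically toggles look roughly like this. This is a sketch of the underlying mechanism, not kraken's own code, and operations without a deterministic CUDA kernel (such as the CTC loss) will still only produce a warning.

```python
import random

import numpy as np
import torch

def seed_all(seed: int = 42) -> None:
    """Seed every RNG that influences a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_all(42)

# Prefer deterministic kernels; warn (instead of raising) for operations
# like the CUDA CTC loss that have no deterministic implementation.
torch.use_deterministic_algorithms(True, warn_only=True)
torch.backends.cudnn.benchmark = False
```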

stweil commented 3 months ago

I currently struggle with eScriptorium trainings which end with a model that claims 100% accuracy although all epochs show accuracies below 99%. When I export the final model and examine its metadata, I can see that it is always the model from epoch 0 (eScriptorium starts counting epochs at 0, so it is the result of the first epoch).

mittagessen commented 3 months ago

Hmm, you can set deterministic='warn' on the KrakenTrainer object in eScriptorium, which should eliminate most non-deterministic behavior but won't get rid of it completely. Shuffling the training data twice shouldn't really have an impact, as the state of the RNG remains the same between two training runs (if the workers are restarted); otherwise we'd need to re-seed it for each task. IIRC the CUDA CTC loss is always non-deterministic.
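A minimal sketch of what that could look like on the eScriptorium side, assuming KrakenTrainer forwards standard pytorch_lightning.Trainer keyword arguments; everything besides the deterministic flag is a placeholder for whatever eScriptorium already passes.

```python
from kraken.lib.train import KrakenTrainer

# deterministic='warn' asks Lightning/PyTorch to use deterministic kernels
# where available and to warn (rather than fail) where none exist, e.g. for
# the CUDA CTC loss. max_epochs is just a placeholder value.
trainer = KrakenTrainer(deterministic='warn',
                        max_epochs=50)

# trainer.fit(model)  # fit the recognition model exactly as before
```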