BERT fine-tuning does not always produce the same outcome for identical hyperparameters, for two main reasons: the random initialization of the task-specific weights and the order in which the training data is shuffled. Unless both random seeds are fixed explicitly, two training runs are practically never identical, so the same specification and hyperparameters can still lead to noticeably different results.
This complicates hyperparameter tuning, since it is not clear whether an improvement should be attributed to a better hyperparameter choice or simply to a luckier weight initialization or data order.
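To make the two sources of randomness concrete, here is a minimal sketch in plain PyTorch that keeps them separate: one seed governs the initialization of the new classification head, the other governs how the training data is shuffled. The feature dimension, the tiny random dataset, and the seed values are placeholders for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical tiny dataset, only to make the loader runnable.
features = torch.randn(128, 768)
labels = torch.randint(0, 2, (128,))
dataset = TensorDataset(features, labels)

# Seed 1: weight initialization. The pretrained encoder weights come
# from the checkpoint; only the new task-specific head is randomly
# initialized (dropout adds further seeded randomness during training).
torch.manual_seed(42)
classifier_head = torch.nn.Linear(768, 2)

# Seed 2: data order. A separate generator controls how the DataLoader
# shuffles the training examples each epoch.
data_generator = torch.Generator().manual_seed(7)
train_loader = DataLoader(dataset, batch_size=16, shuffle=True,
                          generator=data_generator)

# Two runs only reproduce each other if *both* seeds are fixed;
# changing either one is enough to end up with a different model.
```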
Tips to handle this:
evaluate the model multiple times during an epoch, not only at its end
identify runs with bad initializations early and stop them
train multiple runs of the same configuration and ensemble them (see the sketch below)
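As an illustration of these tips, the sketch below runs several seeds in parallel, checks validation performance several times per epoch, discards runs that fall clearly behind the best run so far, and keeps the survivors for an ensemble. This is a simple heuristic in the spirit of the tips above, not the algorithm from the paper cited below; `fine_tune_steps` and `evaluate` are placeholders for the project's actual training and evaluation code, and the margin and step counts are arbitrary.

```python
import random

NUM_SEEDS = 10          # number of independent fine-tuning runs
EVALS_PER_EPOCH = 4     # validation checks within each epoch
DISCARD_MARGIN = 0.05   # drop runs this far behind the current best

def fine_tune_steps(model_state, seed, steps):
    """Placeholder for a chunk of real fine-tuning (hypothetical)."""
    return model_state  # a real implementation would update the weights

def evaluate(model_state):
    """Placeholder validation metric (hypothetical); returns a score."""
    return random.random()

runs = {seed: {"state": None, "score": 0.0, "alive": True}
        for seed in range(NUM_SEEDS)}

for checkpoint in range(EVALS_PER_EPOCH):
    best = max(run["score"] for run in runs.values())
    for seed, run in runs.items():
        if not run["alive"]:
            continue
        run["state"] = fine_tune_steps(run["state"], seed, steps=100)
        run["score"] = evaluate(run["state"])
        # Stop runs whose early validation score lags clearly behind.
        if run["score"] < best - DISCARD_MARGIN:
            run["alive"] = False

# The surviving runs can then be ensembled, e.g. by averaging their
# predicted logits at inference time (not shown here).
survivors = [seed for seed, run in runs.items() if run["alive"]]
print(f"Runs kept for the ensemble: {survivors}")
```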
The authors of the paper cited below also found that some weight initializations are globally better than others (across tasks as well).
They also developed an algorithm for deciding when to stop a run with a bad weight initialization, which we could implement as well (although they did not publish their code).
This comment follows Dodge et al. (2020), "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping": https://arxiv.org/pdf/2002.06305.pdf