tarrade / proj_multilingual_text_classification

Explore multilingual text classification using embeddings, BERT and deep learning architectures
Apache License 2.0

Reproducibility of BERT Model Training Runs #59

Closed · vluechinger closed this issue 4 years ago

vluechinger commented 4 years ago

BERT fine-tuning does not always produce the same results for identical parameters, for two major reasons: random weight initialization and data order (shuffling). Because of this inherent randomness, it is nearly impossible to reproduce a training run exactly, which means the same specifications and parameters can lead to different final results.

This complicates hyperparameter tuning, since it is not clear whether a model improvement can be attributed to a better hyperparameter value or simply to a more fortunate weight initialization or data order.
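The usual way to control these two sources of randomness is to pin every random seed before the model and the data loader are built. Below is a minimal sketch assuming a PyTorch-based setup (the repository may use TensorFlow instead); `set_seed` is just an illustrative helper, not an existing function in this project:

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Pin every controllable source of randomness for one training run."""
    random.seed(seed)                     # Python's own RNG (plain-Python shuffling)
    np.random.seed(seed)                  # NumPy-based preprocessing / batching
    torch.manual_seed(seed)               # CPU weight initialization and dropout
    torch.cuda.manual_seed_all(seed)      # GPU RNGs on all devices
    os.environ["PYTHONHASHSEED"] = str(seed)
    # cuDNN: trade some speed for run-to-run determinism
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Call once before building the model and the DataLoader, so that both the
# classification-head initialization and the shuffling order are reproducible.
set_seed(42)
```

With fixed seeds, two runs with the same hyperparameters become comparable, which makes it easier to attribute improvements to the hyperparameters themselves.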

Tips to handle this:

- The authors of the paper cited below found that some weight initializations are globally better than others (across tasks as well).
- They also developed an algorithm for deciding when to stop a run with a bad weight initialization, which we could implement as well (although they did not publish their code); a rough sketch of the idea follows below.

This comment follows Dodge et al., "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping": https://arxiv.org/pdf/2002.06305.pdf
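Since the authors did not release their code, the following is only a sketch of the underlying idea (run several seeds for a short probe phase, then continue training only the most promising ones), not their actual algorithm. `train_steps` and `evaluate` stand in for this project's own training and validation routines:

```python
from typing import Callable, List, Tuple


def filter_bad_seeds(
    seeds: List[int],
    train_steps: Callable[[int, int], object],  # (seed, n_steps) -> partially trained model
    evaluate: Callable[[object], float],        # model -> validation metric (higher is better)
    probe_steps: int = 500,
    keep_top_k: int = 2,
) -> List[int]:
    """Train each seed for a short probe phase and keep only the most promising ones."""
    scored: List[Tuple[float, int]] = []
    for seed in seeds:
        model = train_steps(seed, probe_steps)  # short, cheap partial run
        scored.append((evaluate(model), seed))
    scored.sort(reverse=True)                   # best validation score first
    return [seed for _, seed in scored[:keep_top_k]]


# Example: probe six seeds briefly, then fully fine-tune only the two best.
# best_seeds = filter_bad_seeds([0, 1, 2, 3, 4, 5], train_steps, evaluate)
```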

tarrade commented 4 years ago

Yes, this should be used every time.