By tracing `transformers` running the same model on a fill-mask task, I was able to determine that the execution of `transformers` and `bert-burn` diverges at the point where normalization happens.
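For reference, this is roughly how I captured the `transformers` side of the trace; a minimal sketch, assuming `roberta-base` and an arbitrary masked sentence, using `output_hidden_states` to expose the per-layer activations:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

inputs = tok("The capital of France is <mask>.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; each later entry is one encoder
# layer's output. Comparing these against bert-burn's intermediate tensors
# shows where the two implementations start to diverge.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i}: {h[0, 0, :4].tolist()}")
```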
Furthermore, by loading the `lm_head` weights for `roberta-base` and attaching an LM head model, I was able to verify that `bert-burn`'s results are correct with `norm_first: false`, but entirely wrong with `norm_first: true`.
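For context, my understanding of what `norm_first` toggles, as an illustrative PyTorch sketch rather than `bert-burn`'s actual implementation: with `norm_first: false`, LayerNorm is applied after each residual add, which is the ordering BERT and RoBERTa were trained with; with `norm_first: true`, it is applied before each sublayer (the class names below are mine):

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """norm_first: false — LayerNorm after each residual add (BERT/RoBERTa ordering)."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])  # normalize *after* the residual
        return self.norm2(x + self.ff(x))

class PreNormBlock(PostNormBlock):
    """norm_first: true — LayerNorm before each sublayer (pre-norm ordering)."""
    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]              # normalize *before* the sublayer
        return x + self.ff(self.norm2(x))
```

Since the pretrained RoBERTa weights were produced with the post-norm ordering, running them through a pre-norm block would plausibly produce exactly the kind of wrong output I observed.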
I'd be happy to open a pull request, but I'm not sure whether any other BERT models actually use `norm_first: true`; I'm very new to machine learning and not familiar with this family of models.