tracel-ai / models

Models and examples built with Burn
Apache License 2.0

RoBERTa weights do not have encoders with norm_first: true #30

Closed seurimas closed 4 months ago

seurimas commented 5 months ago

By tracing transformers and bert-burn running the same model on a fill-mask task, I was able to determine that their executions diverge at the point where layer normalization happens.

Furthermore, by loading the lm_head weights for roberta-base and attaching an LM head model, I was able to verify that bert-burn's results are correct with norm_first: false, but entirely wrong with norm_first: true.
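For anyone unfamiliar with the setting: norm_first controls whether layer normalization is applied before the sublayer (pre-LN) or after the residual add (post-LN). A minimal standalone sketch (not bert-burn's actual code; `sublayer` is a dummy stand-in for attention/FFN) shows the two orderings produce different outputs, which is why weights trained one way are wrong under the other:

```rust
// Plain layer norm over a vector (no learned scale/shift, for brevity).
fn layer_norm(x: &[f32]) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let std = (var + 1e-5).sqrt();
    x.iter().map(|v| (v - mean) / std).collect()
}

// Dummy sublayer standing in for attention/feed-forward: doubles each element.
fn sublayer(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v * 2.0).collect()
}

fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

// norm_first: false (post-LN): normalize after the residual add.
fn post_ln(x: &[f32]) -> Vec<f32> {
    layer_norm(&add(x, &sublayer(x)))
}

// norm_first: true (pre-LN): normalize the input before the sublayer.
fn pre_ln(x: &[f32]) -> Vec<f32> {
    add(x, &sublayer(&layer_norm(x)))
}

fn main() {
    let x = vec![1.0, 2.0, 3.0, 4.0];
    let post = post_ln(&x);
    let pre = pre_ln(&x);
    println!("post-LN: {:?}", post);
    println!("pre-LN:  {:?}", pre);
    // The outputs differ, so checkpoints trained with one ordering
    // cannot be loaded under the other without wrong results.
    assert!(post.iter().zip(&pre).any(|(a, b)| (a - b).abs() > 1e-3));
}
```

The original BERT and RoBERTa architectures use post-LN, which matches the norm_first: false result above.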

I'd be happy to provide a pull request, but I'm not sure whether any other BERT models use norm_first: true. I'm very new to machine learning and am not familiar with this family of models.

nathanielsimard commented 4 months ago

Yeah, I don't think RoBERTa actually uses norm_first: true; that was probably a mistake. Happy to review your PR :)