microsoft / DeBERTa

The implementation of DeBERTa

Pre-training times: v2 vs. v3 #100

Open · stefan-it opened this issue 2 years ago

stefan-it commented 2 years ago

Hi,

it would be very interesting to also see a comparison of pre-training times for DeBERTa v2 versus the recently released v3, which uses replaced token detection (RTD).

The v2 paper mentioned pre-training times:

[Table from the DeBERTa v2 paper listing pre-training times]

But what about the v3 base, large, and multilingual models? :thinking:

WissamAntoun commented 2 years ago

I was trying to pre-train DeBERTa-v2 with the RTD objective (but without gradient-disentangled embedding sharing), and noticed that it runs considerably slower than ELECTRA (which is BERT-based).
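
For context, the RTD setup described here follows ELECTRA: a small generator does masked language modeling, and the main model (the discriminator) predicts per token whether it was replaced by a generator sample. Below is a minimal sketch of that combined loss; `generator` and `discriminator` are placeholder modules, not this repo's actual classes, and the loss weight follows the ELECTRA paper rather than any setting confirmed in this thread:

```python
import torch
import torch.nn.functional as F

def rtd_loss(generator, discriminator, input_ids, masked_ids, mask_positions):
    """ELECTRA-style RTD loss sketch.

    input_ids:      (B, T) original token ids
    masked_ids:     (B, T) input with [MASK] at masked positions
    mask_positions: (B, T) boolean mask of masked positions
    """
    # 1) Generator does plain MLM on the masked input.
    gen_logits = generator(masked_ids)  # (B, T, vocab)
    mlm_loss = F.cross_entropy(
        gen_logits[mask_positions], input_ids[mask_positions]
    )

    # 2) Sample replacement tokens from the generator; no gradient
    #    flows back through the sampling step.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(
            logits=gen_logits[mask_positions]
        ).sample()
    corrupted = input_ids.clone()
    corrupted[mask_positions] = sampled

    # 3) Discriminator predicts, per token, whether it was replaced.
    is_replaced = (corrupted != input_ids).float()      # (B, T)
    disc_logits = discriminator(corrupted).squeeze(-1)  # (B, T)
    rtd = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    # ELECTRA weights the discriminator loss with lambda = 50.
    return mlm_loss + 50.0 * rtd
```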

I did some quick benchmarking and found that DeBERTa is roughly twice as slow as BERT at inference:

[Benchmark chart comparing DeBERTa and BERT inference times]
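
For anyone wanting to reproduce this kind of comparison, here is a rough timing sketch using the Hugging Face `transformers` API. The model names, batch size, and sequence length are illustrative assumptions, not the exact setup used in this comment:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

def benchmark(model_name, n_runs=20, batch_size=8, seq_len=128):
    """Return the mean forward-pass time in seconds for model_name."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()

    # Dummy batch of identical sentences, padded to a fixed length.
    texts = ["The quick brown fox jumps over the lazy dog."] * batch_size
    inputs = tokenizer(
        texts, padding="max_length", truncation=True,
        max_length=seq_len, return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        # Warm-up runs so one-time setup cost is excluded from timing.
        for _ in range(3):
            model(**inputs)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

for name in ["bert-base-uncased", "microsoft/deberta-base"]:
    print(f"{name}: {benchmark(name):.4f} s per batch")
```

The gap is plausible architecturally: DeBERTa's disentangled attention computes extra content-to-position and position-to-content attention terms on top of the standard content-to-content term, so each attention layer does more work than BERT's.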