microsoft / DeBERTa

The implementation of DeBERTa
MIT License

Info on Deberta-v2-xlarge training infra #125

Open karthickgopalswamy opened 1 year ago

karthickgopalswamy commented 1 year ago

The paper discusses training the DeBERTa-base, DeBERTa-large, and DeBERTa 1.5B models on V100 GPUs. How was DeBERTa-v2-xlarge trained? Are its training settings the same as those used for the large model in the paper? Given that DeBERTa-v2-xlarge has 900M parameters, was any tensor parallelism used during training?
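For context on why the question arises, a back-of-the-envelope estimate of optimizer memory shows that a 900M-parameter model trained with Adam in mixed precision already approaches the capacity of a 16 GB V100 before any activations are counted. This is purely illustrative arithmetic with assumed byte counts per state, not the repo's actual training configuration:

```python
# Rough memory estimate for a 900M-parameter model trained with Adam
# in FP16 mixed precision (illustrative assumptions, not DeBERTa's
# actual setup).
params = 900e6

fp16_weights = params * 2   # model weights stored in FP16 (2 bytes each)
fp32_master  = params * 4   # FP32 master copy of weights (4 bytes each)
adam_m       = params * 4   # Adam first-moment state, FP32
adam_v       = params * 4   # Adam second-moment state, FP32
fp16_grads   = params * 2   # gradients in FP16

total_bytes = fp16_weights + fp32_master + adam_m + adam_v + fp16_grads
print(f"~{total_bytes / 2**30:.1f} GiB before activations")  # ~13.4 GiB
```

At roughly 13 GiB of model and optimizer state alone, adding activation memory for long sequences makes it plausible that gradient checkpointing, sharded optimizers, or tensor parallelism would be needed on 16 GB cards, which is presumably what motivates the question.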