This PR includes experiments to train wider versions of BERT and those that get the full 80% sparsity.

Here's a table of results with Wide BERT:
| width | sparsity | on-params | eval loss |
|---|---|---|---|
| 1.0x | 80% | 850,510 | 3.578 |
| 1.25x | 84.3% | 842,474 | 3.514 |
| 1.50x | 88% | 865,227 | 3.461 |
| 2.0x | 90.83% | 843,781 | 3.469 |
| 4.875x | 97.2% | 834,243 | 3.438 |
| 4.875x (8 att. heads) | 97% | 918,441 | 3.409 |
| Small BERT | 96.95% | 924,208 | 3.317 |
The wider architecture does seem to help, although given a comparable number of params, it doesn't do any better than Small BERT with a deeper architecture. Thus, there's still more to investigate.
Per the full 80% sparsity: there's a new param, `sparsify_all_embeddings`. This sparsifies the word embeddings, as usual, but also the token and position embeddings. It's now the default, and old configs have been updated accordingly. Results show that sparsifying these additional embeddings actually helps accuracy on both pre-training and fine-tuning. Also, since the model has dense layer normalization params, we'll need to set the sparsity slightly higher than 80% to hit the overall target. In the config, that will look something like `sparsity=0.801`.
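For concreteness, here's a back-of-the-envelope sketch of where a figure like 0.801 comes from. The parameter counts below are made up for illustration; only the target and the layer-norm caveat come from this PR:

```python
# Why per-tensor sparsity must sit slightly above 0.80: layer-norm params
# stay dense, so the sparsifiable tensors have to overshoot the overall
# target a bit. (All counts below are hypothetical, for illustration only.)
total_params = 4_250_000      # total model params (hypothetical)
dense_params = 10_000         # dense layer-norm params (hypothetical)
target = 0.80                 # desired overall sparsity

# All zeroed params must come from the sparsifiable tensors:
per_tensor_sparsity = target * total_params / (total_params - dense_params)
print(f"{per_tensor_sparsity:.4f}")  # ~0.8019 -> round to sparsity=0.801
```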