numenta / nupic.research

Experimental algorithms. Unsupported.
https://nupicresearch.readthedocs.io
GNU Affero General Public License v3.0

Wide BERT and Getting the Full 80% Sparsity #514

Closed mvacaporale closed 3 years ago

mvacaporale commented 3 years ago

This PR includes experiments to train wider versions of BERT and those that get the full 80% sparsity.

Here's a table of results with Wide BERT:

| width | sparsity | on-params | eval loss |
|-----------------------|--------|---------|-------|
| 1.0x | 80% | 850,510 | 3.578 |
| 1.25x | 84.3% | 842,474 | 3.514 |
| 1.50x | 88% | 865,227 | 3.461 |
| 2.0x | 90.83% | 843,781 | 3.469 |
| 4.875x | 97.2% | 834,243 | 3.438 |
| 4.875x (8 att. heads) | 97% | 918,441 | 3.409 |
| Small BERT | 96.95% | 924,208 | 3.317 |

The wider architecture does seem to help, although given a comparable number of params, it doesn't do any better than Small BERT with a deeper architecture. Thus, there's still more to investigate.
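As a sanity check on the table, here's a minimal sketch that backs out the implied dense parameter count per row, assuming "on-params" simply means the count of nonzero weights, i.e. total params × (1 - sparsity); the function and variable names are illustrative, not from the repo. It shows how the sparsity is raised with width so the nonzero budget stays roughly constant around ~850K.

```python
# Hypothetical helper relating the table's columns, assuming
# on_params = total_params * (1 - sparsity) (nonzero weights only).
def implied_dense_params(on_params: int, sparsity: float) -> int:
    """Back out the dense parameter count implied by a table row."""
    return round(on_params / (1.0 - sparsity))

rows = [
    ("1.0x", 0.80, 850_510),
    ("1.25x", 0.843, 842_474),
    ("1.50x", 0.88, 865_227),
    ("2.0x", 0.9083, 843_781),
    ("4.875x", 0.972, 834_243),
]

for width, sparsity, on_params in rows:
    print(f"{width:>7}: ~{implied_dense_params(on_params, sparsity):,} dense params")
```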

Per the full 80% sparsity: There's a new param sparsify_all_embeddings. That sparsifies the word embeddings, as usual, but also the token-type and position embeddings. This is now the default, and old configs have been updated accordingly. Results show that sparsifying these additional embeddings actually helps accuracy on both pre-training and fine-tuning. As well, since the model keeps dense layer-normalization params, we'll need to set the sparsity slightly higher than 80%. In the config, that will look something like sparsity=0.801.
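For reference, a minimal sketch of what a config fragment might look like with the new flag. Only sparsify_all_embeddings and the slightly-raised sparsity come from this PR; the remaining keys and values are placeholders, not copied from an actual experiment config.

```python
# Illustrative config fragment (not an actual config from this repo).
wide_bert_1_25x = dict(
    model_type="sparse_bert",  # placeholder model name
    config_kwargs=dict(
        # Sparsify word, token-type, and position embeddings (new default).
        sparsify_all_embeddings=True,
        # Slightly above 0.80 to compensate for the dense layer-norm
        # params, so overall sparsity lands at the full 80%.
        sparsity=0.801,
    ),
)
```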

benja-matic commented 3 years ago

Nice work, looks good to me. Duly noted re new param sparsify_all_embeddings.