This PR includes experiments to train wider versions of BERT and those that get the full 80% sparsity.

Here's a table of results with Wide BERT:
| width | sparsity | on-params | eval loss |
|---|---|---|---|
| 1.0x | 80% | 850,510 | 3.578 |
| 1.25x | 84.3% | 842,474 | 3.514 |
| 1.50x | 88% | 865,227 | 3.461 |
| 2.0x | 90.83% | 843,781 | 3.469 |
| 4.875x | 97.2% | 834,243 | 3.438 |
| 4.875x (8 att. heads) | 97% | 918,441 | 3.409 |
| Small BERT | 96.95% | 924,208 | 3.317 |
The wider architecture does seem to help, although given a comparable number of params, it doesn't do any better than Small BERT with a deeper architecture. Thus, there's still more to investigate.
Per the full 80% sparsity: there's a new param, `sparsify_all_embeddings`. This sparsifies the word embeddings, as usual, but also the token and position embeddings. It's now the default, and old configs have been updated accordingly. Results show that sparsifying these additional embeddings actually helps accuracy on both pre-training and fine-tuning. Also, since the model has dense layer normalization params, we'll need to set the sparsity slightly higher than 80% to hit the overall target. In the config, that will look something like `sparsity=0.801`.
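For concreteness, here's a back-of-the-envelope sketch of where a figure like 0.801 comes from. The parameter counts below are made up for illustration; only the target and the layer-norm caveat come from this PR:

```python
# Why per-tensor sparsity must sit slightly above 0.80: layer-norm params
# stay dense, so the sparsifiable tensors have to overshoot the overall
# target a bit. (All counts below are hypothetical, for illustration only.)
total_params = 4_250_000      # total model params (hypothetical)
dense_params = 10_000         # dense layer-norm params (hypothetical)
target = 0.80                 # desired overall sparsity

# All zeroed params must come from the sparsifiable tensors:
per_tensor_sparsity = target * total_params / (total_params - dense_params)
print(f"{per_tensor_sparsity:.4f}")  # ~0.8019 -> round to sparsity=0.801
```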