Per the new models:
There are two new models that sparsify either the whole encoder or just the non-attention layers.
There's a class decorator register_bert_model for Bert models that automatically creates companion BertConfig, BertForMaskedLM, and BertForSequenceClassification classes and then registers them with the Transformers library so they can be loaded through AutoConfig, AutoModelForMaskedLM, and AutoModelForSequenceClassification. More on this below.
Per the checkpoints:
I modified our previously trained checkpoint in bert-steps_100k-sparsity_0.8 to use one of those new models so we can load it more easily. The updated model is in bert-steps_100k-sparsity_0.8_updated.
I exported this checkpoint to /mnt/efs/results/pretrained-models/transformers-local/static_sparse_non_attention_bert_100k.
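For reference, here's a minimal sketch of loading that export, assuming the directory is a standard Transformers checkpoint and that the sparse model classes have already been imported so register_bert_model has registered them with the Auto classes:
from transformers import AutoModelForMaskedLM

# Export path from above.
path = "/mnt/efs/results/pretrained-models/transformers-local/static_sparse_non_attention_bert_100k"

# The registered config/model classes are resolved automatically via the checkpoint's model_type.
model = AutoModelForMaskedLM.from_pretrained(path)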
Per the configs:
The config finetuning_sparse_bert_100k_glue fine-tunes that model and is listed in the updated README under sparse_v1_100k.
Currently, I'm running static_sparse_encoder_bert_100k, which sparsifies all layers, including attention. I'll update those results soon.
Example with register_bert_model:
@register_bert_model
class SparseBertModel(BertModel):

    @dataclass
    class ConfigKWargs:
        # Keyword arguments to configure sparsity.
        sparsity: float = 0.9

    # Define __init__, etc.
    ...
This will automatically create new classes called SparseBertConfig, SparseBertForMaskedLM, and SparseBertForSequenceClassification. Notice that the naming is automatic and is derived from the name of your original class: if you define DynamicSparseBertModel, you'd get DynamicSparseBertConfig and so on.
As soon as you define the class, it's ready to autoload. For instance, you could do:
config = AutoConfig.for_model(model_type="sparse_bert", sparsity=0.5)
model = AutoModelForMaskedLM.from_config(config)
type(model)
>>> SparseBertForMaskedLM
Notice how the model_type "sparse_bert" has also been formatted automatically from the class name; in the other example, you'd use model_type="dynamic_sparse_bert". The config is also already equipped to accept the argument sparsity, which can be accessed by your model. Thus, you can run
config.sparsity
>>> 0.5
This comes from the ConfigKWargs defined above. You can add whatever arguments you want to that dataclass. With this, we can modify our experiment configs and configure our models as desired.
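As a sketch of adding more options (the prune_interval field below is purely illustrative, not an existing argument), any field added to ConfigKWargs shows up on the generated config the same way:
@register_bert_model
class DynamicSparseBertModel(BertModel):

    @dataclass
    class ConfigKWargs:
        # Keyword arguments to configure sparsity.
        sparsity: float = 0.9
        # Hypothetical extra argument, included only to illustrate the pattern.
        prune_interval: int = 100

    # Define __init__, etc.
    ...

config = AutoConfig.for_model(model_type="dynamic_sparse_bert", sparsity=0.5, prune_interval=50)
config.prune_interval
>>> 50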