mvacaporale closed this pull request 3 years ago
This won't work with the distillation mixin. I am approving anyway so we can merge and move on, but this should be fixed before we can combine the mixins.
Thanks for catching this. I'll make RigL into a mixin; that seems to be the only way. But I'll leave it for a soon-to-come PR. Otherwise, all of your other comments have been addressed.
RigL
The PR includes utilities for global pruning by RigL:

- CosineDecayPruneScheduler class for decaying the pruning rate throughout training (a rough sketch of this decay is included below this list)
- RigLCallback to integrate these utilities and train SparseWeight models using RigL
- PlotDensitiesCallback to log the densities of each layer

Note: These are sparse models.
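For context, here is a minimal sketch of a cosine decay for the pruning rate, following the schedule described in the RigL paper. The class name, constructor arguments, and method are hypothetical placeholders and may not match the CosineDecayPruneScheduler implemented in this PR.

```python
import math


class CosineDecaySketch:
    """Hypothetical stand-in for a cosine-decay prune scheduler.

    The fraction of weights pruned/regrown at step t decays as
    f(t) = f_0 / 2 * (1 + cos(pi * t / T)), reaching 0 at t = T.
    """

    def __init__(self, prune_fraction=0.3, total_steps=10_000):
        self.prune_fraction = prune_fraction  # initial fraction f_0
        self.total_steps = total_steps        # decay horizon T

    def get_prune_fraction(self, step):
        step = min(step, self.total_steps)
        return self.prune_fraction / 2 * (
            1 + math.cos(math.pi * step / self.total_steps)
        )


scheduler = CosineDecaySketch(prune_fraction=0.3, total_steps=10_000)
print(scheduler.get_prune_fraction(0))       # 0.3  -> aggressive rewiring early on
print(scheduler.get_prune_fraction(5_000))   # 0.15
print(scheduler.get_prune_fraction(10_000))  # 0.0  -> topology frozen by the end
```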
OneCycleLR
This PR also includes a mixin for using the OneCycleLR scheduler and an LR Range Test mixin to help find good bounds for the min and max learning rates. The results are below. The max_lr was tuned manually to 0.01, while all the other params are left at the scheduler's defaults.
Note: These are dense models.
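For reference, here is a minimal sketch of wiring up PyTorch's built-in OneCycleLR with max_lr=0.01 and everything else at its defaults; the model, optimizer, and step counts are placeholders and are not taken from this PR.

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(10, 2)                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epochs, steps_per_epoch = 30, 100                # placeholder training length
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.01,                                 # the manually tuned value
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    # pct_start, div_factor, final_div_factor, etc. stay at their defaults
)

for _ in range(epochs):
    for _ in range(steps_per_epoch):
        optimizer.step()                         # loss.backward() omitted for brevity
        scheduler.step()                         # OneCycleLR steps once per batch
```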
TODO:
- trainer_extra_kwargs will be logged to wandb so that the LR params therein get logged. I'm currently looking into this and will add it in a future PR.
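In case it's useful, below is a rough sketch of what that logging could look like with the wandb API. The project name and the contents of trainer_extra_kwargs are made up for illustration; this is not the actual integration, which is deferred to a future PR.

```python
import wandb

# Hypothetical extra trainer kwargs carrying the LR params.
trainer_extra_kwargs = dict(max_lr=0.01, pct_start=0.3, div_factor=25.0)

wandb.init(project="rigl-experiments")       # placeholder project name
wandb.config.update(trainer_extra_kwargs)    # surfaces the LR params in the run config
```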