This PR is for GMP pruning mixins, configs, and related functionality. It includes the following changes:
- Enable models that are fully dense at initialization but sparsifiable later.
- Update the wide tiny-bert experiments.
- Add a script to create a prunable checkpoint (one with `SparseWeight` modules) from a densely trained model; a conversion sketch follows this list.
- Add an LR schedule for GMP pruning on a fully dense model.
- Make the `RezeroWeights` callback configurable to log every `log_steps` steps instead of after every training step.
- Add two new mixins, `GradualMagnitudePruningMixin` and `ThreeStageLRMixin`; the pruning schedule is sketched after this list.
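For the checkpoint-conversion script, the idea is to wrap each dense layer in a module that carries a pruning mask, so a densely trained checkpoint can later be sparsified in place. Below is a minimal sketch; the `SparseWeight` wrapper here is a stand-in for illustration, and the repo's actual module and conversion script may differ in name and signature.

```python
# Hypothetical sketch: wrap each dense Linear in a mask-carrying module so a
# densely trained checkpoint becomes prunable. The real SparseWeight module
# in this repo may have a different name and interface.
import torch
import torch.nn as nn


class SparseWeight(nn.Module):
    """Wraps a dense layer with an all-ones mask (fully dense, but prunable)."""

    def __init__(self, module: nn.Linear):
        super().__init__()
        self.module = module
        self.register_buffer("mask", torch.ones_like(module.weight))

    def rezero_weights(self):
        # Zero out pruned weights. The mask starts all-ones, so this is a
        # no-op until GMP pruning updates the mask.
        self.module.weight.data *= self.mask

    def forward(self, x):
        return self.module(x)


def make_prunable(model: nn.Module) -> nn.Module:
    """Replace every nn.Linear with a SparseWeight wrapper, keeping weights."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, SparseWeight(child))
        else:
            make_prunable(child)
    return model


# Usage: load a densely trained checkpoint, wrap it, and save the result.
# model.load_state_dict(torch.load("dense.pt"))
# torch.save(make_prunable(model).state_dict(), "prunable.pt")
```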
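For the pruning mixin itself, gradual magnitude pruning typically ramps the sparsity target over training and prunes the smallest-magnitude weights at each pruning step. The sketch below uses the standard cubic ramp from Zhu & Gupta (2017); the function names, parameters, and per-tensor granularity are illustrative assumptions, not this repo's exact API.

```python
# Illustrative sketch of gradual magnitude pruning (GMP): a cubic sparsity
# ramp plus a magnitude-based mask. Parameter names are assumptions.
import torch


def target_sparsity(step, start_step, end_step, initial=0.0, final=0.8):
    """Cubic ramp from `initial` to `final` sparsity between start and end."""
    if step <= start_step:
        return initial
    if step >= end_step:
        return final
    progress = (step - start_step) / (end_step - start_step)
    return final + (initial - final) * (1.0 - progress) ** 3


def prune_to_sparsity(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a mask that zeros the smallest-magnitude fraction of weights."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()
```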
The last two mixins can be used together or independently, depending on the use case. GMP pruning may be applied during pre-training or afterwards. If the latter, `ThreeStageLRMixin` should be used to enable LR phases of stabilization, pruning, and fine-tuning. Otherwise, a OneCycle LR or another schedule may be used for GMP throughout pre-training. As of now, there are experiments that try both methods, and it's an open question which leads to the best eval loss.