mvacaporale closed this pull request 3 years ago
This won't work with the distillation mixin. I am approving anyway so we can merge and move on, but this should be fixed before we can combine the mixins.
Thanks for catching this. I'll make RigL into a mixin; that seems to be the only way. But I'll leave it for a soon-to-come PR. Otherwise, all of your other comments have been addressed.
RigL
The PR includes utilities for global pruning by RigL:

- CosineDecayPruneScheduler class for decaying the pruning rate throughout training (a rough sketch of this decay is included below this list)
- RigLCallback to integrate these utilities and train SparseWeight models using RigL
- PlotDensitiesCallback to log the densities of each layer

Note: These are sparse models.
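For context, here is a minimal sketch of a cosine decay for the pruning rate, following the schedule described in the RigL paper. The class name, constructor arguments, and method are hypothetical placeholders and may not match the CosineDecayPruneScheduler implemented in this PR.

```python
import math


class CosineDecaySketch:
    """Hypothetical stand-in for a cosine-decay prune scheduler.

    The fraction of weights pruned/regrown at step t decays as
    f(t) = f_0 / 2 * (1 + cos(pi * t / T)), reaching 0 at t = T.
    """

    def __init__(self, prune_fraction=0.3, total_steps=10_000):
        self.prune_fraction = prune_fraction  # initial fraction f_0
        self.total_steps = total_steps        # decay horizon T

    def get_prune_fraction(self, step):
        step = min(step, self.total_steps)
        return self.prune_fraction / 2 * (
            1 + math.cos(math.pi * step / self.total_steps)
        )


scheduler = CosineDecaySketch(prune_fraction=0.3, total_steps=10_000)
print(scheduler.get_prune_fraction(0))       # 0.3  -> aggressive rewiring early on
print(scheduler.get_prune_fraction(5_000))   # 0.15
print(scheduler.get_prune_fraction(10_000))  # 0.0  -> topology frozen by the end
```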
OneCycleLR
This PR also includes a mixin for using the OneCycleLR scheduler and an LR Range Test mixin to help find good bounds for the min and max learning rates. The results are below. The max_lr was tuned manually to 0.01, while all the other params are left at the scheduler's defaults.
Note: These are dense models.
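For reference, here is a minimal sketch of wiring up PyTorch's built-in OneCycleLR with max_lr=0.01 and everything else at its defaults; the model, optimizer, and step counts are placeholders and are not taken from this PR.

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(10, 2)                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epochs, steps_per_epoch = 30, 100                # placeholder training length
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.01,                                 # the manually tuned value
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    # pct_start, div_factor, final_div_factor, etc. stay at their defaults
)

for _ in range(epochs):
    for _ in range(steps_per_epoch):
        optimizer.step()                         # loss.backward() omitted for brevity
        scheduler.step()                         # OneCycleLR steps once per batch
```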
TODO:
- trainer_extra_kwargs will be logged to wandb so that the LR params therein get logged. I'm currently looking into this and will add it in a future PR.
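In case it's useful, below is a rough sketch of what that logging could look like with the wandb API. The project name and the contents of trainer_extra_kwargs are made up for illustration; this is not the actual integration, which is deferred to a future PR.

```python
import wandb

# Hypothetical extra trainer kwargs carrying the LR params.
trainer_extra_kwargs = dict(max_lr=0.01, pct_start=0.3, div_factor=25.0)

wandb.init(project="rigl-experiments")       # placeholder project name
wandb.config.update(trainer_extra_kwargs)    # surfaces the LR params in the run config
```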