After fixing a number of issues and comparing my implementation against a run of the original authors' implementation to verify correctness, I am now confident that our implementations agree, and I have obtained the following results, which replace those above:
This is with 30 samples in the Bayesian model averages and 50 "SWAG samples":
This is with 30 samples in the Bayesian model averages and 150 "SWAG samples":
Much clearer improvement from including local uncertainty!
Analysis of SWAG performance versus approximation rank (1a)
(Dataset: FashionMNIST, Model: DenseNet with depth 10)
I wanted to get an idea of how the SWAG posterior approximation's performance improves with rank. Since I was writing most of the code alongside running the experiments, the experiments themselves are run in notebooks on a Colab GPU. I intend to standardise these so that parameterised experiments can be run from the command line; for now I'll just link the notebooks.
1. I used SGD to descend to a suitably strong mode which would act as the pretrained solution for the SWAG sampling process.
Note that rather than using a final learning rate (SWA_LR) of 0.05, as in the original paper, I used 0.005, as this appeared to lead to a more stable mode (suggesting that with this learning rate the SGD iterates had reached a suitable stationary distribution). The final training graph looks like this:
Final pretrained model performance is:
Note also that I adopted the same learning rate schedule when training the initial solution as in the original paper, namely:
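For reference, here is a minimal sketch of that schedule as I understand it from the authors' released code (constant lr_init for the first half of training, linear decay to SWA_LR between 50% and 90%, then constant SWA_LR; the function name is mine, and the swa_lr default reflects the 0.005 mentioned above):

```python
def lr_schedule(epoch, num_epochs, lr_init, swa_lr=0.005):
    """Piecewise LR: hold lr_init, decay linearly to swa_lr, then hold swa_lr."""
    t = epoch / num_epochs
    if t <= 0.5:
        factor = 1.0  # constant lr_init for the first half of training
    elif t <= 0.9:
        # linear decay from lr_init down to swa_lr over t in (0.5, 0.9]
        factor = 1.0 - (1.0 - swa_lr / lr_init) * (t - 0.5) / 0.4
    else:
        factor = swa_lr / lr_init  # constant swa_lr for the tail
    return lr_init * factor
```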
2. I then used this pretrained solution to build a SWAG model (more as a test of the Posterior and Sampling classes; a sketch of the moment updates involved is included at the end of this step).
The notebook is: https://colab.research.google.com/drive/1tma5QHPAM8K9dRjBfUV0Qv_C_yiQQtwP?usp=sharing
The model in the notebook above was trained with the following parameters:
I plotted the SWA performance (SWA is equivalent to the expected value of the SWAG posterior) as a proxy for the validation learning curve of the SWAG solution.
I also plotted the performance of the final SWAG model against the number of samples drawn for Bayesian model averaging:
This looks weird; I may want to investigate the code.
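For context, here is a minimal sketch of the moment collection and sampling rule being tested, following the update and sampling equations from the SWAG paper (the class and method names here are mine, not the repo's actual API):

```python
import torch

class SWAGPosterior:
    """Running SWAG moments over a flattened parameter vector."""

    def __init__(self, num_params, max_rank=30):
        self.n = 0
        self.mean = torch.zeros(num_params)      # running mean of SGD iterates
        self.sq_mean = torch.zeros(num_params)   # running mean of squared iterates
        self.deviations = []                     # most recent columns of D-hat
        self.max_rank = max_rank

    def update(self, flat_params):
        # Standard running-average updates for the first and second moments.
        self.mean = (self.n * self.mean + flat_params) / (self.n + 1)
        self.sq_mean = (self.n * self.sq_mean + flat_params ** 2) / (self.n + 1)
        self.n += 1
        # Keep only the most recent max_rank deviation columns.
        self.deviations.append(flat_params - self.mean)
        if len(self.deviations) > self.max_rank:
            self.deviations.pop(0)

    def sample(self):
        # theta = mean + sqrt(diag) * z1 / sqrt(2) + D @ z2 / sqrt(2 * (K - 1))
        diag_var = torch.clamp(self.sq_mean - self.mean ** 2, min=1e-30)
        z1 = torch.randn_like(self.mean)
        theta = self.mean + diag_var.sqrt() * z1 / (2 ** 0.5)
        if len(self.deviations) > 1:
            D = torch.stack(self.deviations, dim=1)
            z2 = torch.randn(D.shape[1])
            theta = theta + D @ z2 / (2 * (D.shape[1] - 1)) ** 0.5
        return theta
```

Dropping the low-rank term (e.g. `max_rank=0`) recovers SWAG-Diagonal, which is a useful sanity check against the full model.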
3. I then used this pretrained solution to evaluate SWAG models trained with different approximation ranks, from k = 1 to 30 in steps of 2. I trained each for 100 SWAG epochs and collected one sample per epoch (same as above).
The training notebook is here: https://colab.research.google.com/drive/1L2D7aAXxOdrhK-Kk3FxUs9vhsjP_vTf6?usp=sharing
The analysis notebook is here: https://colab.research.google.com/drive/1DHbudUH2BFdlJgmCopv93arsCx9advu6?usp=sharing
What I've found so far is a little surprising (though it could very well be down to an implementation error, or some other problem with my experiment, e.g. poor parameter choices):
SWAG rank versus SWAG performance on train and validation data (with N = 30 for Bayesian model averaging).
I also plotted the SWA performance (we would expect this to be essentially constant, as SWA is independent of rank).
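For completeness, the Bayesian model averaging numbers above come from averaging the predictive (softmax) distributions over posterior samples, roughly along these lines (a sketch, assuming a hypothetical `load_flat_params` helper that copies a flat parameter vector back into the network; in practice batch-norm statistics should also be re-estimated for each sample):

```python
import torch

@torch.no_grad()
def bma_accuracy(posterior, model, loader, num_samples=30, device="cpu"):
    """Accuracy of a Bayesian model average over SWAG posterior samples.

    `loader` must not shuffle, so that predictions align across samples.
    """
    model.eval()
    probs_sum = None
    for _ in range(num_samples):
        # Hypothetical helper: copy a flat parameter vector into the model.
        load_flat_params(model, posterior.sample())
        probs, labels = [], []
        for x, y in loader:
            probs.append(torch.softmax(model(x.to(device)), dim=1).cpu())
            labels.append(y)
        probs = torch.cat(probs)
        probs_sum = probs if probs_sum is None else probs_sum + probs
    preds = (probs_sum / num_samples).argmax(dim=1)
    return (preds == torch.cat(labels)).float().mean().item()
```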