tmgrgg / localvsglobaluncertainty

Empirical analysis of recent stochastic gradient methods for approximate inference in Bayesian deep learning, including SWA-Gaussian, MultiSWAG, and deep ensembles. See report_localglobal.pdf.

Experiment 1a (Simple Analysis of SWAG) #1

Open tmgrgg opened 4 years ago

tmgrgg commented 4 years ago

Analysis of performance of SWAG versus approximation rank (1a)

(Dataset: FashionMNIST, Model: DenseNet with depth 10)

I wanted to get an idea of how the SWAG posterior approximation's performance improves with rank. Since I was writing most of the code alongside running the experiments, the code is run in notebooks on a Colab GPU; I intend to standardise these so that parameterised experiments can be run from the command line. For now I'll just link the notebooks.

1. I used SGD to descend to a suitably strong mode which would act as the pretrained solution for the SWAG sampling process.

(Figure: pretrained training graph)

Final pretrained model performance is:

::: Train :::
 {'loss': 0.034561568461060524, 'accuracy': 99.41}
::: Valid :::
 {'loss': 0.23624498672485353, 'accuracy': 92.41}
::: Test :::
 {'loss': 0.2568485828399658, 'accuracy': 92.28}

Note also that I adopted the same learning rate schedule when training the initial solution as in the original paper, namely:

def schedule(lr_init, epoch, max_epochs):
    # Piecewise schedule from the SWAG paper: constant at lr_init for the
    # first half of training, linear decay to FINAL_LR between 50% and 90%
    # of training, then constant at FINAL_LR until the end.
    # FINAL_LR is a module-level constant.
    t = epoch / max_epochs
    lr_ratio = FINAL_LR / lr_init
    if t <= 0.5:
        factor = 1.0
    elif t <= 0.9:
        factor = 1.0 - (1.0 - lr_ratio) * (t - 0.5) / 0.4
    else:
        factor = lr_ratio
    return lr_init * factor
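For reference, a minimal sketch of how this schedule might drive the pretraining run; LR_INIT, MAX_EPOCHS, model, train_loader, and train_one_epoch are illustrative placeholders, not the repo's actual names:

import torch

LR_INIT = 0.1      # illustrative initial learning rate
FINAL_LR = 0.005   # the constant referenced inside schedule()
MAX_EPOCHS = 100

optimizer = torch.optim.SGD(model.parameters(), lr=LR_INIT,
                            momentum=0.9, weight_decay=1e-4)

for epoch in range(MAX_EPOCHS):
    # Recompute the LR once per epoch and push it into every param group.
    lr = schedule(LR_INIT, epoch, MAX_EPOCHS)
    for group in optimizer.param_groups:
        group['lr'] = lr
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper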

2. I then used this pretrained solution to build a SWAG model (more as a test for the Posterior and Sampling classes).

The notebook is: https://colab.research.google.com/drive/1tma5QHPAM8K9dRjBfUV0Qv_C_yiQQtwP?usp=sharing

The model in the notebook above was trained with the following parameters:

SWA_LR = 0.005                 # constant learning rate during SWAG collection
SWA_MOMENTUM = 0.85            # SGD momentum during SWAG collection
L2 = 1e-4                      # weight decay
RANK = 30                      # max rank of the low-rank covariance factor
SAMPLES_PER_EPOCH = 1          # snapshots folded into the posterior per epoch
SAMPLE_FREQ = int((1 / SAMPLES_PER_EPOCH) * len(train_set) / batch_size)
SAMPLING_CONDITION = lambda: True  # always eligible to record a snapshot
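For context, here is a minimal sketch of what the Posterior and Sampling machinery does conceptually: maintain running first and second moments of the SGD iterates plus a buffer of the last RANK deviation vectors, then sample from the resulting Gaussian (as in Maddox et al., 2019). This is an illustration, not the repo's actual classes:

import math
import torch

def flatten(model):
    # Concatenate all model parameters into a single vector.
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

class SWAGSketch:
    # Illustrative SWAG posterior; not the repo's Posterior class.

    def __init__(self, model, rank):
        d = flatten(model).numel()
        self.rank, self.n = rank, 0
        self.mean = torch.zeros(d)      # running first moment (the SWA mean)
        self.sq_mean = torch.zeros(d)   # running second moment
        self.deviations = []            # last `rank` deviation columns

    def update(self, model):
        # Fold one SGD iterate into the moments (called every SAMPLE_FREQ steps).
        w = flatten(model)
        self.n += 1
        self.mean += (w - self.mean) / self.n
        self.sq_mean += (w ** 2 - self.sq_mean) / self.n
        self.deviations.append(w - self.mean)
        if len(self.deviations) > self.rank:
            self.deviations.pop(0)

    def sample(self, k=None):
        # Draw weights from the rank-k SWAG Gaussian.
        k = k or self.rank
        var = torch.clamp(self.sq_mean - self.mean ** 2, min=1e-30)
        diag = var.sqrt() * torch.randn_like(self.mean) / math.sqrt(2.0)
        if k > 1:
            D = torch.stack(self.deviations[-k:], dim=1)   # d x k matrix
            low_rank = D @ torch.randn(k) / math.sqrt(2.0 * (k - 1))
        else:
            low_rank = 0.0  # rank 1: the low-rank term is degenerate, skip it
        return self.mean + diag + low_rank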

I plotted the SWA validation curve (the SWA solution is the expected value of the SWAG posterior) as a proxy for the validation learning curve of the SWAG solution.

(Figure: SWA training curve)

I also plotted the performance of the final SWAG model against the number of samples drawn for Bayesian model averaging:

(Figure: BMA samples vs. performance)

Looks weird. May want to investigate the code.
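For context on what "samples for Bayesian model averaging" means here: predictions are averaged over weight samples drawn from the SWAG posterior, re-fitting BatchNorm statistics after each draw. A minimal sketch, assuming the SWAGSketch class above and hypothetical set_weights/update_bn helpers (not the repo's API):

import torch

@torch.no_grad()
def bma_predict(model, posterior, loader, n_samples=30, k=None):
    # Average softmax outputs (not logits) over n_samples posterior draws.
    avg_probs = None
    for _ in range(n_samples):
        set_weights(model, posterior.sample(k))  # hypothetical helper
        update_bn(model, loader)                 # hypothetical helper: refit BatchNorm stats
        probs = torch.cat([torch.softmax(model(x), dim=-1) for x, _ in loader])
        avg_probs = probs if avg_probs is None else avg_probs + probs
    return avg_probs / n_samples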

3. I then used this pretrained solution to evaluate SWAG models trained with different approximation ranks, from k = 1 to 30 in steps of 2. I trained each for 100 SWAG epochs and drew one sample per epoch (same as above).
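Conceptually, the sweep is just a loop over ranks; a sketch under the same assumptions as the code above. Note that this truncates a single rank-30 posterior per rank, which is a simplification of the actual experiment (a separate SWAG model was trained for each rank); accuracy, valid_loader, and valid_labels are hypothetical placeholders:

# Sweep the approximation rank and record BMA validation accuracy.
results = {}
for k in range(1, 31, 2):
    probs = bma_predict(model, posterior, valid_loader, n_samples=30, k=k)
    results[k] = accuracy(probs, valid_labels)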

The training notebook is here: https://colab.research.google.com/drive/1L2D7aAXxOdrhK-Kk3FxUs9vhsjP_vTf6?usp=sharing
The analysis notebook is here: https://colab.research.google.com/drive/1DHbudUH2BFdlJgmCopv93arsCx9advu6?usp=sharing

What I've found so far is a little surprising (though it could well be down to an implementation error, or some other problem with my experiment, e.g. poor parameter choices):

SWAG rank versus SWAG performance on train and validation data (with N = 30 samples for Bayesian model averaging):

(Figure: swag_rank_v_performance_v1)

I also plotted the SWA performance (we would expect this to be essentially constant, as SWA is independent of rank):

(Figure: SWA_swag_rank_v_performance_v1)

tmgrgg commented 4 years ago

After fixing a bunch of issues and comparing my implementation against a run of the original authors' implementation to verify its correctness, I am now confident that the implementations agree, and have obtained the following results, which replace those above:

This is with 30 samples in the Bayesian model average and 50 "swag samples":

(Figure: results with 50 SWAG samples)

This is with 30 samples in the Bayesian model average and 150 "swag samples":

(Figure: results with 150 SWAG samples)

A much clearer improvement from including local uncertainty!